Installation on Windows

Prerequisites

The following libraries and tools are required to install mmtfPyspark. Choose an installation directory, for example your home directory C:\Users\USER_NAME; this directory is a placeholder for a location of your choice.

Install Anaconda

Download the Python 3.7 Anaconda installer and install Anaconda.
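
To verify the installation, launch the Anaconda Prompt from the Start Menu and check the versions (the exact version numbers on your system may differ):

conda --version

python --version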

Install Git

The Git version control system is used to download repositories from GitHub.

Download Git and run the installer (choose all default options).
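
To verify the installation, check the Git version:

git --version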

Install Gow

Gow (GNU On Windows) installs Linux command line tools on Windows. Download Gow and run the installer. For this installation, we will use the curl, gzip, mkdir, mv, and tar tools.
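
To verify that the Gow tools are available, check, for example:

curl --version

tar --version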

Install Apache Spark

In the following steps, _YOUR_DIRECTORY_ and _YOUR_SPARK_VERSION_ are placeholders; for example, _YOUR_DIRECTORY_ could be C:\spark and _YOUR_SPARK_VERSION_ could be spark-2.3.2-bin-hadoop2.7.

NOTE: Spark 2.4.0 does not run on Windows due to a bug!

Launch the Anaconda Prompt command window from the Start Menu and follow the steps below.

  1. Download Apache Spark 2.3.2
    Go to the Apache Spark download page.
    1. Choose Spark version 2.3.2
    2. Choose a package type: Pre-built for Apache Hadoop 2.7 and later
    3. Click on the Download Spark link
    4. Unpack the file in your directory:
    mkdir _YOUR_DIRECTORY_
    
    mv _YOUR_SPARK_VERSION_.tgz _YOUR_DIRECTORY_
    
    cd _YOUR_DIRECTORY_
    
    gzip -d _YOUR_SPARK_VERSION_.tgz
    
    tar xvf _YOUR_SPARK_VERSION_.tar
    
  2. Download winutils.exe into _YOUR_DIRECTORY_\_YOUR_SPARK_VERSION_\bin

    cd _YOUR_SPARK_VERSION_\bin
    
    curl -k -L -o winutils.exe https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe?raw=true
    
    cd ..
    
  3. Next, set the following environment variables.

    setx SPARK_HOME _YOUR_DIRECTORY_\_YOUR_SPARK_VERSION_
    
    setx HADOOP_HOME _YOUR_DIRECTORY_\_YOUR_SPARK_VERSION_
    

Check the Environment Variables

Close and reopen the Anaconda Prompt to update the environment variables. Type the following commands to check the environment variables.

echo %SPARK_HOME%

echo %HADOOP_HOME%
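
As an optional check that Spark itself runs, print its version; the output (Spark and Scala version numbers) varies by build:

%SPARK_HOME%\bin\spark-submit --version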

Install mmtf-pyspark

Create a Conda Environment for mmtf-pyspark

A conda environment is a directory that contains a specific collection of conda packages that you have installed. If you change one environment, your other environments are not affected. You can easily activate or deactivate environments, which is how you switch between them.

cd _YOUR_DIRECTORY_

git clone https://github.com/sbl-sdsc/mmtf-pyspark.git

cd mmtf-pyspark

conda env create -f binder/environment.yml
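
To confirm that the environment was created, list your conda environments; mmtf-pyspark should appear in the list:

conda env list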

Activate the Conda Environment

conda activate mmtf-pyspark

Test the Installation

python test_mmtfPyspark.py

If the metadata for 1AQ1 are printed, you have successfully installed mmtf-pyspark.
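
The test script downloads a single structure and prints its metadata. Below is a minimal sketch of the same workflow, assuming the mmtfReader.download_mmtf_files function used in the project's demos (see the demos directory for the current API):

from pyspark import SparkConf, SparkContext
from mmtfPyspark.io import mmtfReader

# Set up a local Spark context
conf = SparkConf().setMaster("local[*]").setAppName("mmtfTest")
sc = SparkContext(conf=conf)

# Download 1AQ1 from the MMTF web service and count the structures read
pdb = mmtfReader.download_mmtf_files(["1AQ1"], sc)
print(pdb.count())

sc.stop()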

Launch Jupyter Notebook

jupyter notebook

In Jupyter Notebook, open DataAnalysisExample.ipynb and run it.

Notebooks that demonstrate the use of the mmtf-pyspark API are available in the demos directory.

Deactivate the Conda Environment

conda deactivate

Activate the environment again if you want to use mmtf-pyspark.

Remove the Conda Environment

To permanently remove the environment, type:

conda remove -n mmtf-pyspark --all

Download Hadoop Sequence Files

The entire PDB can be downloaded as MMTF Hadoop sequence files, and the corresponding environment variables can be set, by running the following commands:

cd _YOUR_DIRECTORY_

curl -O https://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
tar -xvf full.tar

curl -O https://mmtf.rcsb.org/v1.0/hadoopfiles/reduced.tar
tar -xvf reduced.tar

Set the environment variables:

setx MMTF_FULL _YOUR_DIRECTORY_\full

setx MMTF_REDUCED _YOUR_DIRECTORY_\reduced

Close and reopen the Anaconda Prompt to update the environment variables.
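
With the environment variables set, the downloaded archives can be read directly from Python. A minimal sketch, assuming the mmtfReader.read_sequence_file function from the project's demos and that MMTF_FULL points at the extracted full directory:

import os
from pyspark import SparkConf, SparkContext
from mmtfPyspark.io import mmtfReader

# Set up a local Spark context
conf = SparkConf().setMaster("local[*]").setAppName("readFullPDB")
sc = SparkContext(conf=conf)

# Read the full PDB archive from the local MMTF Hadoop sequence file
pdb = mmtfReader.read_sequence_file(os.environ["MMTF_FULL"], sc)
print(pdb.count())  # traverses the whole archive; this may take a while

sc.stop()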