Installation on Windows


The following libraries and tools are required to install mmtfPyspark. Choose an installation directory, for example your home directory C:\Users\USER_NAME. This directory is a placeholder for a location of your choice.

Install Anaconda

Download the Python 3.7 Anaconda installer and install Anaconda.

Install Git

The Git version control system is used to download repositories from GitHub.

Download Git and run the installer (choose all default options).

Install Gow

Gow (GNU on Windows) installs Linux command line tools on Windows. This installation uses the curl, gzip, mkdir, mv, and tar tools.

Install Apache Spark

In the following steps, _YOUR_DIRECTORY_ could be C:\spark and _YOUR_SPARK_VERSION_ could be spark-2.3.2-bin-hadoop2.7.

NOTE: Spark 2.4.0 does not run on Windows due to a bug.

Launch the Anaconda Prompt command window from the Start Menu and follow the instructions.

  1. Download Apache Spark 2.3.2
    Go to the Apache Spark website link
    1. Choose Spark version 2.3.2
    2. Choose a package type: Pre-built for Apache Hadoop 2.7 and later
    3. Click on the Download Spark link
    4. Unzip the file in your directory:
    mkdir _YOUR_DIRECTORY_
    gzip -d _YOUR_SPARK_VERSION_.tgz
    tar xvf _YOUR_SPARK_VERSION_.tar
  2. Download winutils.exe into `_YOUR_DIRECTORY_\_YOUR_SPARK_VERSION_\bin`

    curl -k -L -o winutils.exe
    cd ..
  3. Next, set the following environment variables.
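The variables to set are not spelled out above; a minimal sketch, assuming the SPARK_HOME/HADOOP_HOME convention commonly used for Spark on Windows (the values are placeholders for your actual install location):

```shell
:: Hypothetical values -- replace with the directory where you unpacked Spark.
:: HADOOP_HOME must point at the directory containing bin\winutils.exe.
setx SPARK_HOME _YOUR_DIRECTORY_\_YOUR_SPARK_VERSION_
setx HADOOP_HOME _YOUR_DIRECTORY_\_YOUR_SPARK_VERSION_
setx PATH "%PATH%;%SPARK_HOME%\bin"
```

Note that setx writes the variables for future command windows; the current Anaconda Prompt will not see them until it is reopened.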


Check the Environment Variables

Close and reopen the Anaconda Prompt to update the environment variables. Type the following commands to check the environment variables.
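As a sketch, assuming the SPARK_HOME and HADOOP_HOME variable names from the previous step (substitute whichever names you actually set), the check could look like:

```shell
:: Each echo should print the directory you configured, not the literal %NAME%.
echo %SPARK_HOME%
echo %HADOOP_HOME%
:: If PATH was updated correctly, this prints the Spark version banner.
spark-submit --version
```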



Install mmtf-pyspark

Create a Conda Environment for mmtf-pyspark

A conda environment is a directory that contains a specific collection of conda packages that you have installed. If you change one environment, your other environments are not affected. You can easily activate or deactivate environments, which is how you switch between them.


git clone
cd mmtf-pyspark
conda env create -f binder/environment.yml

Activate the Conda Environment

conda activate mmtf-pyspark

Test the Installation
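The installation can be verified with a short script. The snippet below is a sketch, not the project's official test: it assumes the mmtfReader API and local-mode Spark setup shown in the mmtf-pyspark tutorials.

```python
# Hypothetical smoke test -- run inside the activated mmtf-pyspark environment.
from pyspark import SparkConf, SparkContext
from mmtfPyspark.io import mmtfReader

conf = SparkConf().setMaster("local[*]").setAppName("InstallTest")
sc = SparkContext(conf=conf)

# Download structure 1AQ1 from the MMTF web service and print its metadata.
pdb = mmtfReader.download_mmtf_files(["1AQ1"], sc)
print(pdb.keys().collect())

sc.stop()
```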


If the metadata for 1AQ1 are printed, you have successfully installed mmtf-pyspark.

Launch Jupyter Notebook

jupyter notebook

In Jupyter Notebook, open DataAnalysisExample.ipynb and run it.

Notebooks that demonstrate the use of the mmtf-pyspark API are available in the demos directory.

Deactivate the Conda Environment

conda deactivate

Activate the environment again if you want to use mmtf-pyspark.

Remove the Conda Environment

To permanently remove the environment, type:

conda remove -n mmtf-pyspark --all

Download Hadoop Sequence Files

The entire PDB can be downloaded as MMTF Hadoop sequence files (full and reduced representations) by running the following commands:


curl -O
tar -xvf full.tar

curl -O
tar -xvf reduced.tar

Set the environment variables:
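A minimal sketch, assuming the MMTF_FULL and MMTF_REDUCED variable names used by mmtf-pyspark's readers; the paths are placeholders for wherever you extracted the two archives:

```shell
:: Placeholder paths -- point these at the extracted full and reduced directories.
setx MMTF_FULL C:\Users\USER_NAME\full
setx MMTF_REDUCED C:\Users\USER_NAME\reduced
```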



Close and reopen the Anaconda Prompt to update the environment variables.