The following libraries and tools are required to install mmtfPyspark.
Choose an installation directory, for example your home directory C:\Users\USER_NAME. This path is a placeholder; substitute any location of your choice.
Download the Python 3.7 Anaconda installer and install Anaconda.
The Git version control system is used to download repositories from Github.
Download Git and run the installer (choose all default options).
Gow installs Linux command line tools on Windows. For this install, we will use the curl, gzip, mkdir, mv, and tar tools.
As an example in the following steps, _YOUR_DIRECTORY_ could be C:\spark and _YOUR_SPARK_VERSION_ could be spark-2.3.2-bin-hadoop2.7.
NOTE: Spark 2.4.0 does not run on Windows due to a bug!
Launch the Anaconda Prompt command window from the Start Menu and follow the instructions.
mkdir _YOUR_DIRECTORY_
mv _YOUR_SPARK_VERSION_.tgz _YOUR_DIRECTORY_
cd _YOUR_DIRECTORY_
gzip -d _YOUR_SPARK_VERSION_.tgz
tar xvf _YOUR_SPARK_VERSION_.tar
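If the Gow tools are unavailable, the same unpacking can be done with Python's standard library. This is an illustrative sketch (the `extract_tgz` helper and the commented-out paths are examples, not part of mmtf-pyspark); it performs the equivalent of the gzip and tar commands above in one step.

```python
import tarfile

def extract_tgz(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive (equivalent to gzip -d + tar xvf)."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dest_dir)

# Example with placeholder paths -- adjust to your setup:
# extract_tgz(r"C:\spark\spark-2.3.2-bin-hadoop2.7.tgz", r"C:\spark")
```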
Download winutils.exe into _YOUR_DIRECTORY_\_YOUR_SPARK_VERSION_\bin
cd _YOUR_SPARK_VERSION_\bin
curl -k -L -o winutils.exe https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe?raw=true
cd ..
Next, set the following environment variables.
setx SPARK_HOME _YOUR_DIRECTORY_\_YOUR_SPARK_VERSION_
setx HADOOP_HOME _YOUR_DIRECTORY_\_YOUR_SPARK_VERSION_
Close and reopen the Anaconda Prompt to update the environment variables. Type the following commands to check the environment variables.
echo %SPARK_HOME%
echo %HADOOP_HOME%
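The same check can also be done from Python, which is how Spark itself will see the variables. This is a sketch; the `check_env` helper is illustrative and not part of mmtf-pyspark.

```python
import os

def check_env(name):
    """Report whether an environment variable is visible to this process."""
    value = os.environ.get(name, "")
    if not value:
        # setx only affects newly opened command windows, so an empty value
        # usually means the prompt was not reopened after running setx.
        return f"{name} is not set"
    return f"{name}={value}"

print(check_env("SPARK_HOME"))
print(check_env("HADOOP_HOME"))
```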
A conda environment is a directory that contains a specific collection of conda packages that you have installed. If you change one environment, your other environments are not affected. You can easily activate or deactivate environments, which is how you switch between them.
cd _YOUR_DIRECTORY_
git clone https://github.com/sbl-sdsc/mmtf-pyspark.git
cd mmtf-pyspark
conda env create -f binder/environment.yml
conda activate mmtf-pyspark
python test_mmtfPyspark.py
If the metadata for 1AQ1 are printed, you have successfully installed mmtf-pyspark.
jupyter notebook
In Jupyter Notebook, open DataAnalysisExample.ipynb and run it.
Notebooks that demonstrate the use of the mmtf-pyspark API are available in the demos directory.
conda deactivate
Activate the environment again if you want to use mmtf-pyspark.
To permanently remove the environment, type:
conda remove -n mmtf-pyspark --all
The entire PDB can be downloaded as MMTF Hadoop sequence files, and the corresponding environment variables can be set, by running the following commands:
cd _YOUR_DIRECTORY_
curl -O https://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
tar -xvf full.tar
curl -O https://mmtf.rcsb.org/v1.0/hadoopfiles/reduced.tar
tar -xvf reduced.tar
Set the environment variables:
setx MMTF_FULL _YOUR_DIRECTORY_\full
setx MMTF_REDUCED _YOUR_DIRECTORY_\reduced
Close and reopen the Anaconda Prompt to update the environment variables.
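Before using the downloaded data in a notebook, it can be worth verifying that the two variables point at real directories. This is an illustrative sketch; the `mmtf_dir` helper is hypothetical, and the full/ and reduced/ directory names follow the tar files extracted above.

```python
import os

def mmtf_dir(env_name):
    """Return the configured MMTF data path and whether it exists on disk."""
    path = os.environ.get(env_name, "")
    return path, os.path.isdir(path)

for name in ("MMTF_FULL", "MMTF_REDUCED"):
    path, ok = mmtf_dir(name)
    print(name, path or "(not set)", "OK" if ok else "missing")
```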