Polypeptide Chain Statistics Example

This example demonstrates how to extract protein chains from PDB entries. It uses a flatMap function to transform each structure into its polymer chains.
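To make the flatMap step concrete, here is a minimal sketch in plain Python, without Spark, of how a flat-map differs from a map. The toy data, the key format, and the helper `split_into_chains` are hypothetical and are not part of mmtfPyspark; they only imitate the idea of one structure yielding several chain records.

```python
from itertools import chain

# Hypothetical toy data: (structure ID, list of (chain ID, number of groups)).
structures = [
    ("1ABC", [("A", 120), ("B", 98)]),
    ("2XYZ", [("A", 45)]),
]

def split_into_chains(entry):
    """Return one (key, num_groups) pair per chain of a structure."""
    structure_id, chains = entry
    return [(f"{structure_id}.{chain_id}", n) for chain_id, n in chains]

# A plain map keeps one list per structure (nested result).
mapped = [split_into_chains(s) for s in structures]

# A flat-map concatenates those lists into a single sequence of chains.
flat_mapped = list(chain.from_iterable(mapped))

print(len(mapped), "structures ->", len(flat_mapped), "chains")
```

In the Spark pipeline further down, `flatMap` plays the role of the flattening step, so every downstream operation sees individual chains rather than whole structures.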

Imports

In [1]:
from pyspark import SparkConf, SparkContext
from mmtfPyspark.filters import PolymerComposition
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.mappers import StructureToPolymerChains

Configure Spark

In [2]:
conf = SparkConf().setMaster("local[*]") \
                  .setAppName("polypeptideChainStats")
sc = SparkContext(conf=conf)

Read MMTF files, flatMap to polymer chains, filter by polymer composition, and get the number of groups per chain

In [4]:
path = "../../resources/mmtf_full_sample/"

chainLengths = mmtfReader.read_sequence_file(path, sc) \
                         .flatMap(StructureToPolymerChains(False, True)) \
                         .filter(PolymerComposition(PolymerComposition.AMINO_ACIDS_20)) \
                         .map(lambda t: t[1].num_groups) \
                         .cache()
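The pipeline above leaves an RDD of chain lengths but does not summarize it. As a rough sketch of one way to do so, the block below computes basic statistics in plain Python. The list `lengths` is a hypothetical stand-in for what `chainLengths.collect()` would return, and its values are made up for illustration. On the RDD itself, PySpark's `stats()` method offers a comparable summary without collecting the data to the driver.

```python
import statistics

# Hypothetical stand-in for chainLengths.collect(); values are invented.
lengths = [120, 98, 45, 310, 76]

# Basic summary of the chain lengths (number of groups per chain).
summary = {
    "count": len(lengths),
    "min": min(lengths),
    "max": max(lengths),
    "mean": statistics.mean(lengths),
    "median": statistics.median(lengths),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Collecting to a Python list is only reasonable for small samples; for a full PDB-sized dataset, keeping the computation on the RDD is likely the better choice.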

Terminate Spark

In [6]:
sc.stop()