Simple example of reading an MMTF Hadoop Sequence file, filtering the entries by polymer chain type (keeping entries with DNA or RNA chains and excluding those with L-protein or D-saccharide chains), and counting the number of entries. This example also shows how methods can be chained for a more concise syntax.
In [1]:
from pyspark import SparkConf, SparkContext
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.filters import *
from mmtfPyspark.structureViewer import view_structure
In [2]:
conf = SparkConf().setMaster("local[*]") \
                  .setAppName("FilterByPolymerChainType")
sc = SparkContext(conf=conf)
In [3]:
path = "../../resources/mmtf_reduced_sample/"
structures = mmtfReader.read_sequence_file(path, sc) \
                       .filter(ContainsPolymerChainType("DNA LINKING", ContainsPolymerChainType.RNA_LINKING)) \
                       .filter(NotFilter(ContainsLProteinChain())) \
                       .filter(NotFilter(ContainsDSaccharideChain()))
print(f"Number of pure DNA and RNA entries: {structures.count()}")
Number of pure DNA and RNA entries: 227
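The chained filters above can be read as a sequence of predicates, each one narrowing the result of the previous step. The following is a minimal pure-Python sketch of that idea; it is not mmtfPyspark's implementation. The classes `ContainsChainType` and `Not`, the toy `entries` list, and the "at least one matching chain" rule are all assumptions made for illustration, and the chain type strings other than "DNA LINKING" are placeholders.

```python
# Hypothetical stand-ins for illustration only; these are NOT the mmtfPyspark classes.
class ContainsChainType:
    """Callable predicate: true if an entry has at least one chain of the given types.

    The "at least one" rule is an assumption of this sketch.
    """

    def __init__(self, *chain_types):
        self.chain_types = set(chain_types)

    def __call__(self, entry):
        _, chain_types = entry  # an entry is a (structure_id, chain_types) pair
        return any(t in self.chain_types for t in chain_types)


class Not:
    """Callable predicate that negates another predicate."""

    def __init__(self, predicate):
        self.predicate = predicate

    def __call__(self, entry):
        return not self.predicate(entry)


# Made-up toy data: (structure id, list of chain types in that structure).
entries = [
    ("A", ["DNA LINKING"]),
    ("B", ["DNA LINKING", "L-PEPTIDE LINKING"]),
    ("C", ["RNA LINKING"]),
    ("D", ["L-PEPTIDE LINKING"]),
    ("E", ["RNA LINKING", "D-SACCHARIDE"]),
]

# Each filter step consumes the output of the previous one, like the chained calls above.
step1 = filter(ContainsChainType("DNA LINKING", "RNA LINKING"), entries)
step2 = filter(Not(ContainsChainType("L-PEPTIDE LINKING")), step1)
step3 = filter(Not(ContainsChainType("D-SACCHARIDE")), step2)

pure_nucleic_acid_ids = [entry_id for entry_id, _ in step3]
print(pure_nucleic_acid_ids)
```

In this toy data only "A" and "C" survive all three steps: "B" and "E" contain a nucleic acid chain but are removed by the negated filters, and "D" never passes the first filter.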
In [4]:
structure_names = structures.keys().collect()
view_structure(structure_names, style='sphere')
Out[4]:
<function mmtfPyspark.structureViewer.view_structure.<locals>.view3d>
In [5]:
sc.stop()