Create Representative Set Demo

Creates an MMTF-Hadoop Sequence file for a PISCES representative set of protein chains.

References

Please cite the following in any work that uses lists provided by PISCES:

G. Wang and R. L. Dunbrack, Jr. PISCES: a protein sequence culling server. Bioinformatics, 19:1589-1591, 2003.

Imports

In [1]:
from pyspark import SparkConf, SparkContext
from mmtfPyspark.io import mmtfReader, mmtfWriter
from mmtfPyspark.mappers import StructureToPolymerChains
from mmtfPyspark.filters import PolymerComposition
from mmtfPyspark.webfilters import Pisces

Configure Spark

In [2]:
conf = SparkConf().setMaster("local[*]") \
                  .setAppName("CreateRepresentativeSetDemo")
sc = SparkContext(conf=conf)

Read in Hadoop Sequence Files

In [6]:
path = "../../resources/mmtf_full_sample/"

# read a sample of PDB structures in MMTF-Hadoop Sequence file format
pdb = mmtfReader.read_sequence_file(path, sc)

Filter by representative protein chains at 40% sequence identity and 2.0 Å resolution

In [7]:
sequenceIdentity = 40
resolution = 2.0

# apply the Pisces filter at the structure level, split each entry into
# single polymer chains, apply Pisces again at the chain level, and keep
# only chains composed entirely of the 20 standard amino acids
pdb = pdb.filter(Pisces(sequenceIdentity, resolution)) \
         .flatMap(StructureToPolymerChains()) \
         .filter(Pisces(sequenceIdentity, resolution)) \
         .filter(PolymerComposition(PolymerComposition.AMINO_ACIDS_20))
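The idea behind the final `PolymerComposition` filter can be sketched in plain Python: a chain passes only if every residue is one of the 20 standard amino acids. The `contains_only_standard_aa` helper below is illustrative only and not part of mmtfPyspark, which performs this check on the chain's residue codes internally:

```python
# The 20 standard amino acids (one-letter codes)
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def contains_only_standard_aa(sequence):
    """Return True if every residue is one of the 20 standard amino acids."""
    return bool(sequence) and set(sequence) <= STANDARD_AA

print(contains_only_standard_aa("MKTAYIAKQR"))  # True
print(contains_only_standard_aa("MKTXYIAKQR"))  # False: 'X' is non-standard
```

Chains containing modified or unknown residues are dropped rather than repaired, which keeps the representative set restricted to standard protein sequences.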

Show top 10 structures

In [8]:
pdb.top(10)
Out[8]:
[('1FYE.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef707b1a58>),
 ('1FXL.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef702727f0>),
 ('1FVI.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef701a12b0>),
 ('1FV1.F', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef5d39d0f0>),
 ('1FTR.D', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef5d341048>),
 ('1FT5.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef43e000b8>),
 ('1FSG.C', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef43caf0b8>),
 ('1FS1.C', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef43ad1438>),
 ('1FR3.L', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef43aa2c18>),
 ('1FPZ.C', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef43a19358>)]
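Each key above has the form `PDBID.chainName`, produced when `StructureToPolymerChains` splits entries into single chains. A small sketch shows how such keys can be taken apart; the `split_chain_id` helper is hypothetical, not an mmtfPyspark function:

```python
def split_chain_id(structure_id):
    """Split a 'PDBID.chainName' key into its PDB ID and chain name."""
    pdb_id, chain_name = structure_id.split(".", 1)
    return pdb_id, chain_name

print(split_chain_id("1FYE.A"))  # ('1FYE', 'A')
```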

Save representative set

In [9]:
write_path = f'./pdb_representatives_{sequenceIdentity}'

mmtfWriter.write_sequence_file(write_path, sc, pdb)

Terminate Spark

In [3]:
sc.stop()