{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Secondary Structure Blosum62 Encoder Demo\n", "\n", "This demo creates a dataset of sequence segments derived from a non-redundant set of protein chains. Each row contains the sequence segment, the DSSP Q8 and DSSP Q3 codes of the segment's center residue, and a Blosum62 encoding of the sequence segment.\n", "\n", "## Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from pyspark import SparkConf, SparkContext, SQLContext\n", "from mmtfPyspark.ml import ProteinSequenceEncoder\n", "from mmtfPyspark.mappers import StructureToPolymerChains\n", "from mmtfPyspark.filters import ContainsLProteinChain\n", "from mmtfPyspark.datasets import secondaryStructureSegmentExtractor\n", "from mmtfPyspark.webfilters import Pisces\n", "from mmtfPyspark.io import mmtfReader" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure Spark Context" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Run Spark locally, using all available cores\n", "conf = SparkConf() \\\n", "    .setMaster(\"local[*]\") \\\n", "    .setAppName(\"SecondaryStructureBlosumEncoderDemo\")\n", "\n", "sc = SparkContext(conf=conf)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Read MMTF Hadoop sequence file\n", "\n", "Create a non-redundant set (<= 20% sequence identity) of L-protein chains." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "path = \"../../resources/mmtf_reduced_sample/\"\n", "sequenceIdentity = 20  # Pisces sequence identity cutoff (%)\n", "resolution = 2.0  # Pisces resolution cutoff\n", "fraction = 0.1  # fraction of chains to sample\n", "seed = 123  # random seed for reproducible sampling\n", "\n", "# Split structures into polymer chains, keep the non-redundant\n", "# L-protein chains, and sample a fraction without replacement\n", "pdb = mmtfReader \\\n", "    .read_sequence_file(path, sc) \\\n", "    .flatMap(StructureToPolymerChains()) \\\n", "    .filter(Pisces(sequenceIdentity, resolution)) \\\n", "    .filter(ContainsLProteinChain()) \\\n", "    .sample(False, fraction, seed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extract sequence segments" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "original data : 2149\n" ] } ], "source": [ "# Length of each sequence segment; the labels refer to its center residue\n", "segmentLength = 11\n", "data = secondaryStructureSegmentExtractor.get_dataset(pdb, segmentLength).cache()\n", "print(f\"original data : {data.count()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Drop Q3 and sequence duplicates" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "- duplicate Q3/seq : 2149\n" ] } ], "source": [ "# Keep one row per (Q3 label, sequence) pair\n", "data = data.dropDuplicates([\"labelQ3\", \"sequence\"]).cache()\n", "print(f\"- duplicate Q3/seq : {data.count()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Drop sequence duplicates" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "- duplicate seq : 2149\n" ] } ], "source": [ "# Keep one row per sequence, regardless of label\n", "data = data.dropDuplicates([\"sequence\"])\n", "print(f\"- duplicate seq : {data.count()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Blosum62 Encoding" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- structureChainId: string (nullable = true)\n", " |-- sequence: string (nullable = 
false)\n", " |-- labelQ8: string (nullable = true)\n", " |-- labelQ3: string (nullable = true)\n", " |-- features: vector (nullable = true)\n", "\n", "+----------------+-----------+-------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", "|structureChainId|sequence |labelQ8|labelQ3|features |\n", 
"+----------------+-----------+-------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", "|1A9X.F |DIDTRKLTRLL|H |H 
|[-2.0,-2.0,1.0,6.0,-3.0,0.0,2.0,-1.0,-1.0,-3.0,-4.0,-1.0,-3.0,-3.0,-1.0,0.0,-1.0,-4.0,-3.0,-3.0,-1.0,-3.0,-3.0,-3.0,-1.0,-3.0,-3.0,-4.0,-3.0,4.0,2.0,-3.0,1.0,0.0,-3.0,-2.0,-1.0,-3.0,-1.0,3.0,-2.0,-2.0,1.0,6.0,-3.0,0.0,2.0,-1.0,-1.0,-3.0,-4.0,-1.0,-3.0,-3.0,-1.0,0.0,-1.0,-4.0,-3.0,-3.0,0.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-2.0,-2.0,-1.0,-1.0,-1.0,-1.0,-2.0,-1.0,1.0,5.0,-2.0,-2.0,0.0,-1.0,5.0,0.0,-2.0,-3.0,1.0,0.0,-2.0,0.0,-3.0,-2.0,2.0,-1.0,-3.0,-2.0,-1.0,-1.0,-3.0,-2.0,-3.0,-1.0,2.0,0.0,-1.0,-3.0,1.0,1.0,-2.0,-1.0,-3.0,-2.0,5.0,-1.0,-3.0,-1.0,0.0,-1.0,-3.0,-2.0,-2.0,-1.0,-2.0,-3.0,-4.0,-1.0,-2.0,-3.0,-4.0,-3.0,2.0,4.0,-2.0,2.0,0.0,-3.0,-2.0,-1.0,-2.0,-1.0,1.0,0.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-2.0,-2.0,-1.0,-1.0,-1.0,-1.0,-2.0,-1.0,1.0,5.0,-2.0,-2.0,0.0,-1.0,5.0,0.0,-2.0,-3.0,1.0,0.0,-2.0,0.0,-3.0,-2.0,2.0,-1.0,-3.0,-2.0,-1.0,-1.0,-3.0,-2.0,-3.0,-1.0,-2.0,-3.0,-4.0,-1.0,-2.0,-3.0,-4.0,-3.0,2.0,4.0,-2.0,2.0,0.0,-3.0,-2.0,-1.0,-2.0,-1.0,1.0,-1.0,-2.0,-3.0,-4.0,-1.0,-2.0,-3.0,-4.0,-3.0,2.0,4.0,-2.0,2.0,0.0,-3.0,-2.0,-1.0,-2.0,-1.0,1.0]|\n", "|1FO8.A |DLEVAPDFFEY|T |C 
|[-2.0,-2.0,1.0,6.0,-3.0,0.0,2.0,-1.0,-1.0,-3.0,-4.0,-1.0,-3.0,-3.0,-1.0,0.0,-1.0,-4.0,-3.0,-3.0,-1.0,-2.0,-3.0,-4.0,-1.0,-2.0,-3.0,-4.0,-3.0,2.0,4.0,-2.0,2.0,0.0,-3.0,-2.0,-1.0,-2.0,-1.0,1.0,-1.0,0.0,0.0,2.0,-4.0,2.0,5.0,-2.0,0.0,-3.0,-3.0,1.0,-2.0,-3.0,-1.0,0.0,-1.0,-3.0,-2.0,-2.0,0.0,-3.0,-3.0,-3.0,-1.0,-2.0,-2.0,-3.0,-3.0,3.0,1.0,-2.0,1.0,-1.0,-2.0,-2.0,0.0,-3.0,-1.0,4.0,4.0,-1.0,-2.0,-2.0,0.0,-1.0,-1.0,0.0,-2.0,-1.0,-1.0,-1.0,-1.0,-2.0,-1.0,1.0,0.0,-3.0,-2.0,0.0,-1.0,-2.0,-2.0,-1.0,-3.0,-1.0,-1.0,-2.0,-2.0,-3.0,-3.0,-1.0,-2.0,-4.0,7.0,-1.0,-1.0,-4.0,-3.0,-2.0,-2.0,-2.0,1.0,6.0,-3.0,0.0,2.0,-1.0,-1.0,-3.0,-4.0,-1.0,-3.0,-3.0,-1.0,0.0,-1.0,-4.0,-3.0,-3.0,-2.0,-3.0,-3.0,-3.0,-2.0,-3.0,-3.0,-3.0,-1.0,0.0,0.0,-3.0,0.0,6.0,-4.0,-2.0,-2.0,1.0,3.0,-1.0,-2.0,-3.0,-3.0,-3.0,-2.0,-3.0,-3.0,-3.0,-1.0,0.0,0.0,-3.0,0.0,6.0,-4.0,-2.0,-2.0,1.0,3.0,-1.0,-1.0,0.0,0.0,2.0,-4.0,2.0,5.0,-2.0,0.0,-3.0,-3.0,1.0,-2.0,-3.0,-1.0,0.0,-1.0,-3.0,-2.0,-2.0,-2.0,-2.0,-2.0,-3.0,-2.0,-1.0,-2.0,-3.0,2.0,-1.0,-1.0,-2.0,-1.0,3.0,-3.0,-2.0,-2.0,2.0,7.0,-1.0] |\n", "|1A9X.F |EDLSSYLKRHN|H |H 
|[-1.0,0.0,0.0,2.0,-4.0,2.0,5.0,-2.0,0.0,-3.0,-3.0,1.0,-2.0,-3.0,-1.0,0.0,-1.0,-3.0,-2.0,-2.0,-2.0,-2.0,1.0,6.0,-3.0,0.0,2.0,-1.0,-1.0,-3.0,-4.0,-1.0,-3.0,-3.0,-1.0,0.0,-1.0,-4.0,-3.0,-3.0,-1.0,-2.0,-3.0,-4.0,-1.0,-2.0,-3.0,-4.0,-3.0,2.0,4.0,-2.0,2.0,0.0,-3.0,-2.0,-1.0,-2.0,-1.0,1.0,1.0,-1.0,1.0,0.0,-1.0,0.0,0.0,0.0,-1.0,-2.0,-2.0,0.0,-1.0,-2.0,-1.0,4.0,1.0,-3.0,-2.0,-2.0,1.0,-1.0,1.0,0.0,-1.0,0.0,0.0,0.0,-1.0,-2.0,-2.0,0.0,-1.0,-2.0,-1.0,4.0,1.0,-3.0,-2.0,-2.0,-2.0,-2.0,-2.0,-3.0,-2.0,-1.0,-2.0,-3.0,2.0,-1.0,-1.0,-2.0,-1.0,3.0,-3.0,-2.0,-2.0,2.0,7.0,-1.0,-1.0,-2.0,-3.0,-4.0,-1.0,-2.0,-3.0,-4.0,-3.0,2.0,4.0,-2.0,2.0,0.0,-3.0,-2.0,-1.0,-2.0,-1.0,1.0,-1.0,2.0,0.0,-1.0,-3.0,1.0,1.0,-2.0,-1.0,-3.0,-2.0,5.0,-1.0,-3.0,-1.0,0.0,-1.0,-3.0,-2.0,-2.0,-1.0,5.0,0.0,-2.0,-3.0,1.0,0.0,-2.0,0.0,-3.0,-2.0,2.0,-1.0,-3.0,-2.0,-1.0,-1.0,-3.0,-2.0,-3.0,-2.0,0.0,1.0,-1.0,-3.0,0.0,0.0,-2.0,8.0,-3.0,-3.0,-1.0,-2.0,-1.0,-2.0,-1.0,-2.0,-2.0,2.0,-3.0,-2.0,0.0,6.0,1.0,-3.0,0.0,0.0,0.0,1.0,-3.0,-3.0,0.0,-2.0,-3.0,-2.0,1.0,0.0,-4.0,-2.0,-3.0] |\n", "|1EB6.A |GDESKFEEYFK|H |H 
|[0.0,-2.0,0.0,-1.0,-3.0,-2.0,-2.0,6.0,-2.0,-4.0,-4.0,-2.0,-3.0,-3.0,-2.0,0.0,-2.0,-2.0,-3.0,-3.0,-2.0,-2.0,1.0,6.0,-3.0,0.0,2.0,-1.0,-1.0,-3.0,-4.0,-1.0,-3.0,-3.0,-1.0,0.0,-1.0,-4.0,-3.0,-3.0,-1.0,0.0,0.0,2.0,-4.0,2.0,5.0,-2.0,0.0,-3.0,-3.0,1.0,-2.0,-3.0,-1.0,0.0,-1.0,-3.0,-2.0,-2.0,1.0,-1.0,1.0,0.0,-1.0,0.0,0.0,0.0,-1.0,-2.0,-2.0,0.0,-1.0,-2.0,-1.0,4.0,1.0,-3.0,-2.0,-2.0,-1.0,2.0,0.0,-1.0,-3.0,1.0,1.0,-2.0,-1.0,-3.0,-2.0,5.0,-1.0,-3.0,-1.0,0.0,-1.0,-3.0,-2.0,-2.0,-2.0,-3.0,-3.0,-3.0,-2.0,-3.0,-3.0,-3.0,-1.0,0.0,0.0,-3.0,0.0,6.0,-4.0,-2.0,-2.0,1.0,3.0,-1.0,-1.0,0.0,0.0,2.0,-4.0,2.0,5.0,-2.0,0.0,-3.0,-3.0,1.0,-2.0,-3.0,-1.0,0.0,-1.0,-3.0,-2.0,-2.0,-1.0,0.0,0.0,2.0,-4.0,2.0,5.0,-2.0,0.0,-3.0,-3.0,1.0,-2.0,-3.0,-1.0,0.0,-1.0,-3.0,-2.0,-2.0,-2.0,-2.0,-2.0,-3.0,-2.0,-1.0,-2.0,-3.0,2.0,-1.0,-1.0,-2.0,-1.0,3.0,-3.0,-2.0,-2.0,2.0,7.0,-1.0,-2.0,-3.0,-3.0,-3.0,-2.0,-3.0,-3.0,-3.0,-1.0,0.0,0.0,-3.0,0.0,6.0,-4.0,-2.0,-2.0,1.0,3.0,-1.0,-1.0,2.0,0.0,-1.0,-3.0,1.0,1.0,-2.0,-1.0,-3.0,-2.0,5.0,-1.0,-3.0,-1.0,0.0,-1.0,-3.0,-2.0,-2.0] |\n", "|1C1K.A |IISFETFILLD|H |H 
|[-1.0,-3.0,-3.0,-3.0,-1.0,-3.0,-3.0,-4.0,-3.0,4.0,2.0,-3.0,1.0,0.0,-3.0,-2.0,-1.0,-3.0,-1.0,3.0,-1.0,-3.0,-3.0,-3.0,-1.0,-3.0,-3.0,-4.0,-3.0,4.0,2.0,-3.0,1.0,0.0,-3.0,-2.0,-1.0,-3.0,-1.0,3.0,1.0,-1.0,1.0,0.0,-1.0,0.0,0.0,0.0,-1.0,-2.0,-2.0,0.0,-1.0,-2.0,-1.0,4.0,1.0,-3.0,-2.0,-2.0,-2.0,-3.0,-3.0,-3.0,-2.0,-3.0,-3.0,-3.0,-1.0,0.0,0.0,-3.0,0.0,6.0,-4.0,-2.0,-2.0,1.0,3.0,-1.0,-1.0,0.0,0.0,2.0,-4.0,2.0,5.0,-2.0,0.0,-3.0,-3.0,1.0,-2.0,-3.0,-1.0,0.0,-1.0,-3.0,-2.0,-2.0,0.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-2.0,-2.0,-1.0,-1.0,-1.0,-1.0,-2.0,-1.0,1.0,5.0,-2.0,-2.0,0.0,-2.0,-3.0,-3.0,-3.0,-2.0,-3.0,-3.0,-3.0,-1.0,0.0,0.0,-3.0,0.0,6.0,-4.0,-2.0,-2.0,1.0,3.0,-1.0,-1.0,-3.0,-3.0,-3.0,-1.0,-3.0,-3.0,-4.0,-3.0,4.0,2.0,-3.0,1.0,0.0,-3.0,-2.0,-1.0,-3.0,-1.0,3.0,-1.0,-2.0,-3.0,-4.0,-1.0,-2.0,-3.0,-4.0,-3.0,2.0,4.0,-2.0,2.0,0.0,-3.0,-2.0,-1.0,-2.0,-1.0,1.0,-1.0,-2.0,-3.0,-4.0,-1.0,-2.0,-3.0,-4.0,-3.0,2.0,4.0,-2.0,2.0,0.0,-3.0,-2.0,-1.0,-2.0,-1.0,1.0,-2.0,-2.0,1.0,6.0,-3.0,0.0,2.0,-1.0,-1.0,-3.0,-4.0,-1.0,-3.0,-3.0,-1.0,0.0,-1.0,-4.0,-3.0,-3.0] |\n", 
"+----------------+-----------+-------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "encoder = ProteinSequenceEncoder(data)\n", "data = encoder.blosum62_encode()\n", "\n", "data.printSchema()\n", "data.show(5, False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Terminate Spark Context" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "sc.stop()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }