{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SIFTS Data Demo\n", "\n", "This demo shows how to query PDB annotations from the SIFTS project.\n", "\n", "The \"Structure Integration with Function, Taxonomy and Sequence\" is the authoritative source of up-to-date residue-level annotation of structures in the PDB with data available in Uniprot, IntEnz, CATH, SCOP, GO, InterPro, Pfam and PubMed. [link to SIFTS](https://www.ebi.ac.uk/pdbe/docs/sifts/overview.html)\n", "\n", "Data are probided through [Mine 2 SQL](https://pdbj.org/help/mine2-sql)\n", "\n", "Queries can be designed with the interactive [PDBj Mine 2 query service](https://pdbj.org/help/mine2-sql)\n", "\n", "## Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "from pyspark.sql.functions import col\n", "from mmtfPyspark.datasets import pdbjMineDataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure Spark" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "spark = SparkSession.builder\\\n", " .master(\"local[*]\")\\\n", " .appName(\"SIFTSDataDemo\")\\\n", " .getOrCreate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get PDB entry to PubMed Id mappings" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First 10 results for query: SELECT * FROM sifts.pdb_pubmed LIMIT 10\n", "+-----------+-------+---------+\n", "|structureId|ordinal|pubmed_id|\n", "+-----------+-------+---------+\n", "| 100D| 0| 7816639|\n", "| 101D| 0| 7711020|\n", "| 102D| 0| 7608897|\n", "| 102L| 0| 8429913|\n", "| 103D| 0| 7966337|\n", "| 103L| 0| 8429913|\n", "| 104D| 0| 7857947|\n", "| 104L| 0| 8429913|\n", "| 105D| 0| 7743125|\n", "| 106D| 0| 7743125|\n", "+-----------+-------+---------+\n", "\n" ] } ], "source": [ "pubmedQuery = \"SELECT * FROM sifts.pdb_pubmed LIMIT 10\"\n", "pubmed = pdbjMineDataset.get_dataset(pubmedQuery)\n", "print(f\"First 10 results for query: {pubmedQuery}\")\n", "pubmed.show(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get PDB chain to InterPro mappings" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First 10 results for query: SELECT * FROM sifts.pdb_chain_interpro LIMIT 10\n", "+-----+-----+-----------+----------------+\n", "|pdbid|chain|interpro_id|structureChainId|\n", "+-----+-----+-----------+----------------+\n", "| 101M| A| IPR000971| 101M.A|\n", "| 101M| A| IPR002335| 101M.A|\n", "| 102L| A| IPR001165| 102L.A|\n", "| 102L| A| IPR002196| 102L.A|\n", "| 102L| A| IPR034690| 102L.A|\n", "| 102M| A| IPR000971| 102M.A|\n", "| 102M| A| IPR002335| 102M.A|\n", "| 103L| A| IPR001165| 103L.A|\n", "| 103L| A| IPR002196| 103L.A|\n", "| 103L| A| IPR034690| 103L.A|\n", "+-----+-----+-----------+----------------+\n", "\n" ] } ], "source": [ "interproQuery = \"SELECT * FROM sifts.pdb_chain_interpro LIMIT 10\"\n", "interpro = pdbjMineDataset.get_dataset(interproQuery)\n", "print(f\"First 10 results for query: {interproQuery}\")\n", "interpro.show(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get PDB chain to UniProt mappings" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First 10 results for query: SELECT * FROM sifts.pdb_chain_uniprot LIMIT 10\n", "+-----+-----+----------+-------+-------+-------+-------+------+------+----------------+\n", "|pdbid|chain|sp_primary|res_beg|res_end|pdb_beg|pdb_end|sp_beg|sp_end|structureChainId|\n", "+-----+-----+----------+-------+-------+-------+-------+------+------+----------------+\n", "| 101M| A| P02185| 1| 154| 0| 153| 1| 154| 101M.A|\n", "| 102L| A| P00720| 1| 40| 1| 40| 1| 40| 102L.A|\n", "| 102L| A| P00720| 42| 165| 41| None| 41| 164| 102L.A|\n", "| 102M| A| P02185| 1| 154| 0| 153| 1| 154| 102M.A|\n", "| 103L| A| P00720| 1| 40| 1| None| 1| 40| 103L.A|\n", "| 103L| A| P00720| 44| 167| 41| None| 41| 164| 103L.A|\n", "| 103M| A| P02185| 1| 154| 0| 153| 1| 154| 103M.A|\n", "| 104L| A| P00720| 1| 44| 1| 44| 1| 44| 104L.A|\n", "| 104L| A| P00720| 47| 166| 45| None| 45| 164| 104L.A|\n", "| 104L| B| P00720| 1| 44| 1| 44| 1| 44| 104L.B|\n", "+-----+-----+----------+-------+-------+-------+-------+------+------+----------------+\n", "\n" ] } ], "source": [ "uniprotQuery = \"SELECT * FROM sifts.pdb_chain_uniprot LIMIT 10\"\n", "uniprot = pdbjMineDataset.get_dataset(uniprotQuery)\n", "print(f\"First 10 results for query: {uniprotQuery}\")\n", "uniprot.show(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get PDB chain to taxonomy mappings" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First 10 results for query: SELECT * FROM sifts.pdb_chain_taxonomy LIMIT 10\n", "+-----+-----+------+--------------------+----------------+\n", "|pdbid|chain|tax_id| scientific_name|structureChainId|\n", "+-----+-----+------+--------------------+----------------+\n", "| 101M| A| 9755| PHYCD| 101M.A|\n", "| 101M| A| 9755| Physeter catodon| 101M.A|\n", "| 101M| A| 9755|Physeter catodon ...| 101M.A|\n", "| 101M| A| 9755|Physeter catodon ...| 101M.A|\n", "| 101M| A| 9755|Physeter macrocep...| 101M.A|\n", "| 101M| A| 9755| Sperm whale| 101M.A|\n", "| 101M| A| 9755| sperm whale| 101M.A|\n", "| 102L| A| 10665| BPT4| 102L.A|\n", "| 102L| A| 10665| Bacteriophage T4| 102L.A|\n", "| 102L| A| 10665|Enterobacteria ph...| 102L.A|\n", "+-----+-----+------+--------------------+----------------+\n", "\n" ] } ], "source": [ "taxonomyQuery = \"SELECT * FROM sifts.pdb_chain_taxonomy LIMIT 10\"\n", "taxonomy = pdbjMineDataset.get_dataset(taxonomyQuery)\n", "print(f\"First 10 results for query: {taxonomyQuery}\")\n", "taxonomy.show(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get PDB chain to PFAM mappings" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First 10 results for query: SELECT * FROM sifts.pdb_chain_pfam LIMIT 10\n", "+-----+-----+----------+-------+----------------+\n", "|pdbid|chain|sp_primary|pfam_id|structureChainId|\n", "+-----+-----+----------+-------+----------------+\n", "| 101M| A| P02185|PF00042| 101M.A|\n", "| 102L| A| P00720|PF00959| 102L.A|\n", "| 102M| A| P02185|PF00042| 102M.A|\n", "| 103L| A| P00720|PF00959| 103L.A|\n", "| 103M| A| P02185|PF00042| 103M.A|\n", "| 104L| A| P00720|PF00959| 104L.A|\n", "| 104L| B| P00720|PF00959| 104L.B|\n", "| 104M| A| P02185|PF00042| 104M.A|\n", "| 105M| A| P02185|PF00042| 105M.A|\n", "| 106M| A| P02185|PF00042| 106M.A|\n", "+-----+-----+----------+-------+----------------+\n", "\n" ] } ], "source": [ "pfamQuery = \"SELECT * FROM sifts.pdb_chain_pfam LIMIT 10\"\n", "pfam = pdbjMineDataset.get_dataset(pfamQuery)\n", "print(f\"First 10 results for query: {pfamQuery}\")\n", "pfam.show(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get PDB chain to CATH mappings" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First 10 results for query: SELECT * FROM sifts.pdb_pubmed LIMIT 10\n", "+-----------+-------+---------+\n", "|structureId|ordinal|pubmed_id|\n", "+-----------+-------+---------+\n", "| 100D| 0| 7816639|\n", "| 101D| 0| 7711020|\n", "| 102D| 0| 7608897|\n", "| 102L| 0| 8429913|\n", "| 103D| 0| 7966337|\n", "| 103L| 0| 8429913|\n", "| 104D| 0| 7857947|\n", "| 104L| 0| 8429913|\n", "| 105D| 0| 7743125|\n", "| 106D| 0| 7743125|\n", "+-----------+-------+---------+\n", "\n" ] } ], "source": [ "pubmedQuery = \"SELECT * FROM sifts.pdb_pubmed LIMIT 10\"\n", "pubmed = pdbjMineDataset.get_dataset(pubmedQuery)\n", "print(f\"First 10 results for query: {pubmedQuery}\")\n", "pubmed.show(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get PDB chain to SCOP mappings" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First 10 results for query: SELECT * FROM sifts.pdb_chain_scop_uniprot LIMIT 10\n", "+-----+-----+----------+-----+-------+----------------+\n", "|pdbid|chain|sp_primary|sunid|scop_id|structureChainId|\n", "+-----+-----+----------+-----+-------+----------------+\n", "| 101M| A| P02185|15125|d101ma_| 101M.A|\n", "| 102L| A| P00720|36724|d102la_| 102L.A|\n", "| 102M| A| P02185|15073|d102ma_| 102M.A|\n", "| 103L| A| P00720|36870|d103la_| 103L.A|\n", "| 103M| A| P02185|15124|d103ma_| 103M.A|\n", "| 104L| A| P00720|36969|d104la_| 104L.A|\n", "| 104L| B| P00720|36970|d104lb_| 104L.B|\n", "| 104M| A| P02185|15041|d104ma_| 104M.A|\n", "| 105M| A| P02185|15128|d105ma_| 105M.A|\n", "| 106M| A| P02185|15101|d106ma_| 106M.A|\n", "+-----+-----+----------+-----+-------+----------------+\n", "\n" ] } ], "source": [ "scopQuery = \"SELECT * FROM sifts.pdb_chain_scop_uniprot LIMIT 10\"\n", "scop = pdbjMineDataset.get_dataset(scopQuery)\n", "print(f\"First 10 results for query: {scopQuery}\")\n", "scop.show(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get PDB chain to Enzyme classification (EC) mappings" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First 10 results for query: SELECT * FROM sifts.pdb_chain_enzyme LIMIT 10\n", "+-----+-----+---------+---------+----------------+\n", "|pdbid|chain|accession|ec_number|structureChainId|\n", "+-----+-----+---------+---------+----------------+\n", "| 102L| A| P00720| 3.2.1.17| 102L.A|\n", "| 103L| A| P00720| 3.2.1.17| 103L.A|\n", "| 104L| A| P00720| 3.2.1.17| 104L.A|\n", "| 104L| B| P00720| 3.2.1.17| 104L.B|\n", "| 107L| A| P00720| 3.2.1.17| 107L.A|\n", "| 108L| A| P00720| 3.2.1.17| 108L.A|\n", "| 109L| A| P00720| 3.2.1.17| 109L.A|\n", "| 10GS| A| P09211| 2.5.1.18| 10GS.A|\n", "| 10GS| B| P09211| 2.5.1.18| 10GS.B|\n", "| 10MH| A| P05102| 2.1.1.37| 10MH.A|\n", "+-----+-----+---------+---------+----------------+\n", "\n" ] } ], "source": [ "enzymeQuery = \"SELECT * FROM sifts.pdb_chain_enzyme LIMIT 10\"\n", "enzyme = pdbjMineDataset.get_dataset(enzymeQuery)\n", "print(f\"First 10 results for query: {enzymeQuery}\")\n", "enzyme.show(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get PDB chain to Gene Ontology term mappings" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First 10 results for query: SELECT * FROM sifts.pdb_chain_go LIMIT 10\n", "+-----+-----+----------+--------------------+--------+----------+----------------+\n", "|pdbid|chain|sp_primary| with_string|evidence| go_id|structureChainId|\n", "+-----+-----+----------+--------------------+--------+----------+----------------+\n", "| 107L| A| P00720| n| IDA|GO:0003796| 107L.A|\n", "| 108L| A| P00720| n| IDA|GO:0003796| 108L.A|\n", "| 109L| A| P00720| n| IDA|GO:0003796| 109L.A|\n", "| 10GS| A| P09211| GO:0005515| IC|GO:0010804| 10GS.A|\n", "| 10GS| A| P09211|UniProtKB:P19712:...| IPI|GO:0005515| 10GS.A|\n", "| 10GS| A| P09211| UniProtKB:P30041| IPI|GO:0005515| 10GS.A|\n", "| 10GS| A| P09211| UniProtKB:P60409| IPI|GO:0005515| 10GS.A|\n", "| 10GS| A| P09211| UniProtKB:P60411| IPI|GO:0005515| 10GS.A|\n", "| 10GS| A| P09211| UniProtKB:Q12933| IPI|GO:0005515| 10GS.A|\n", "| 10GS| A| P09211| UniProtKB:Q15323| IPI|GO:0005515| 10GS.A|\n", "+-----+-----+----------+--------------------+--------+----------+----------------+\n", "\n" ] } ], "source": [ "goQuery = \"SELECT * FROM sifts.pdb_chain_go LIMIT 10\"\n", "go = pdbjMineDataset.get_dataset(goQuery)\n", "print(f\"First 10 results for query: {goQuery}\")\n", "go.show(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Terminate Spark" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "spark.stop()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }