{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Custom Report Demo\n", "\n", "This demo shows how to create and query a dataset. Here, the dataset is generated by calling an RCSB PDB web service that creates a custom report of PDB annotations.\n", "\n", "[PDB custom report](http://www.rcsb.org/pdb/results/reportField.do)\n", "\n", "## Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from pyspark import SparkConf, SparkContext, SQLContext\n", "from mmtfPyspark.datasets import customReportService\n", "import time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure Spark" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "conf = SparkConf().setMaster(\"local[*]\") \\\n", " .setAppName(\"customReportDemo\")\n", "sc = SparkContext(conf = conf)\n", "sqlContext = SQLContext(sc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrieve PDB annotation:\n", "Binding affinities (Ki, Kd), the group name of the ligand (hetId), and the Enzyme Classification number (ecNo)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "ds = customReportService.get_dataset([\"Ki\",\"Kd\",\"hetId\",\"ecNo\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Show the schema of this dataset" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- structureChainId: string (nullable = true)\n", " |-- structureId: string (nullable = true)\n", " |-- chainId: string (nullable = true)\n", " |-- Ki: string (nullable = true)\n", " |-- Kd: string (nullable = true)\n", " |-- hetId: string (nullable = true)\n", " |-- ecNo: string (nullable = true)\n", "\n" ] } ], "source": [ "ds.printSchema()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering\n", "\n", "### Select 
structures that have Ki or Kd value(s) and are protein-serine/threonine kinases (EC 2.7.11.*)\n", "\n", "\n", "#### A. By using dataset operations" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------------+-----------+-------+--------------------+------------+-----+---------+\n", "|structureChainId|structureId|chainId| Ki| Kd|hetId| ecNo|\n", "+----------------+-----------+-------+--------------------+------------+-----+---------+\n", "| 2CLQ.A| 2CLQ| A| null|11-120 (BDB)| STU|2.7.11.25|\n", "| 2CLQ.B| 2CLQ| B| null|11-120 (BDB)| STU|2.7.11.25|\n", "| 2E9N.A| 2E9N| A| 6.3 (BDB)| null| 76A| 2.7.11.1|\n", "| 2E9O.A| 2E9O| A| 20 (BDB)| null| A58| 2.7.11.1|\n", "| 2E9P.A| 2E9P| A| 20 (BDB)| null| 77A| 2.7.11.1|\n", "| 2E9U.A| 2E9U| A|7.94 (PDBbind)#7....| null| A25| 2.7.11.1|\n", "| 2E9V.A| 2E9V| A|12.59 (PDBbind)#1...| null| 85A| 2.7.11.1|\n", "| 2E9V.B| 2E9V| B|12.59 (PDBbind)#1...| null| 85A| 2.7.11.1|\n", "| 2GNF.A| 2GNF| A|6000 (BMOAD_9806)...| null| Y27|2.7.11.11|\n", "| 2GNH.A| 2GNH| A|149 (BMOAD_9880)#...| null| H52|2.7.11.11|\n", "+----------------+-----------+-------+--------------------+------------+-----+---------+\n", "only showing top 10 rows\n", "\n" ] } ], "source": [ "ds = ds.filter(\"(Ki IS NOT NULL OR Kd IS NOT NULL) AND ecNo LIKE '2.7.11.%'\")\n", "\n", "ds.show(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### B. 
By creating a temporary view and running SQL" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------------+-----------+-------+--------------------+------------+-----+---------+\n", "|structureChainId|structureId|chainId| Ki| Kd|hetId| ecNo|\n", "+----------------+-----------+-------+--------------------+------------+-----+---------+\n", "| 2CLQ.A| 2CLQ| A| null|11-120 (BDB)| STU|2.7.11.25|\n", "| 2CLQ.B| 2CLQ| B| null|11-120 (BDB)| STU|2.7.11.25|\n", "| 2E9N.A| 2E9N| A| 6.3 (BDB)| null| 76A| 2.7.11.1|\n", "| 2E9O.A| 2E9O| A| 20 (BDB)| null| A58| 2.7.11.1|\n", "| 2E9P.A| 2E9P| A| 20 (BDB)| null| 77A| 2.7.11.1|\n", "| 2E9U.A| 2E9U| A|7.94 (PDBbind)#7....| null| A25| 2.7.11.1|\n", "| 2E9V.A| 2E9V| A|12.59 (PDBbind)#1...| null| 85A| 2.7.11.1|\n", "| 2E9V.B| 2E9V| B|12.59 (PDBbind)#1...| null| 85A| 2.7.11.1|\n", "| 2GNF.A| 2GNF| A|6000 (BMOAD_9806)...| null| Y27|2.7.11.11|\n", "| 2GNH.A| 2GNH| A|149 (BMOAD_9880)#...| null| H52|2.7.11.11|\n", "+----------------+-----------+-------+--------------------+------------+-----+---------+\n", "only showing top 10 rows\n", "\n" ] } ], "source": [ "ds.createOrReplaceTempView(\"table\")\n", "\n", "ds = sqlContext.sql(\"SELECT * from table WHERE (Ki IS NOT NULL OR Kd IS NOT NULL) AND ecNo LIKE '2.7.11.%'\")\n", "\n", "ds.show(10)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Terminate Spark" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "sc.stop()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }