{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Create Representative Set Demo\n",
    "\n",
    "Creates an MMTF-Hadoop Sequence file for a Picses representative set of protein chains.\n",
    "\n",
    "\n",
    "## References\n",
    "\n",
    "Please cite the following in any work that uses lists provided by PISCES G. Wang and R. L. Dunbrack, Jr. PISCES: a protein sequence culling server. Bioinformatics, 19:1589-1591, 2003.\n",
    "[PISCES](http://dunbrack.fccc.edu/PISCES.php)\n",
    "\n",
    "\n",
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark import SparkConf, SparkContext\n",
    "from mmtfPyspark.io import mmtfReader, mmtfWriter\n",
    "from mmtfPyspark.mappers import StructureToPolymerChains\n",
    "from mmtfPyspark.filters import PolymerComposition\n",
    "from mmtfPyspark.webfilters import Pisces"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configure Spark"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "conf = SparkConf().setMaster(\"local[*]\") \\\n",
    "                  .setAppName(\"CreateRepresentativeSetDemo\")\n",
    "sc = SparkContext(conf = conf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read in Haddop Sequence Files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = \"../../resources/mmtf_full_sample/\"\n",
    "\n",
    "pdb = mmtfReader.read_sequence_file(path, sc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filter by representative protein chains at 40% sequence identity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "sequenceIdentity = 40\n",
    "resolution = 2.0\n",
    "\n",
    "pdb = pdb.filter(Pisces(sequenceIdentity, resolution)) \\\n",
    "         .flatMap(StructureToPolymerChains()) \\\n",
    "         .filter(Pisces(sequenceIdentity, resolution)) \\\n",
    "         .filter(PolymerComposition(PolymerComposition.AMINO_ACIDS_20))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Show top 10 structures"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('1FYE.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef707b1a58>),\n",
       " ('1FXL.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef702727f0>),\n",
       " ('1FVI.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef701a12b0>),\n",
       " ('1FV1.F', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef5d39d0f0>),\n",
       " ('1FTR.D', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef5d341048>),\n",
       " ('1FT5.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef43e000b8>),\n",
       " ('1FSG.C', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef43caf0b8>),\n",
       " ('1FS1.C', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef43ad1438>),\n",
       " ('1FR3.L', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef43aa2c18>),\n",
       " ('1FPZ.C', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7fef43a19358>)]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pdb.top(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save representative set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "write_path = f'./pdb_representatives_{sequenceIdentity}'\n",
    "\n",
    "mmtfWriter.write_sequence_file(write_path, sc, pdb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Terminate Spark"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.stop()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}