{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PDB Meta Data Demo\n", "\n", "This demo shows how to query metadata from the PDB archive.\n", "\n", "This exmaple queries the \\_citation category. Each category represents a table, and fields represent database columns. [Avalible tables and columns](https://pdbj.org/mine-rdb-docs)\n", "\n", "Example data from 100D.cif\n", " * _citation.id primary \n", " * _citation.title Crystal structure of ...\n", " * _citation.journal_abbrev 'Nucleic Acids Res.' \n", " * _citation.journal_volume 22 \n", " * _citation.page_first 5466 \n", " * _citation.page_last 5476 \n", " * _citation.year 1994 \n", " * _citation.journal_id_ASTM NARHAD \n", " * _citation.country UK \n", " * _citation.journal_id_ISSN 0305-1048 \n", " * _citation.journal_id_CSD 0389 \n", " * _citation.book_publisher ? \n", " * _citation.pdbx_database_id_PubMed 7816639 \n", " * _citation.pdbx_database_id_DOI 10.1093/nar/22.24.5466 \n", "\n", "Data are probided through [Mine 2 SQL](https://pdbj.org/help/mine2-sql)\n", "\n", "Queries can be designed with the interactive [PDBj Mine 2 query service](https://pdbj.org/help/mine2-sql)\n", "\n", "## Imports" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "from pyspark.sql.functions import col\n", "from mmtfPyspark.datasets import pdbjMineDataset\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure Spark" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "spark = SparkSession.builder\\\n", " .master(\"local[*]\")\\\n", " .appName(\"PDBMetaDataDemo\")\\\n", " .getOrCreate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Query PDBj Mine\n", "\n", "Query the following fields from the \\citation category using PDBj's Mine 2 web service:\n", " * journal_abbrev\n", " * pdbx_database_id_PubMed\n", " * year\n", "\n", "Note: mixed case column names must be quoted and escaped with \\\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "sqlQuery = \"SELECT pdbid, journal_abbrev, \\\"pdbx_database_id_PubMed\\\", year from citation WHERE id = 'primary'\"\n", "\n", "ds = pdbjMineDataset.get_dataset(sqlQuery)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Show first 10 results from query" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----------+------------------+-----------------------+----+\n", "|structureId|journal_abbrev |pdbx_database_id_PubMed|year|\n", "+-----------+------------------+-----------------------+----+\n", "|100D |Nucleic Acids Res.|7816639 |1994|\n", "|101D |Biochemistry |7711020 |1995|\n", "|101M |Thesis, Rice |-1 |1999|\n", "|102D |J.Med.Chem. |7608897 |1995|\n", "|102L |Nature |8429913 |1993|\n", "|102M |Thesis, Rice |-1 |1999|\n", "|103D |J.Mol.Biol. |7966337 |1994|\n", "|103L |Nature |8429913 |1993|\n", "|103M |Thesis, Rice |-1 |1999|\n", "|104D |Biochemistry |7857947 |1995|\n", "+-----------+------------------+-----------------------+----+\n", "only showing top 10 rows\n", "\n" ] } ], "source": [ "ds.show(10, False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filter out unpublished entries\n", "\n", "Published entires contain the word \"published\" in various upper/lower case combinations" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "ds = ds.filter(\"UPPER(journal_abbrev) NOT LIKE '%PUBLISHED%'\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Show the top 10 journals that publish PDB structures" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------------------+-----+\n", "|journal_abbrev |count|\n", "+------------------------+-----+\n", "|J.Biol.Chem. |10479|\n", "|J.Mol.Biol. |10311|\n", "|Biochemistry |9877 |\n", "|Proc.Natl.Acad.Sci.USA |5633 |\n", "|Structure |5191 |\n", "|Acta Crystallogr.,Sect.D|4253 |\n", "|J.Med.Chem. |3871 |\n", "|Nature |3263 |\n", "|Nat Commun |2519 |\n", "|Protein Sci. |2374 |\n", "+------------------------+-----+\n", "only showing top 10 rows\n", "\n" ] } ], "source": [ "ds.groupBy(\"journal_abbrev\").count().sort(col(\"count\").desc()).show(10,False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filter out entries without a PubMed Id \n", "\n", "-1 if PubMed Id is not available" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Entires with PubMed Ids: 113405\n" ] } ], "source": [ "ds = ds.filter(\"pdbx_database_id_PubMed > 0\")\n", "\n", "print(f\"Entires with PubMed Ids: {ds.count()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Show growth of papers in PubMed" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PubMed Ids per year: \n", "+----+-----+\n", "|year|count|\n", "+----+-----+\n", "|2018|1386 |\n", "|2017|8907 |\n", "|2016|8806 |\n", "|2015|8187 |\n", "|2014|7556 |\n", "|2013|7693 |\n", "|2012|7197 |\n", "|2011|6178 |\n", "|2010|6040 |\n", "|2009|5522 |\n", "+----+-----+\n", "only showing top 10 rows\n", "\n" ] } ], "source": [ "print(\"PubMed Ids per year: \")\n", "idsPerYear = ds.groupBy(\"year\").count().sort(col(\"year\").desc())\n", "idsPerYear.show(10, False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Make scatter plot for growth of papers in PubMed" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5,1,'Growth of papers in PubMed each year')" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZUAAAEWCAYAAACufwpNAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4wLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvFvnyVgAAIABJREFUeJzt3Xu8XFV99/HPlxD0AJbDJUVyIASfIhRMMXAElFYBlQBekgdbRRFToeVptYKtRUntU/CCUbG1+vhopdwCWqkFGmhLjakCFhDkhIvcjAQMJIcglySAJEICv/6x1pDJYfZk5pw958zl+3695nVm1r7MWnvm7N+sy15bEYGZmVkZtproDJiZWfdwUDEzs9I4qJiZWWkcVMzMrDQOKmZmVhoHFTMzK42DijVF0rWS/qikfR0m6T5Jv5I0p4x9TjRJfyXpvAl43+WS3jLO7/mHkq4f5/cs7ftnreGg0gEkHS/pZknPSHo0P/+QJLX4fc+S9K0WvsWnga9FxPYRsbCF7zNuIuJzETGqk56kiyQ9l4PsakmLJe071jxJOlxSSPrXEekH5PRrx/oeZhUOKm1O0seArwDnAK8EdgX+BDgM2KZgm0njlsGx2RO4e6IzUY+S8fw/+WJEbA/sDjwKXFTSfh8DXi9p56q0ucDPS9p/z5K09UTnoZ04qLQxSTuQfs1/KCIui4inI7ktIk6IiGfzehdJ+oakqyU9AxwhaQdJF0t6TNKDkv66cnLMrw/Kz0/Iv1b3z69PlrRQ0tHAXwHvyb+c76jK2p6SbpD0tKTvS9qlThn+WNKy/Mv7KklTc/r9wKuAf8v7f1mNbZdLmifpHklrJF0o6eV52Y6S/j2Xb01+vnvVttdKmi/pJ5KeknSlpJ2qlh8q6UZJayXdIenwEdueLekGYB3wqtzU80Au8y8knVBQ3hdrd5Km52M7V9JDkh6X9MnCD7xKRKwD/gl4Td7XRZI+W/U+h0taOWKz19U6VtlzwELg+Lz9JOA9wLdH5H/fXENaLWmppHdXLds5f4ZPSfoJ8L/qlWELx/iDku7Nx/MBSf9nxLazJd2e3+v+/H2saOj7J+kuSe+oej05fwYzx5K/yrGX9AlJjwAX1jsOPSci/GjTB3A0sBHYegvrXQQ8Saq9bAW8HLgYuBJ4BTCd9Iv05Lz+xcDH8vNzgfuBP61a9uf5+VnAt0a817V5/VcDffn15wvydSTwOHAg8DLg/wE/qlq+HHhLnXItB+4C9gB2Am4APpuX7Qy8C9g2l/FfgIUj8jlMOilvB1xeKQswADwBHJuP11vz6ylV2z4E7A9sDewAPAXsk5fvBuxfkOezqt5nOhDAP+ZjdQDwLPDbdT7HSvm2JwWV/x65LL8+HFjZ4LE6HFgJvAG4OacdCywC/gi4NqdtB6wAPpjLPTN/fvvl5ZcC383rvSYf3+sLyrKlY/w2UlAS8CZS8D4wLzuY9H1+a952ANh3FN+/jwP/XPV6NnBnCfk7nPR/+QXS97pvos8V7fRwTaW97QI8HhEbKwlVv6zWS3pj1bpXRsQNEfECsIH0i3RepNrNcuBvgRPzuteR/lEAfg+YX/X6TXl5PRdGxM8jYj3pJPPagvVOAC6IiFsj1armkZpgpm+p4FW+FhErImI1cDbwXoCIeCIiLo+IdRHxdF72phHbXhIRd0XEM8D/Bd6df6G/H7g6Iq6OiBciYjEwRDrBVFwUEXfnY78ReAF4jaS+iFgVEc00230qItZHxB3AHaTgUuQvJa0FlpECyx828T41j1VFRNwI7CRpH+ADpB8Q1d4OLI+ICyNiY0TcRgrGf5CP27uAv4mIZyLiLmBBnbzUPcYR8R8RcX8k1wHfJ30XAU4mfW8W522HI+JnVftu9Pv3LeBYSb+RX58IXFJC/iB9H86MiGdzPixzUGlvTwC7qKrNNiLeEBH9eVn157ei6vkuwGTgwaq0B0m/ziAFjd+TtBswifSPeVg+2e8A3L6FfD1S9Xwd6eRXy9TqPETEr3K+BwrWr6W6XA/mfSJpW0nfVGrKewr4EdCvzfuTRm47mXRs9iSdKNdWHsDvkmogL9k2B6X3kPqyVkn6DzXXgd7o8QL4UkT0R8QrI+KdEXF/E+9T81iNcAnwZ8ARwL+OWLYncMiI43ICqS9vCqn2MvI9itQ9xpKOkXRTbmZbSzqZV5qx9iDVRoo0dDwj4mFSje1dkvqBY9jU3DeW/AE8FhG/rpPHnuUOpvb2Y1JzyWzSL8Z6qqebfpxUW9kTuCenTSM1VxARyyStAz5Cao56KrcNn0Jqznihxj5H4+GcBwAkbUdqthpuYh97VD2flvcJ8DFgH+CQiHhE0muB20jNFUXbbiAdmxWkWswf13nfzcoeEYuARZL6gM+SmrR+r9aGLfIMqamv4pU11ik6VtUuIdWCLo6Iddp8AOEK4LqIeOvIjXKw3pjfo1JrmFYnv4XHWKn/7HJSbenKiNggaSGbPrsVbKG/pgkLSE18WwM/jojKd28s+YOx/290LddU2lhErAU+BXxd0u9LeoWkrfIJdLs62z1Pqn2cnbfZE/gLUnNAxXWkX6yVpq5rR7wG+CUwXaMf/fQd4IOSXpv/UT9HatNf3sQ+Pixpd6VO9k8C/5zTXwGsB9bmZWfW2Pb9kvaTtC1pwMNl+dh8C3iHpFmSJkl6ee583b3GPpC0a+443o4U5H9Fav4YT7eTmnJ2kvRK4KM11ik6Vi+KiF+QmglrDRj4d+DVkk7MndqTJb1O0m/n43YFcFauJe5HGj1WpN4x3obUF/EYsFHSMcBRVdueT/revDl/3wearBlWW0jq0zuNzZv7xpI/q8NBpc1FxBdJAeHjpJP8L4FvAp8Abqyz6UdIv24fAK4ndfpeULX8OtKJ+UcFryF1fgM8IenWUeT9v0h9GZcDq0i/Po9vcjf/RGrPfoDUJFIZAfX3pI7ax4GbgO/V2PYSUgf3I6TBC6fmfK0g1f7+inTiWAGcTvH/w1akz+BhYDXppPynTZZjrC4h9ccsJx2PlwQMio/VZiLi+tw0NDL9adLJ83hSWR9hU2c0pB8d2+f0i6gz6qneMc7vcyrph88a4H3AVVXb/oQ0WODLpA7766iq8TYj93dcDuxFCopjzp/VpwjX4qw9SVoO/FEOTs1uey1pFNa4X91u7UXS3wCvjoj3T3ReeoH7VMysa+WmwJPZNPLRWszNX2bWlST9MalZ6z8j4kdbWt/K4eYvMzMrjWsqZmZWmq7sU9lll11i+vTpE50NM7OOsmTJkscjYspY9tGVQWX69OkMDQ1NdDbMzDqKpHqzJDTEzV9mZlYaBxUzMyuNg4qZmZXGQcXMzErjoGJmZqXpytFfZmbdauFtw5yzaCkPr13P1P4+Tp+1D3NmNnOLotZyUDEz6xALbxtm3hV3sn7D8wAMr13PvCvuBGibwOKgYmbWZopqI+csWvpiQKlYv+F5zlm01EHFzMxeql5t5OG162tuU5Q+EdxRb2bWRurVRqb299Xcpih9IjiomJm1kXq1kdNn7UPf5EmbpfdNnsTps/YZj6w1xEHFzKyN1KuNzJk5wPzjZjDQ34eAgf4+5h83o236U8B9KmZmbeX0Wfts1qcCm9dG5swcaKsgMpKDiplZG6kEjHa+FqUeBxUzszbT7rWRetynYmZmpXFQMTOz0jiomJlZaRxUzMysNA4qZmZWGgcVMzMrjYcUm5lNkHa/N8poOKiYmU2ATrg3ymi4+cvMbALUm424kzmomJlNgE64N8potDSoSPpzSXdLukvSdyS9XNJekm6WtEzSP0vaJq/7svx6WV4+vWo/83L6UkmzWplnM7Px0An3RhmNlgUVSQPAqcBgRLwGmAQcD3wB+HJE/BawBjg5b3IysCanfzmvh6T98nb7A0cDX5e0+Q0FzMw6TCfcG2U0Wt38tTXQJ2lrYFtgFXAkcFlevgCYk5/Pzq/Jy98sSTn90oh4NiJ+ASwDDm5xvs3MWqoT7o0yGi0b/RURw5K+BDwErAe+DywB1kbExrzaSqByBAeAFXnbjZKeBHbO6TdV7bp6mxdJOgU4BWDatGmll8fMrGydPBtxkVY2f+1IqmXsBUwFtiM1X7VERJwbEYMRMThlypRWvY2ZmdXRyuavtwC/iIjHImIDcAVwGNCfm8MAdgeG8/NhYA+AvHwH4Inq9BrbmJlZG2llUHkIOFTStrlv5M3APcA1wO/ndeYCV+bnV+XX5OU/jIjI6cfn0WF7AXsDP2lhvs3MbJRa2adys6TLgFuBjcBtwLnAfwCXSvpsTjs/b3I+cImkZcBq0ogvIuJuSd8lBaSNwIcjYvMrhszMJlg3TrkyGkqVge4yODgYQ0NDE50NM+sRI6dcgTQ8uNNGc0laEhGDY9mHr6g3Mxujbp1yZTQ8oaSZ2RjVm3Kl15rFXFMxMxujoqlV+redzLwr7mR47XqCTTMRL7ytewewOqiYmY1R0ZQrEfRcs5iDipnZGBVNufLk+g011+/0mYjrcZ+KmVkJak25cs6ipQzXCCCdPhNxPa6pmJm1SLfORFyPaypmZi1Sqbn00ugvBxUzsxbqxpmI63FQMTNrQq9dd9IsBxUzswaNnI6lct0J4MCSuaPezKxBno5lyxxUzMwaVG86FkscVMzMGlR0fUk3X3fSLAcVM7MG9eJ1J81yR72ZWYN68bqTZjmomJk1odeuO2mWm7/MzKw0DipmZlYaBxUzMyuNg4qZmZXGQcXMzErjoGJmZqVxUDEzs9I4qJiZWWkcVMzMrDQOKmZmVhoHFTMzK42DipmZlcZBxczMSuOgYmZmpXFQMTOz0vh+KmZmIyy8bdg34holBxUzsyoLbxtm3hV3sn7D8wAMr13PvCvuBHBgaYCbv8zMqpyzaOmLAaVi/YbnOWfR0gnKUWdxTcXMelatZq6H166vuW5Rum2upTUVSf2SLpP0M0n3Snq9pJ0kLZZ0X/67Y15Xkr4qaZmkn0o6sGo/c/P690ma28o8m1lvqDRzDa9dT7CpmWuHvsk115/a3ze+GexQrW7++grwvYjYFzgAuBc4A/hBROwN/CC/BjgG2Ds/TgG+ASBpJ+BM4BDgYODMSiAyMxutomYuCfomT9osvW/yJE6ftc94Zq9jtSyoSNoBeCNwPkBEPBcRa4HZwIK82gJgTn4+G7g4kpuAfkm7AbOAxRGxOiLWAIuBo1uVbzPrDUXNWWvXbWD+cTMY6O9DwEB/H/OPm+FO+ga1sk9lL+Ax4EJJBwBLgNOAXSNiVV7nEWDX/HwAWFG1/cqcVpS+GUmnkGo4TJs2rbxSmFlXmtrfx3CNwDK1v485MwccREaplc1fWwMHAt+IiJnAM2xq6gIgIgKIMt4sIs6NiMGIGJwyZUoZuzSzLnb6rH3czNUCrQwqK4GVEXFzfn0ZKcj8Mjdrkf8+mpcPA3tUbb97TitKNzMbtTkzB9zM1QIta/6KiEckrZC0T0QsBd4M3JMfc4HP579X5k2uAv5M0qWkTvknI2KVpEXA56o6548C5rUq32bWO9zMVb5WX6fyEeDbkrYBHgA+SKodfVfSycCDwLvzulcDxwLLgHV5XSJitaTPALfk9T4dEatbnG8zMxsFpW6N7jI4OBhDQ0MTnQ0zs44iaUlEDI5lH56mxczMSuNpWsys63nW4fHjoGJmXc2zDo8vN3+ZWVfzrMPjy0HFzLqaZx0eXw4qZtbVimYX9qzDreGgYmZdzdOxjC931JtZV6t0xnv01/hwUDGzjjKa4cGejmX8OKiYWceoNzwYXBtpBw4qZtYxioYHf+rf7ubXG17wtShtwB31ZtYxioYBr1m3wdeitAkHFTPrGM0OA/a1KOPPQcXMOkbR8OD+vsk11/e1KOOvoaAi6TBJ2+Xn75f0d5L2bG3WzMw2V3S3xrPeub+vRWkTjXbUfwM4QNIBwMeA84CLgTe1KmNmZrXUGx7s0V8Tr9GgsjEiQtJs4GsRcX6+c6OZWVvwtSjtodGg8rSkecD7gTdK2gqo3YhpZmY9q9GO+vcAzwInR8QjwO7AOS3LlZmZdaQt1lQkTQK+ExFHVNIi4iFSn4qZmdmLtlhTiYjngRck7TAO+TEzsw7WaJ/Kr4A7JS0GnqkkRsSpLcmVmfU831e+MzUaVK7IDzOzlvN95TtXQ0ElIhZI6gOmRYQn0zGzlqp3X3kHlfbW6BX17wBuB76XX79W0lWtzJiZ9S7fV75zNTqk+CzgYGAtQETcDryqRXkysx7n+8p3rkaDyoaIeHJE2gtlZ8bMDHxf+U7WaEf93ZLeB0yStDdwKnBj67JlZr3M95XvXI0GlY8AnyRdVf8dYBHwmVZlyszMc3l1pkZHf60DPinpC+llPN3abJmZWSdqKKhIeh1wAfCK/PpJ4KSIWNLCvJlZl/MFjt2n0eav84EPRcR/A0j6XeBC4HdalTEz626+wLE7NTr66/lKQAGIiOuBja3Jkpn1gnoXOFrnarSmcp2kb5I66YM0Ff61kg4EiIhbW5Q/M+tSvsCxOzUaVA7If88ckT6TFGSOLC1HZtYTpvb3MVwjgPgCx87W6OivI7a8Vm35fixDwHBEvF3SXsClwM7AEuDEiHhO0stI92g5CHgCeE9ELM/7mAecDDwPnBoRi0abHzMbf7U65E+ftc9mfSrgCxy7QaN9Kkh6m6SPS/qbyqPBTU8D7q16/QXgyxHxW8AaUrAg/12T07+c10PSfsDxwP7A0cDXc6Aysw5Q6ZAfXrueYPMO+fnHzWCgvw8BA/19zD9uhjvpO1yjQ4r/AdgWOAI4D/h94CcNbLc78DbgbOAvJInUVPa+vMoC0rxi3wBm5+cAlwFfy+vPBi6NiGeBX0haRpqH7MeN5N3MJla9DvkbzjjSQaTLNFpTeUNEfIBUk/gU8Hrg1Q1s9/fAx9k0T9jOwNqIqIwcWwlUvlEDwAqAvPzJvP6L6TW2eZGkUyQNSRp67LHHGiyWmbWaO+R7S6NBpfLpr5M0FdgA7FZvA0lvBx4drwskI+LciBiMiMEpU6aMx1uaWQM843BvaTSo/LukfuCLpM715aThxfUcBrxT0nJSx/yRwFeAfkmVZrfdgeH8fBjYAyAv34HUYf9ieo1tzKzNecbh3tJoUPkScBJwIqkv44ukfpJCETEvInaPiOmkjvYfRsQJwDWkPhmAucCV+flV+TV5+Q8jInL68ZJelkeO7U0D/Tlm1h7mzBxwh3wPafQ6lQXA08BX8+v3kYb/vnsU7/kJ4FJJnwVuI00BQ/57Se6IX00KRETE3ZK+C9xDuor/wxHx/Et3a2btyjMO9w6lysAWVpLuiYj9tpTWLgYHB2NoaGiis2Fm1lEkLYmIwbHso9Hmr1slHVr1xoeQLmg0MzN7UaPNXwcBN0p6KL+eBiyVdCfp/iqerdjMzBoOKke3NBdm1hV8fxRrdO6vB1udETPrbL4/ikETc3+ZmdXj+6MYOKiYWUk8HYuBg4qZlcTTsRg4qJhZSTwdi0Hjo7/MzOqqdMZ79Fdvc1Axs9J4OhZz85eZmZXGQcXMzErjoGJmZqVxUDEzs9I4qJiZWWk8+svMmuJJI60eBxUza5gnjbQtcfOXmTXMk0baljiomFnDPGmkbYmbv8ysplp9J1P7+xiuEUA8aaRVuKZiZi9R6TsZXrueYFPfyRH7TvGkkVaXg4qZvURR38k1P3uM+cfNYKC/DwED/X3MP26GO+ntRW7+MrOXqNd34kkjrR7XVMzsJXzDLRstBxUzewnfcMtGy81fZvYSvuGWjZaDipnV5L4TGw0HFbMe57m8rEwOKmY9zHN5WdncUW/WwzyXl5XNQcWsh3kuLyubg4pZD/P1KFY2BxWzHubrUaxs7qg36wFFI7x8PYqVzUHFrMttaYSXr0exMrWs+UvSHpKukXSPpLslnZbTd5K0WNJ9+e+OOV2SvippmaSfSjqwal9z8/r3SZrbqjybdSOP8LLx1Mo+lY3AxyJiP+BQ4MOS9gPOAH4QEXsDP8ivAY4B9s6PU4BvQApCwJnAIcDBwJmVQGRmW+YRXjaeWhZUImJVRNyanz8N3AsMALOBBXm1BcCc/Hw2cHEkNwH9knYDZgGLI2J1RKwBFgNHtyrfZt3GI7xsPI3L6C9J04GZwM3ArhGxKi96BNg1Px8AVlRttjKnFaWbWQM8wsvGU8s76iVtD1wOfDQinpL04rKICElR0vucQmo2Y9q0aWXs0qwreISXjaeWBhVJk0kB5dsRcUVO/qWk3SJiVW7eejSnDwN7VG2+e04bBg4fkX7tyPeKiHOBcwEGBwdLCVRmnabe0GEHERsPrRz9JeB84N6I+LuqRVcBlRFcc4Erq9I/kEeBHQo8mZvJFgFHSdoxd9AfldPMrEpl6PDw2vUEm4YOL7xteKKzZj2klX0qhwEnAkdKuj0/jgU+D7xV0n3AW/JrgKuBB4BlwD8CHwKIiNXAZ4Bb8uPTOc3MqnjosLWDljV/RcT1gAoWv7nG+gF8uGBfFwAXlJc7s+7jocPWDjz3l1mX8NBhawcOKmZdwkOHrR147i+zLuGhw9YOHFTMOpCHDlu7clAx6zC+r7y1MwcVsw5Tb+iwg0rvKqq9jjcHFbM2VXSS8NBhG6mdaq8e/WXWhupdHe+hwzZSO1346qBi1obqnSQ8dNhGaqfaq4OKWRuqd5KYM3OA+cfNYKC/DwED/X3MP26G+1N6WDvVXt2nYtaGpvb3MVwjsFROEh46bNVOn7XPZn0qMHG1VwcVswlWq0O+nU4S1v7a6cJXpXkcu8vg4GAMDQ1NdDbMtmjkqB1IwWP+cTOA9jhJWO+QtCQiBseyD9dUzCZQvQ75G8440kHEOo476s0mUDuN2jErg2sqZuOkVt/JljrkzTqNaypm46DoYsYj9p3ia06sqziomI2Dor6Ta372mK85sa7i5i+zcbClixkdRKxbuKZiNg7a6Ypns1ZyUDEbB56vy3qFm7/MSlTvjozgixmt+zmomJVkS/e0cN+J9QI3f5mVpJ3uaWE2URxUzEriq+PNHFTMSuMRXmbuUzEbFU9Xb1abaypmTSqacgXw1fHW81xTMaujVo3E09WbFXNQMStQNER4ZECpcIe8mZu/zAoV1UgmSTXXd4e8mWsqZoVXwRfVPJ6PoG/yJHfIm9XgoGI9o1bwAAqvgi+6gdZAVd+Kp1wx25wiYqLzULrBwcEYGhqa6GxYGxnZPwKpdvGyrbdi7foNL1m/EjhqbeMRXdatJC2JiMGx7MM1FetIRU1WRcuK+kfqdbp7Ekiz5jmo2KjVO7E3u37RsmabrIqWFQWPIpVOd08Cadacjmn+knQ08BVgEnBeRHy+aN3RNn81c2Krl17me5Sdr7L2VdScNP+4GcBLf90DddevtexdBw1w+ZLhppqsgJr9IJMknq/xXd9x28n8esMLbuIyo5zmr44IKpImAT8H3gqsBG4B3hsR99RafzRBpegkWXRiK0qvdzJq9j1Gc8Idr32ds2hpzZN30Um6zEBQpDLQt2iLWiO2ioKgA4r1ol4KKq8HzoqIWfn1PICImF9r/dEElcM+/8OmTmxF6QP9fdxwxpGlvMdoTrjjta+H8xQlY7WlQNCMemX0iC2zLeuljvoBYEXV65XAIdUrSDoFOAVg2rRpTb9BvWsSmkmvd1V1s+9Rb19l5ms0+yoabtusqSU2WdVrZht5B0Yza42uuaI+Is6NiMGIGJwyZUrT2xddDV109fRorqpu9j2m9veVmq8y91V0z/X+vsk1t9lx28mF92gv2td7D9mjZvqZ79i/cOLGOTMHPKmj2QTqlJrKMLBH1evdc1ppiq5JaLa/od5V1c2+R71f3qPNV1n7KhpuW/QeZ75j/5rrV5/say0b3HOnwm2KAoVrJGYTp1OCyi3A3pL2IgWT44H3lfkG9a5JKDqx1TvhlfUeFWXmq6x91Tt5lxUIHCDMOktHdNQDSDoW+HvSkOILIuLsonV9Rb2ZWfN6qaOeiLgauHqi82FmZsW6pqPezMwmnoOKmZmVxkHFzMxK46BiZmal6ZjRX82Q9Bjw4Bh2sQvweEnZ6TQue+/q5fL3ctlhU/n3jIjmrx6v0pVBZawkDY11WF2nctl7s+zQ2+Xv5bJDueV385eZmZXGQcXMzErjoFLbuROdgQnksveuXi5/L5cdSiy/+1TMzKw0rqmYmVlpHFTMzKw0PRFUJF0g6VFJd1WlHSDpx5LulPRvkn4jp58g6faqxwuSXpuXHZTXXybpq1LBXazaSJNlnyxpQU6/t3Lb5rzsaElLc9nPmIiyjEaT5d9G0oU5/Q5Jh1dt04mf/R6SrpF0j6S7JZ2W03eStFjSffnvjjlduWzLJP1U0oFV+5qb179P0tyJKlOjRlH2ffN34llJfzliXx333R9F+U/In/mdkm6UdEDVvporf0R0/QN4I3AgcFdV2i3Am/Lzk4DP1NhuBnB/1eufAIeSbq3+n8AxE122MstOukfNpfn5tsByYDrpdgP3A68CtgHuAPab6LK1oPwfBi7Mz38TWAJs1cGf/W7Agfn5K4CfA/sBXwTOyOlnAF/Iz4/NZVMu6805fSfggfx3x/x8x4kuX8ll/03gdcDZwF9W7acjv/ujKP8bKp8pcEzVZ990+XuiphIRPwJWj0h+NfCj/Hwx8K4am74XuBRA0m7Ab0TETZGO9sXAnNbkuDxNlj2A7SRtDfQBzwFPAQcDyyLigYh4jnRMZrc672Vosvz7AT/M2z0KrAUGO/izXxURt+bnTwP3AgOkz25BXm0Bm8oyG7g4kpuA/lz2WcDiiFgdEWtIx+zocSxK05ote0Q8GhG3ABtG7Kojv/ujKP+N+bMFuIl0d10YRfl7IqgUuJtNB+cP2Px2xRXvAb6Tnw8AK6uWrcxpnaio7JcBzwCrgIeAL0XEalI5V1Rt38llh+Ly3wG8U9LWSncZPSgv6/jPXtJ0YCZwM7BrRKzKix4Bds3Piz7njv78Gyx7kY4uO4yq/CeTaqwwivL3clA5CfiQpCWk6uFz1QslHQKsi4i7am3c4YrKfjDwPDAV2Av4mKRXTUwWW6qo/BeQ/mmGSHcZvZF0PDqapO2By4GPRsRT1ctyzatrryvo5bJD8+WXdAQpqHxitO/ZMXd+LFtE/AwfzJjqAAADXklEQVQ4CkDSq4G3jVjleDbVUgCG2VQlJD8fbmUeW6VO2d8HfC8iNgCPSroBGCT9UqmuyXVs2aG4/BGxEfjzynqSbiS1Ra+hQz97SZNJJ5VvR8QVOfmXknaLiFW5eevRnD5M7c95GDh8RPq1rcx3GZose5GiY9L2mi2/pN8BziP1Fz6Rk5suf8/WVCT9Zv67FfDXwD9ULdsKeDe5PwVSGyXwlKRD88ifDwBXjmumS1Kn7A8BR+Zl25E6a39G6tjeW9JekrYhBdyrxjvfZSkqv6Rtc7mR9FZgY0Tc06mffc7r+cC9EfF3VYuuAiojuOayqSxXAR/Io8AOBZ7MZV8EHCVpxzxa6Kic1rZGUfYiHfndb7b8kqYBVwAnRsTPq9ZvvvwTPUphPB6kGscqUifcSlL17jTSr9CfA58nzy6Q1z8cuKnGfgaBu0ijIb5WvU27PpopO7A98C+kPod7gNOr9nNsXv9+4JMTXa4WlX86sJTUqflfpGnAO/mz/11S88ZPgdvz41hgZ+AHwH25nDvl9QX8/1zGO4HBqn2dBCzLjw9OdNlaUPZX5u/HU6QBGitJgzM68rs/ivKfR6qRV9YdqtpXU+X3NC1mZlaanm3+MjOz8jmomJlZaRxUzMysNA4qZmZWGgcVMzMrjYOKmZmVxkHFrE1JmjTReTBrloOKWQkkfVrSR6teny3pNEmnS7ol36viU1XLF0paku91cUpV+q8k/a2kO4DXj3MxzMbMQcWsHBeQpm+pTP9yPGkW2L1JE3W+FjhI0hvz+idFxEGkK/VPlbRzTt+OdC+LAyLi+vEsgFkZenZCSbMyRcRySU9ImkmaTvw20k2fjsrPIU2DszfpXi6nSvrfOX2PnP4EaVbky8cz72ZlclAxK895wB+S5pG6AHgzMD8ivlm9ktJtit8CvD4i1km6Fnh5XvzriOj46fatd7n5y6w8/0q6I+LrSLP4LgJOyve0QNJAniF5B2BNDij7kmaDNusKrqmYlSQinpN0DbA21za+L+m3gR+nmcj5FfB+4HvAn0i6lzQr8k0TlWezsnmWYrOS5A76W4E/iIj7Jjo/ZhPBzV9mJZC0H+leIz9wQLFe5pqKmZmVxjUVMzMrjYOKmZmVxkHFzMxK46BiZmalcVAxM7PS/A+tcklXZoEF5QAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Get year and publications as list\n", "year = idsPerYear.select(\"year\").collect()\n", "publications = idsPerYear.select(\"count\").collect()\n", "\n", "# Make scatter plot with matplotlib\n", "plt.scatter(year, publications)\n", "plt.xlabel(\"year\")\n", "plt.ylabel(\"papers\")\n", "plt.title(\"Growth of papers in PubMed each year\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Terminate Spark" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "spark.stop()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }