# Keyword Search Demo

![pdbj](https://pdbj.org/content/default.svg)

This demo runs a keyword search query against the PDBj Mine 2 relational database (RDB) and uses the returned PDB IDs (pdbid) to filter MMTF structures.
The filter searches the 'keyword' column in the brief_summary table for a keyword and returns the selected columns for the matching entries.

[PDBj Mine Search Website](https://pdbj.org/mine)

## Imports

In [17]:
from pyspark import SparkConf, SparkContext
from mmtfPyspark.webfilters import PdbjMineSearch
from mmtfPyspark.datasets import pdbjMineDataset
from mmtfPyspark.io import mmtfReader

## Configure Spark Context

In [18]:
conf = SparkConf().setMaster("local[*]") \
                  .setAppName("keywordSearch")
sc = SparkContext(conf=conf)

## Read in MMTF files from local directory

In [20]:
path = "../../resources/mmtf_reduced_sample/"

pdb = mmtfReader.read_sequence_file(path, sc)

## Apply a SQL search on PDBj using a filter

In [21]:
sql = "select pdbid from keyword_search('porin')"

pdb = pdb.filter(PdbjMineSearch(sql))
print(pdb.keys().collect())
print("\n")
print(f"Number of entries matching query: {pdb.count()}")

['3M8B', '3M8D', '4AUI', '4PR7', '4RJW', '4RJX', '4RLC', '2L26', '1R1M', '3UPG', '3UU2', '1BT9', '2WVP', '2X9K', '3VY8', '3VY9', '3VZT', '3VZU', '3VZW', '2O4V', '2ODJ', '2OMF', '1H6S', '3SY7', '3SY9', '3SYB', '3SYS', '3SZD', '3SZV', '3T0S', '3T20', '3T24', '3JBU', '4JML', '4K34', '4K7K', '4K7R', '1NQE', '1NQG', '1NQH', '4MJT', '4MKO', '5NIK', '5NIL', '1E54', '3FCG', '5O8O', '2D57', '2GUF', '1A0S', '1A0T', '1AF6', '5DL6', '5DL7', '5DL8', '3JTY', '3K19', '3K1B', '3FIP', '3FMO', '3FMP', '3FYX', '4D65', '2KNS', '2KS4', '2KSM', '1GFM', '1GFN', '1GFO', '1GFP', '1GFQ', '2K0L', '2YSU', '3NB3', '5LDT', '5LDV', '2VDA', '2VDD', '2VDE', '2VQG', '2VQH', '2VQI', '2VQK', '2VQL', '2MLH', '3TZG', '2GSK', '5U1H', '4BUM', '2BR3', '2BR4', '2BR5', '2BRR', '2XE1', '2XE2', '2XE3', '2XE5', '2XET', '2XG6', '2XMN', '2PV1', '2PV2', '2PV3', '3PGR', '3PGS', '3PGU', '3PIK', '3POQ', '3POR', '3POU', '3POX', '3PRN', '4MKQ', '2ZZ9', '3A2S', '3L48', '2BM8', '2BM9', '3EMN', '2V9U', '2LBT', '2LCA', '2WJQ', '2WJR', '2WMZ',
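
Conceptually, the filter step keeps only those entries whose key appears among the PDB IDs returned by the SQL query. The following is a minimal pure-Python sketch of that idea and not the mmtfPyspark implementation: `filter_by_pdb_ids`, the sample entries, and the sample query result are all hypothetical, and the case-insensitive comparison is an assumption made for illustration.

```python
def filter_by_pdb_ids(entries, matching_ids):
    """Keep (key, value) pairs whose key is in matching_ids (case-insensitive)."""
    # Normalise the IDs once so the membership test is a cheap set lookup.
    wanted = {pdb_id.upper() for pdb_id in matching_ids}
    return [(key, value) for key, value in entries if key.upper() in wanted]


# Hypothetical stand-ins for the (structure ID, structure) pairs and the query result.
entries = [("1GFM", "structure A"), ("4HHB", "structure B"), ("2OMF", "structure C")]
query_result = ["1gfm", "2omf", "3por"]

filtered = filter_by_pdb_ids(entries, query_result)
print([key for key, _ in filtered])
```

IDs that the query returns but that are absent from the local sample (here `3por`) simply produce no match, which is why the filtered count can be smaller than the number of hits on the PDBj side.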

## Apply a SQL search on PDBj and get a dataset

In [14]:
sql = "select pdbid, resolution, biol_species, db_uniprot, db_pfam, hit_score from keyword_search('porin') order by hit_score desc"

dataset = pdbjMineDataset.get_dataset(sql)
dataset.show(10)

+-----------+----------+--------------------+--------------------+-----------+---------+
|structureId|resolution|        biol_species|          db_uniprot|    db_pfam|hit_score|
+-----------+----------+--------------------+--------------------+-----------+---------+
|       3POR|       2.5|Rhodobacter capsu...|['P31243', 'PORI_...|['PF13609']| 0.095809|
|       2OMF|       2.4|Escherichia coli K12|['OMPF_ECOLI', 'P...|['PF00267']|0.0954989|
|       2POR|       1.8|Rhodobacter capsu...|['P31243', 'PORI_...|['PF13609']|0.0951392|
|       1GFP|       2.7|    Escherichia coli|['OMPF_ECOLI', 'P...|['PF00267']| 0.094717|
|       1GFQ|       2.8|    Escherichia coli|['OMPF_ECOLI', 'P...|['PF00267']| 0.094717|
|       1GFM|       3.5|    Escherichia coli|['OMPF_ECOLI', 'P...|['PF00267']| 0.094717|
|       1GFN|       3.1|    Escherichia coli|['OMPF_ECOLI', 'P...|         []| 0.094717|
|       1GFO|       3.3|    Escherichia coli|['OMPF_ECOLI', 'P...|['PF00267']| 0.094717|
|       1BT9|       3
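
The two queries in this demo differ only in the selected columns and the ordering clause, so the SQL string can be assembled from those parts. The helper below is a small sketch under that assumption: `build_keyword_query` is a hypothetical name, not part of mmtfPyspark or PDBj, and doubling single quotes is only a basic safeguard for the keyword literal.

```python
def build_keyword_query(keyword, columns=("pdbid",), order_by=None):
    """Assemble a keyword_search query string like the ones used in this demo."""
    # Double any single quote so the keyword stays inside the SQL string literal.
    escaped = keyword.replace("'", "''")
    sql = f"select {', '.join(columns)} from keyword_search('{escaped}')"
    if order_by:
        sql += f" order by {order_by}"
    return sql


sql = build_keyword_query(
    "porin",
    columns=("pdbid", "resolution", "hit_score"),
    order_by="hit_score desc",
)
print(sql)
```

The resulting string can be passed wherever the demo passes a hand-written `sql` value.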

## Terminate Spark Context

In [15]:
sc.stop()