This demo shows how to access the open DrugBank dataset, which contains drug identifiers and names that can be used to integrate DrugBank with other data resources.
Reference: Wishart DS, et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2017 Nov 8. doi:10.1093/nar/gkx1037.
In [2]:
from pyspark.sql import SparkSession
from mmtfPyspark.datasets import drugBankDataset
In [4]:
spark = (SparkSession.builder
         .master("local[*]")
         .appName("DrugBankDemo")
         .getOrCreate())
In [8]:
openDrugLinks = drugBankDataset.get_open_drug_links()
openDrugLinks.columns
Out[8]:
['DrugBankID',
'AccessionNumbers',
'Commonname',
'CAS',
'UNII',
'Synonyms',
'StandardInChIKey']
In [9]:
# Keep only entries that have a standard InChIKey
openDrugLinks = openDrugLinks.filter("StandardInChIKey IS NOT NULL")
In [10]:
openDrugLinks.select("DrugBankID", "Commonname", "CAS", "StandardInChIKey").show()
+----------+--------------------+-----------+--------------------+
|DrugBankID|          Commonname|        CAS|    StandardInChIKey|
+----------+--------------------+-----------+--------------------+
|   DB00006|         Bivalirudin|128270-60-0|OIRCOABEOLEUMC-GE...|
|   DB00014|           Goserelin| 65807-02-5|BLCLNMBMMGCOAS-UR...|
|   DB00027|        Gramicidin D|  1405-97-6|NDAYQJDHGXTBJL-MW...|
|   DB00035|        Desmopressin| 16679-58-6|NFLWUMRGJYTJIN-NX...|
|   DB00050|          Cetrorelix|120287-85-6|SBNPWPIBESPSIF-MH...|
|   DB00080|          Daptomycin|103060-53-3|DOAKLVKFURWEDJ-RW...|
|   DB00091|        Cyclosporine| 59865-13-3|PMATZTZNYRCHOR-CG...|
|   DB00093|         Felypressin|    56-59-7|SFKQVVDKFKYTNA-DZ...|
|   DB00104|          Octreotide| 83150-76-9|DEQANNDTNATYII-OU...|
|   DB00106|            Abarelix|183552-38-7|AIWRTTMUVOZGPW-HS...|
|   DB00114| Pyridoxal Phosphate|    54-47-7|NGVDGCNFYWLIFO-UH...|
|   DB00115|      Cyanocobalamin|    68-19-9|RMRCNWBMXRMIRW-WZ...|
|   DB00116|Tetrahydrofolic acid|   135-16-0|MSTNYGQPCMXVAQ-KI...|
|   DB00117|           Histidine|    71-00-1|HNDVDQJCIGZPNO-YF...|
|   DB00118|        Ademetionine| 29908-03-0|MEFKEPWMEQBLKI-AI...|
|   DB00119|        Pyruvic acid|   127-17-3|LCTONWCANYUPML-UH...|
|   DB00120|     L-Phenylalanine|    63-91-2|COLNVLDHVKWLRT-QM...|
|   DB00121|              Biotin|    58-85-5|YBJHBAHKTGYVGT-ZK...|
|   DB00122|             Choline|    62-49-7|OEYIOHPDSNJKLS-UH...|
|   DB00123|            L-Lysine|    56-87-1|KDXKERNSBIXSRK-YF...|
+----------+--------------------+-----------+--------------------+
only showing top 20 rows
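The identifier columns are what make integration with other resources possible. As a minimal sketch of the idea, written in plain Python rather than Spark, the snippet below matches a few rows from the output above against a hypothetical external table keyed by CAS number. The helper `join_on_cas` and the `external_by_cas` mapping are assumptions made for illustration and are not part of mmtfPyspark; the PubChem compound IDs are copied from the drug-links output shown later in this notebook.

```python
# Rows copied from the open-links output: (DrugBankID, Commonname, CAS).
open_links = [
    ("DB00121", "Biotin", "58-85-5"),
    ("DB00122", "Choline", "62-49-7"),
    ("DB00123", "L-Lysine", "56-87-1"),
]

# Hypothetical external resource: CAS number -> PubChem compound ID.
# Choline is left out on purpose to show what an inner join drops.
external_by_cas = {
    "58-85-5": 171548,
    "56-87-1": 5962,
}


def join_on_cas(links, lookup):
    """Inner join: keep only rows whose CAS number appears in the lookup."""
    return [
        (drugbank_id, name, cas, lookup[cas])
        for drugbank_id, name, cas in links
        if cas in lookup
    ]


joined = join_on_cas(open_links, external_by_cas)
for row in joined:
    print(row)
```

With Spark DataFrames the same matching would typically be expressed with `DataFrame.join` on the shared identifier column.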
The password-protected DrugBank datasets contain more information. You need to create a DrugBank account and supply your username and password to access these datasets.
In [13]:
username = "<your DrugBank account username>"
password = "<your DrugBank account password>"
drugLinks = drugBankDataset.get_drug_links("APPROVED", username, password)
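Typing credentials directly into a notebook cell makes it easy to share them by accident. One possible alternative, sketched below, reads them from environment variables and falls back to an interactive prompt. The helper `read_drugbank_credentials` and the variable names `DRUGBANK_USERNAME` and `DRUGBANK_PASSWORD` are assumptions made for this sketch; neither DrugBank nor mmtfPyspark defines them.

```python
import os
from getpass import getpass


def read_drugbank_credentials(env=None):
    """Return (username, password) for DrugBank.

    DRUGBANK_USERNAME and DRUGBANK_PASSWORD are hypothetical variable
    names chosen for this sketch.
    """
    env = os.environ if env is None else env
    # Prompt interactively only when a variable is missing or empty.
    username = env.get("DRUGBANK_USERNAME") or input("DrugBank username: ")
    password = env.get("DRUGBANK_PASSWORD") or getpass("DrugBank password: ")
    return username, password


# Demonstration with an explicit mapping so the sketch runs without prompting.
demo_env = {"DRUGBANK_USERNAME": "demo-user", "DRUGBANK_PASSWORD": "demo-pass"}
username, password = read_drugbank_credentials(demo_env)
print(username)
```

In a real session, calling `read_drugbank_credentials()` with no argument would read from the process environment, and the returned values could be passed to `drugBankDataset.get_drug_links("APPROVED", username, password)` as in the cell above.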
In [21]:
drugLinks.select("DrugBankID", "Name", "CASNumber", "Formula", "PubChemCompoundID",
                 "PubChemSubstanceID", "ChEBIID", "ChemSpiderID").show()
+----------+-------------------+-----------+---------------+-----------------+------------------+-------+------------+
|DrugBankID|               Name|  CASNumber|        Formula|PubChemCompoundID|PubChemSubstanceID|ChEBIID|ChemSpiderID|
+----------+-------------------+-----------+---------------+-----------------+------------------+-------+------------+
|   DB00006|        Bivalirudin|128270-60-0|  C98H138N24O33|         16129704|          46507415|  59173|    10482069|
|   DB00014|          Goserelin| 65807-02-5|   C59H84N18O14|          5311128|          46507336|   5523|     4470656|
|   DB00027|       Gramicidin D|  1405-97-6|  C96H135N19O16|         45267103|          46507412|   null|    24623445|
|   DB00035|       Desmopressin| 16679-58-6| C46H64N14O12S2|         16051933|          46507014|   4450|    10481973|
|   DB00050|         Cetrorelix|120287-85-6| C70H92ClN17O14|         25074887|          46505494|  59224|    10482082|
|   DB00067|        Vasopressin| 11000-17-2|           null|             null|          46505933|   null|        null|
|   DB00080|         Daptomycin|103060-53-3|  C72H101N17O26|         16134395|          46504551| 600103|    10482098|
|   DB00091|       Cyclosporine| 59865-13-3|  C62H111N11O12|          5284373|          46508198|   4031|     4447449|
|   DB00093|        Felypressin|    56-59-7| C46H65N13O11S2|         14257662|          46507522|  60564|    16736539|
|   DB00104|         Octreotide| 83150-76-9| C49H66N10O10S2|           448601|          46504600|   null|      395352|
|   DB00106|           Abarelix|183552-38-7| C72H95ClN14O14|         16131215|          46508237| 337298|    10482301|
|   DB00114|Pyridoxal Phosphate|    54-47-7|      C8H10NO6P|             1051|          46506428|  18405|        1022|
|   DB00115|     Cyanocobalamin|    68-19-9|C63H88CoN14O14P|         70678590|          46509031|  17439|    21864832|
|   DB00117|          Histidine|    71-00-1|       C6H9N3O2|             6274|          46507001|  15971|        6038|
|   DB00118|       Ademetionine| 29908-03-0|    C15H22N6O5S|            34755|          46505280|  67040|       31982|
|   DB00119|       Pyruvic acid|   127-17-3|         C3H4O3|             1060|          46505692|  32816|        1031|
|   DB00120|    L-Phenylalanine|    63-91-2|       C9H11NO2|             6140|          46505708|  17295|        5910|
|   DB00121|             Biotin|    58-85-5|    C10H16N2O3S|           171548|          46508694|  15956|      149962|
|   DB00122|            Choline|    62-49-7|        C5H14NO|              305|          46508132|  15354|         299|
|   DB00123|           L-Lysine|    56-87-1|      C6H14N2O2|             5962|          46504770|  18019|        5747|
+----------+-------------------+-----------+---------------+-----------------+------------------+-------+------------+
only showing top 20 rows
In [7]:
spark.stop()