mmtfPyspark.datasets.uniProt module

uniProt.py

This module downloads and reads UniProt sequence files in the FASTA format and converts them to datasets. It reads the following files:

- SWISS_PROT
- TREMBL
- UNIREF50
- UNIREF90
- UNIREF100
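To illustrate what turning a FASTA file into sequence-plus-metadata records can look like, here is a minimal pure-Python sketch. It is not this module's implementation: the helper names `parse_header` and `read_fasta` are hypothetical, the record keys are chosen for illustration only, and the sample entries are made up. It assumes the UniProtKB header layout used by SWISS_PROT and TREMBL (`>db|accession|entry_name description OS=... OX=... GN=... PE=... SV=...`); UniRef headers follow a different layout and are rejected by this sketch.

```python
import re
from typing import Dict, Iterable, Iterator, List, Optional

# Assumed UniProtKB header layout (hypothetical helper, not part of mmtfPyspark):
# >db|accession|entry_name description OS=... OX=... GN=... PE=... SV=...
_HEADER = re.compile(
    r"^>(?P<db>\w+)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s*(?P<rest>.*)$"
)
# Splits the free-text part of the header at each KEY= marker, keeping the key.
_FIELD = re.compile(r"\b(OS|OX|GN|PE|SV)=")


def parse_header(line: str) -> Dict[str, str]:
    """Split one UniProtKB-style FASTA header line into named fields."""
    match = _HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"not a UniProtKB-style header: {line!r}")
    record = {
        "db": match.group("db"),
        "accession": match.group("accession"),
        "entry_name": match.group("entry_name"),
    }
    # parts[0] is the description; after that, keys and values alternate.
    parts = _FIELD.split(match.group("rest"))
    record["description"] = parts[0].strip()
    for key, value in zip(parts[1::2], parts[2::2]):
        record[key] = value.strip()
    return record


def read_fasta(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one record per FASTA entry, joining wrapped sequence lines."""
    record: Optional[Dict[str, str]] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record is not None:
                record["sequence"] = "".join(chunks)
                yield record
            record = parse_header(line)
            chunks = []
        elif record is not None:
            chunks.append(line)
    if record is not None:
        record["sequence"] = "".join(chunks)
        yield record


# Made-up entries, for illustration only.
EXAMPLE = """\
>sp|P00000|DEMO_HUMAN Demo protein OS=Homo sapiens OX=9606 GN=DEMO PE=1 SV=1
MKTAYIAKQR
QISFVKSHFS
>tr|A0A000|A0A000_DEMO Another demo protein OS=Mus musculus OX=10090 PE=4 SV=1
MSTNPKPQRK
"""

if __name__ == "__main__":
    for rec in read_fasta(EXAMPLE.splitlines()):
        print(rec["accession"], rec["OS"], len(rec["sequence"]))
```

Each yielded dictionary corresponds to one row of a tabular dataset, with one key per column. Optional header fields such as `GN` are simply absent from the record when the header omits them.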

References

Examples

Download, read, and save the SWISS_PROT dataset:

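>>> from mmtfPyspark.datasets import uniProt
>>> ds = uniProt.get_dataset(UniProtDataset.SWISS_PROT)
>>> ds.printSchema()
>>> ds.show(5)
>>> # In PySpark, DataFrame.write is a property, not a method
>>> ds.write.mode("overwrite").format("parquet").save(fileName)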

get_dataset(UniProtDataset)

Returns the specified UniProt dataset.

Parameters:

UniProtDataset : str

Name of the UniProt dataset to retrieve: SWISS_PROT, TREMBL, UNIREF50, UNIREF90, or UNIREF100

Returns:

dataset

dataset with sequence and metadata