mmtfPyspark.ml.proteinSequenceEncoder module

proteinSequenceEncoder.py

This class encodes a protein sequence into a feature vector. The protein sequence must be present in the input data set; the default name of the input column is “sequence”. The default column name for the feature vector is “features”.

class ProteinSequenceEncoder(data=None, inputCol='sequence', outputCol='features')

Bases: object

Attributes

data	(DataFrame) input data to be encoded [None]
inputCol	(str) name of the input column [sequence]
outputCol	(str) name of the output column [features]

Methods

`blosum62_encode`([data, inputCol, outputCol])	Encodes a protein sequence using the Blosum62 substitution matrix.
`get_word2vec_model`()	Returns a Word2VecModel created by overlapping_ngram_word2vec_encode()
`one_hot_encode`([data, inputCol, outputCol])	One-hot encodes a protein sequence.
`overlapping_ngram_word2vec_encode`([data, …])	Encodes a protein sequence by converting it into n-grams and then transforming it into a Word2Vec feature vector.
`property_encode`([data, inputCol, outputCol])	Encodes a protein sequence by 7 physicochemical properties
`shifted_3gram_word2vec_encode`([data, …])	Encodes a protein sequence as three sets of shifted, non-overlapping 3-grams, trains a Word2Vec model on the 3-grams, and then averages the three resulting feature vectors.

AMINO_ACIDS21 = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y']

blosum62 = {
    'A': [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    'C': [0, -3, -3, -3, 9, -3, -4, -3, -3, -1, -1, -3, -1, -2, -3, -1, -1, -2, -2, -1],
    'D': [-2, -2, 1, 6, -3, 0, 2, -1, -1, -3, -4, -1, -3, -3, -1, 0, -1, -4, -3, -3],
    'E': [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3, -3, 1, -2, -3, -1, 0, -1, -3, -2, -2],
    'F': [-2, -3, -3, -3, -2, -3, -3, -3, -1, 0, 0, -3, 0, 6, -4, -2, -2, 1, 3, -1],
    'G': [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    'H': [-2, 0, 1, -1, -3, 0, 0, -2, 8, -3, -3, -1, -2, -1, -2, -1, -2, -2, 2, -3],
    'I': [-1, -3, -3, -3, -1, -3, -3, -4, -3, 4, 2, -3, 1, 0, -3, -2, -1, -3, -1, 3],
    'K': [-1, 2, 0, -1, -3, 1, 1, -2, -1, -3, -2, 5, -1, -3, -1, 0, -1, -3, -2, -2],
    'L': [-1, -2, -3, -4, -1, -2, -3, -4, -3, 2, 4, -2, 2, 0, -3, -2, -1, -2, -1, 1],
    'M': [-1, -1, -2, -3, -1, 0, -2, -3, -2, 1, 2, -1, 5, 0, -2, -1, -1, -1, -1, 1],
    'N': [-2, 0, 6, 1, -3, 0, 0, 0, 1, -3, -3, 0, -2, -3, -2, 1, 0, -4, -2, -3],
    'P': [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    'Q': [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    'R': [-1, 5, 0, -2, -3, 1, 0, -2, 0, -3, -2, 2, -1, -3, -2, -1, -1, -3, -2, -3],
    'S': [1, -1, 1, 0, -1, 0, 0, 0, -1, -2, -2, 0, -1, -2, -1, 4, 1, -3, -2, -2],
    'T': [0, -1, 0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1, 1, 5, -2, -2, 0],
    'V': [0, -3, -3, -3, -1, -2, -2, -3, -3, 3, 1, -2, 1, -1, -2, -2, 0, -3, -1, 4],
    'W': [-3, -3, -4, -4, -2, -2, -3, -2, -2, -3, -2, -3, -1, 1, -4, -3, -2, 11, 2, -3],
    'X': [-4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4],
    'Y': [-2, -2, -2, -3, -2, -1, -2, -3, 2, -1, -1, -2, -1, 3, -3, -2, -2, 2, 7, -1]
}

blosum62_encode(data=None, inputCol=None, outputCol=None)

Encodes a protein sequence using the Blosum62 substitution matrix.

Parameters:

data : DataFrame

input data to be encoded [None]

inputCol : str

name of the input column [None]

outputCol : str

name of the output column [None]

Returns:

dataset

dataset with feature vector appended

References

Blosum Matrix https://ftp.ncbi.nih.gov/repository/blocks/unix/blosum/BLOSUM/blosum62.blast.new
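As a rough illustration of what this encoding produces, the pure-Python sketch below concatenates one Blosum62 row (20 values) per residue. It is not the library's implementation: the name `blosum62_encode_sketch`, the three-row subset of the table, and the fallback to the 'X' row for unlisted residues are assumptions made for this example, and it returns a plain list rather than a Spark feature vector.

```python
# Three rows copied from the blosum62 attribute above; the remaining
# residues are omitted here only to keep the sketch short.
BLOSUM62_SUBSET = {
    'A': [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    'C': [0, -3, -3, -3, 9, -3, -4, -3, -3, -1, -1, -3, -1, -2, -3, -1, -1, -2, -2, -1],
    'X': [-4] * 20,
}


def blosum62_encode_sketch(sequence, table=BLOSUM62_SUBSET):
    """Concatenate one 20-element row per residue (hypothetical helper)."""
    features = []
    for residue in sequence:
        # Assumption: residues missing from the table are treated as 'X'.
        features.extend(table.get(residue, table['X']))
    return features


encoded = blosum62_encode_sketch("ACB")
print(len(encoded))  # 3 residues * 20 values per residue
```

Under these assumptions the feature length grows linearly with the sequence length, so sequences of different lengths yield vectors of different sizes.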

get_word2vec_model()

Returns a Word2VecModel created by overlapping_ngram_word2vec_encode()

Returns:

model

overlapping Ngram Word2VecModel if available, otherwise None

model = None

one_hot_encode(data=None, inputCol=None, outputCol=None)

One-hot encodes a protein sequence. The one-hot encoding encodes the 20 natural amino acids, plus X for any other residue for a total of 21 elements per residue.

Parameters:

data : DataFrame

input data to be encoded [None]

inputCol : str

name of the input column [None]

outputCol : str

name of the output column [None]
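The one-hot scheme described for this method (21 elements per residue, with X standing in for any other residue) can be sketched in plain Python as follows. This is a minimal illustration, not the library's code: `one_hot_encode_sketch` is a hypothetical name, and the function returns a flat list instead of a Spark vector.

```python
# Same ordering as the AMINO_ACIDS21 attribute listed above.
AMINO_ACIDS21 = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M',
                 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y']


def one_hot_encode_sketch(sequence):
    """Return a flat 0/1 list with 21 slots per residue (hypothetical helper)."""
    width = len(AMINO_ACIDS21)
    values = [0] * (width * len(sequence))
    for position, residue in enumerate(sequence):
        # Any residue outside the 21-letter alphabet is mapped to 'X'.
        symbol = residue if residue in AMINO_ACIDS21 else 'X'
        values[position * width + AMINO_ACIDS21.index(symbol)] = 1
    return values


encoded = one_hot_encode_sketch("ACB")
print(len(encoded), sum(encoded))  # 63 slots in total, one set per residue
```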

overlapping_ngram_word2vec_encode(data=None, inputCol=None, outputCol=None, n=None, windowSize=None, vectorSize=None, fileName=None, sc=None)

Encodes a protein sequence by converting it into n-grams and then transforming it into a Word2Vec feature vector.

If a Word2Vec model file name is given, this function instead converts the protein sequence into n-grams and transforms them using the pre-trained Word2Vec model read from that file.

Parameters:

data : DataFrame

input data to be encoded [None]

inputCol : str

name of the input column [None]

outputCol : str

name of the output column [None]

n : int

The number of words in an n-gram [None]

windowSize : int

width of the window used to slide across the sequence; context words range from -window to +window [None]

vectorSize : int

dimension of the feature vector [None]

fileName : str

filename of Word2Vec model [None]

sc : SparkContext

spark context [None]

Returns:

dataset

dataset with features vector added to original dataset
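The n-gram step that precedes the Word2Vec transformation can be illustrated without Spark. The sketch below only shows how a sequence might be split into overlapping n-grams; `overlapping_ngrams` is a hypothetical helper, and training or loading the Word2Vec model is deliberately left out.

```python
def overlapping_ngrams(sequence, n):
    """Slide a window of width n over the sequence, one residue at a time."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


print(overlapping_ngrams("IDCGH", 2))  # ['ID', 'DC', 'CG', 'GH']
```

Each n-gram then plays the role of a "word" for the Word2Vec model, and a sequence shorter than n yields no n-grams at all.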

properties = {
    'A': [1.28, 0.05, 1.0, 0.31, 6.11, 0.42, 0.23],
    'C': [1.77, 0.13, 2.43, 1.54, 6.35, 0.17, 0.41],
    'D': [1.6, 0.11, 2.78, -0.77, 2.95, 0.25, 0.2],
    'E': [1.56, 0.15, 3.78, -0.64, 3.09, 0.42, 0.21],
    'F': [2.94, 0.29, 5.89, 1.79, 5.67, 0.3, 0.38],
    'G': [0.0, 0.0, 0.0, 0.0, 6.07, 0.13, 0.15],
    'H': [2.99, 0.23, 4.66, 0.13, 7.69, 0.27, 0.3],
    'I': [4.19, 0.19, 4.0, 1.8, 6.04, 0.3, 0.45],
    'K': [1.89, 0.22, 4.77, -0.99, 9.99, 0.32, 0.27],
    'L': [2.59, 0.19, 4.0, 1.7, 6.04, 0.39, 0.31],
    'M': [2.35, 0.22, 4.43, 1.23, 5.71, 0.38, 0.32],
    'N': [1.6, 0.13, 2.95, -0.6, 6.52, 0.21, 0.22],
    'P': [2.67, 0.0, 2.72, 0.72, 6.8, 0.13, 0.34],
    'Q': [1.56, 0.18, 3.95, -0.22, 5.65, 0.36, 0.25],
    'R': [2.34, 0.29, 6.13, -1.01, 10.74, 0.36, 0.25],
    'S': [1.31, 0.06, 1.6, -0.04, 5.7, 0.2, 0.28],
    'T': [3.03, 0.11, 2.6, 0.26, 5.6, 0.21, 0.36],
    'V': [3.67, 0.14, 3.0, 1.22, 6.02, 0.27, 0.49],
    'W': [3.21, 0.41, 8.08, 2.25, 5.94, 0.32, 0.42],
    'X': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    'Y': [2.94, 0.3, 6.47, 0.96, 5.66, 0.25, 0.41]
}

property_encode(data=None, inputCol=None, outputCol=None)

Encodes a protein sequence by 7 physicochemical properties

Parameters:

data : DataFrame

input data to be encoded [None]

inputCol : str

name of the input column [None]

outputCol : str

name of the output column [None]

Returns:

dataset

dataset with feature vector appended

References

Meiler, J., Müller, M., Zeidler, A. et al. J Mol Model (2001) https://link.springer.com/article/10.1007/s008940100038
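To make the shape of the property encoding concrete, here is a minimal pure-Python sketch that concatenates the 7 property values of each residue. It is an illustration under stated assumptions rather than the library's implementation: `property_encode_sketch` is a hypothetical name, only three rows of the properties attribute are reproduced, and unlisted residues are assumed to fall back to the all-zero 'X' row.

```python
# Three rows copied from the properties attribute above; the other
# residues are left out only to keep the sketch short.
PROPERTIES_SUBSET = {
    'A': [1.28, 0.05, 1.0, 0.31, 6.11, 0.42, 0.23],
    'G': [0.0, 0.0, 0.0, 0.0, 6.07, 0.13, 0.15],
    'X': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}


def property_encode_sketch(sequence, table=PROPERTIES_SUBSET):
    """Concatenate 7 property values per residue (hypothetical helper)."""
    features = []
    for residue in sequence:
        # Assumption: residues missing from the table are treated as 'X'.
        features.extend(table.get(residue, table['X']))
    return features


encoded = property_encode_sketch("AG")
print(len(encoded))  # 2 residues * 7 properties per residue
```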

shifted_3gram_word2vec_encode(data=None, inputCol=None, outputCol=None, windowSize=None, vectorSize=None, fileName=None, sc=None)

Encodes a protein sequence as three sets of shifted, non-overlapping 3-grams, trains a Word2Vec model on the 3-grams, and then averages the three resulting feature vectors.

Parameters:

data : DataFrame

input data to be encoded [None]

inputCol : str

name of the input column [None]

outputCol : str

name of the output column [None]

windowSize : int

width of the window used to slide across the sequence; context words range from -window to +window [None]

vectorSize : int

dimension of the feature vector [None]

fileName : str

filename of Word2VecModel [None]

sc : SparkContext

spark context [None]

Returns:

dataset

dataset with features vector added to original dataset

References

Asgari E, Mofrad MRK (2015) Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLOS ONE 10(11): e0141287. doi: https://doi.org/10.1371/journal.pone.0141287
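The splitting and averaging steps of this method can be sketched in plain Python. The helpers `shifted_ngrams` and `average_vectors` are hypothetical names introduced for this illustration, the Word2Vec training itself is omitted, and the sketch assumes that incomplete trailing n-grams are dropped.

```python
def shifted_ngrams(sequence, n=3, shift=0):
    """Split the sequence into non-overlapping n-grams starting at `shift`."""
    shifted = sequence[shift:]
    # Assumption: a trailing fragment shorter than n is discarded.
    usable = len(shifted) - len(shifted) % n
    return [shifted[i:i + n] for i in range(0, usable, n)]


def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


sequence = "IDCGHTVEDQR"
for shift in range(3):
    print(shift, shifted_ngrams(sequence, 3, shift))
# 0 ['IDC', 'GHT', 'VED']
# 1 ['DCG', 'HTV', 'EDQ']
# 2 ['CGH', 'TVE', 'DQR']

# Placeholder vectors standing in for the three Word2Vec outputs.
print(average_vectors([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))  # [3.0, 4.0]
```

In this reading, each of the three shifted 3-gram lists would be transformed separately, and the final feature vector would be the mean of the three results.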