mmtfPyspark.ml.proteinSequenceEncoder module

proteinSequenceEncoder.py

This class encodes a protein sequence into a feature vector. The protein sequence must be present in the input data set; the default name of the input column is “sequence”. The default name of the output column that holds the feature vector is “features”.

class ProteinSequenceEncoder(data=None, inputCol='sequence', outputCol='features')

Bases: object

This class encodes a protein sequence into a feature vector. The protein sequence must be present in the input data set; the default name of the input column is “sequence”. The default name of the output column that holds the feature vector is “features”.

Attributes

data (DataFrame) input data to be encoded [None]
inputCol (str) name of the input column [sequence]
outputCol (str) name of the output column [features]

Methods

blosum62_encode([data, inputCol, outputCol]) Encodes a protein sequence using the BLOSUM62 substitution matrix (20 scores per residue).
get_word2vec_model() Returns a Word2VecModel created by overlapping_ngram_word2vec_encode()
one_hot_encode([data, inputCol, outputCol]) One-hot encodes a protein sequence.
overlapping_ngram_word2vec_encode([data, …]) Encodes a protein sequence by converting it into n-grams and then transforming it into a Word2Vec feature vector.
property_encode([data, inputCol, outputCol]) Encodes a protein sequence by 7 physicochemical properties
shifted_3gram_word2vec_encode([data, …]) Encodes a protein sequence as three non-overlapping 3-grams, trains a Word2Vec model on the 3-grams, and then averages the three resulting feature vectors.
AMINO_ACIDS21 = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y']
blosum62 = {'A': [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0], 'C': [0, -3, -3, -3, 9, -3, -4, -3, -3, -1, -1, -3, -1, -2, -3, -1, -1, -2, -2, -1], 'D': [-2, -2, 1, 6, -3, 0, 2, -1, -1, -3, -4, -1, -3, -3, -1, 0, -1, -4, -3, -3], 'E': [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3, -3, 1, -2, -3, -1, 0, -1, -3, -2, -2], 'F': [-2, -3, -3, -3, -2, -3, -3, -3, -1, 0, 0, -3, 0, 6, -4, -2, -2, 1, 3, -1], 'G': [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3], 'H': [-2, 0, 1, -1, -3, 0, 0, -2, 8, -3, -3, -1, -2, -1, -2, -1, -2, -2, 2, -3], 'I': [-1, -3, -3, -3, -1, -3, -3, -4, -3, 4, 2, -3, 1, 0, -3, -2, -1, -3, -1, 3], 'K': [-1, 2, 0, -1, -3, 1, 1, -2, -1, -3, -2, 5, -1, -3, -1, 0, -1, -3, -2, -2], 'L': [-1, -2, -3, -4, -1, -2, -3, -4, -3, 2, 4, -2, 2, 0, -3, -2, -1, -2, -1, 1], 'M': [-1, -1, -2, -3, -1, 0, -2, -3, -2, 1, 2, -1, 5, 0, -2, -1, -1, -1, -1, 1], 'N': [-2, 0, 6, 1, -3, 0, 0, 0, 1, -3, -3, 0, -2, -3, -2, 1, 0, -4, -2, -3], 'P': [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2], 'Q': [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2], 'R': [-1, 5, 0, -2, -3, 1, 0, -2, 0, -3, -2, 2, -1, -3, -2, -1, -1, -3, -2, -3], 'S': [1, -1, 1, 0, -1, 0, 0, 0, -1, -2, -2, 0, -1, -2, -1, 4, 1, -3, -2, -2], 'T': [0, -1, 0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1, 1, 5, -2, -2, 0], 'V': [0, -3, -3, -3, -1, -2, -2, -3, -3, 3, 1, -2, 1, -1, -2, -2, 0, -3, -1, 4], 'W': [-3, -3, -4, -4, -2, -2, -3, -2, -2, -3, -2, -3, -1, 1, -4, -3, -2, 11, 2, -3], 'X': [-4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4], 'Y': [-2, -2, -2, -3, -2, -1, -2, -3, 2, -1, -1, -2, -1, 3, -3, -2, -2, 2, 7, -1]}
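blosum62_encode(data=None, inputCol=None, outputCol=None)

Encodes a protein sequence using the BLOSUM62 substitution matrix. Each residue is represented by the 20 substitution scores listed for it in the blosum62 table.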

Parameters:

data : DataFrame

input data to be encoded [None]

inputCol : str

name of the input column [None]

outputCol : str

name of the output column [None]

Returns:

dataset

dataset with feature vector appended

References

Blosum Matrix https://ftp.ncbi.nih.gov/repository/blocks/unix/blosum/BLOSUM/blosum62.blast.new
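The lookup that this encoding implies can be sketched in plain Python. This is a minimal illustration, not the mmtfPyspark implementation: the helper name blosum62_sketch is hypothetical, only three rows of the blosum62 table are reproduced, and both the mapping of unknown residues to the 'X' row and the concatenation of rows in sequence order are assumptions.

```python
# Illustrative sketch only; not the library's code.
# Three rows copied from the blosum62 table documented above.
BLOSUM62_SUBSET = {
    'A': [4, -1, -2, -2, 0, -1, -1, 0, -2, -1,
          -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    'C': [0, -3, -3, -3, 9, -3, -4, -3, -3, -1,
          -1, -3, -1, -2, -3, -1, -1, -2, -2, -1],
    'X': [-4] * 20,
}


def blosum62_sketch(sequence, table=BLOSUM62_SUBSET):
    """Concatenate the 20 scores of each residue (hypothetical helper)."""
    features = []
    for residue in sequence:
        # Assumption: residues missing from the table use the 'X' row.
        features.extend(table.get(residue, table['X']))
    return features


vector = blosum62_sketch("ACA")
print(len(vector))  # 3 residues x 20 scores
```

Under these assumptions the feature vector grows linearly with sequence length, so sequences of different lengths yield vectors of different lengths.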

get_word2vec_model()

Returns a Word2VecModel created by overlapping_ngram_word2vec_encode()

Returns:

model

overlapping Ngram Word2VecModel if available, otherwise None

model = None
one_hot_encode(data=None, inputCol=None, outputCol=None)

One-hot encodes a protein sequence. The one-hot encoding encodes the 20 natural amino acids, plus X for any other residue for a total of 21 elements per residue.

Parameters:

data : DataFrame

input data to be encoded [None]

inputCol : str

name of the input column [None]

outputCol : str

name of the output column [None]
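To make the 21-elements-per-residue layout concrete, here is a minimal plain-Python sketch. It is not the mmtfPyspark implementation: the helper name one_hot_sketch is hypothetical, and the flat residue-by-residue ordering of the output is an assumption. The alphabet is the AMINO_ACIDS21 list documented above.

```python
# Illustrative sketch only; not the library's code.
AMINO_ACIDS21 = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M',
                 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y']


def one_hot_sketch(sequence):
    """Return a flat list with 21 elements per residue (hypothetical helper)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS21)}
    x_index = index['X']
    features = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS21)
        # Any residue outside the 20 natural amino acids is encoded as X.
        row[index.get(residue, x_index)] = 1.0
        features.extend(row)
    return features


# 'B' is not one of the 20 natural amino acids, so it falls back to X.
vector = one_hot_sketch("ACB")
print(len(vector))  # 3 residues x 21 elements
```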

overlapping_ngram_word2vec_encode(data=None, inputCol=None, outputCol=None, n=None, windowSize=None, vectorSize=None, fileName=None, sc=None)

Encodes a protein sequence by converting it into n-grams and then transforming it into a Word2Vec feature vector.

If a Word2Vec model file name is given, the sequence is converted into n-grams and then transformed using the pre-trained Word2Vec model read from that file.

Parameters:

data : DataFrame

input data to be encoded [None]

inputCol : str

name of the input column [None]

outputCol : str

name of the output column [None]

n : int

The number of words in an n-gram [None]

windowSize : int

width of the window used to slide across the sequence; context words range from -window to window [None]

vectorSize : int

dimension of the feature vector [None]

fileName : str

filename of Word2Vec model [None]

sc : SparkContext

spark context [None]

Returns:

dataset

dataset with features vector added to original dataset
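The n-gram step described above can be illustrated in plain Python. This sketch covers only the splitting of a sequence into overlapping n-grams; the Word2Vec training and transformation are not shown. The helper name overlapping_ngrams is hypothetical and is not part of mmtfPyspark.

```python
# Illustrative sketch only; not the library's code.
def overlapping_ngrams(sequence, n):
    """Slide a window of width n over the sequence, one residue at a time."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


words = overlapping_ngrams("MKVLA", 2)
print(words)  # ['MK', 'KV', 'VL', 'LA']
```

A sequence of length L yields L - n + 1 overlapping n-grams, which then serve as the "words" for the Word2Vec model.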

properties = {'A': [1.28, 0.05, 1.0, 0.31, 6.11, 0.42, 0.23], 'C': [1.77, 0.13, 2.43, 1.54, 6.35, 0.17, 0.41], 'D': [1.6, 0.11, 2.78, -0.77, 2.95, 0.25, 0.2], 'E': [1.56, 0.15, 3.78, -0.64, 3.09, 0.42, 0.21], 'F': [2.94, 0.29, 5.89, 1.79, 5.67, 0.3, 0.38], 'G': [0.0, 0.0, 0.0, 0.0, 6.07, 0.13, 0.15], 'H': [2.99, 0.23, 4.66, 0.13, 7.69, 0.27, 0.3], 'I': [4.19, 0.19, 4.0, 1.8, 6.04, 0.3, 0.45], 'K': [1.89, 0.22, 4.77, -0.99, 9.99, 0.32, 0.27], 'L': [2.59, 0.19, 4.0, 1.7, 6.04, 0.39, 0.31], 'M': [2.35, 0.22, 4.43, 1.23, 5.71, 0.38, 0.32], 'N': [1.6, 0.13, 2.95, -0.6, 6.52, 0.21, 0.22], 'P': [2.67, 0.0, 2.72, 0.72, 6.8, 0.13, 0.34], 'Q': [1.56, 0.18, 3.95, -0.22, 5.65, 0.36, 0.25], 'R': [2.34, 0.29, 6.13, -1.01, 10.74, 0.36, 0.25], 'S': [1.31, 0.06, 1.6, -0.04, 5.7, 0.2, 0.28], 'T': [3.03, 0.11, 2.6, 0.26, 5.6, 0.21, 0.36], 'V': [3.67, 0.14, 3.0, 1.22, 6.02, 0.27, 0.49], 'W': [3.21, 0.41, 8.08, 2.25, 5.94, 0.32, 0.42], 'X': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 'Y': [2.94, 0.3, 6.47, 0.96, 5.66, 0.25, 0.41]}
property_encode(data=None, inputCol=None, outputCol=None)

Encodes a protein sequence by 7 physicochemical properties

Parameters:

data : DataFrame

input data to be encoded [None]

inputCol : str

name of the input column [None]

outputCol : str

name of the output column [None]

Returns:

dataset

dataset with feature vector appended

References

Meiler, J., Müller, M., Zeidler, A. et al. J Mol Model (2001) https://link.springer.com/article/10.1007/s008940100038
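As a rough illustration of the resulting shape, the sketch below looks up the seven property values for each residue and concatenates them. It is not the mmtfPyspark implementation: the helper name property_sketch is hypothetical, only three rows of the properties table are reproduced, and the fallback of unknown residues to the 'X' row is an assumption.

```python
# Illustrative sketch only; not the library's code.
# Three rows copied from the properties table documented above.
PROPERTIES_SUBSET = {
    'A': [1.28, 0.05, 1.0, 0.31, 6.11, 0.42, 0.23],
    'G': [0.0, 0.0, 0.0, 0.0, 6.07, 0.13, 0.15],
    'X': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}


def property_sketch(sequence, table=PROPERTIES_SUBSET):
    """Concatenate the 7 property values of each residue (hypothetical helper)."""
    features = []
    for residue in sequence:
        # Assumption: residues missing from the table use the 'X' row.
        features.extend(table.get(residue, table['X']))
    return features


vector = property_sketch("AGZ")
print(len(vector))  # 3 residues x 7 properties
```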

shifted_3gram_word2vec_encode(data=None, inputCol=None, outputCol=None, windowSize=None, vectorSize=None, fileName=None, sc=None)

Encodes a protein sequence as three non-overlapping 3-grams, trains a Word2Vec model on the 3-grams, and then averages the three resulting feature vectors.

Parameters:

data : DataFrame

input data to be encoded [None]

inputCol : str

name of the input column [None]

outputCol : str

name of the output column [None]

windowSize : int

width of the window used to slide across the sequence; context words range from -window to window [None]

vectorSize : int

dimension of the feature vector [None]

fileName : str

filename of Word2VecModel [None]

sc : SparkContext

spark context [None]

Returns:

dataset

dataset with features vector added to original dataset

References

Asgari E, Mofrad MRK (2015) Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLOS ONE 10(11): e0141287. doi: https://doi.org/10.1371/journal.pone.0141287
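The shifting and averaging steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the mmtfPyspark implementation: the helper names shifted_3grams and average_vectors are hypothetical, trailing residues that do not fill a complete 3-gram are assumed to be dropped, and the Word2Vec step that produces one vector per shifted list is omitted.

```python
# Illustrative sketch only; not the library's code.
def shifted_3grams(sequence):
    """Return three lists of non-overlapping 3-grams, starting at offsets 0, 1, 2."""
    result = []
    for shift in range(3):
        shifted = sequence[shift:]
        # Assumption: an incomplete trailing 3-gram is discarded.
        usable = len(shifted) - len(shifted) % 3
        result.append([shifted[i:i + 3] for i in range(0, usable, 3)])
    return result


def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]


grams = shifted_3grams("MKVLAAGIV")
print(grams[0])  # ['MKV', 'LAA', 'GIV']
print(grams[1])  # ['KVL', 'AAG']
print(grams[2])  # ['VLA', 'AGI']

# Stand-in vectors for the three per-shift Word2Vec outputs.
print(average_vectors([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))  # [3.0, 4.0]
```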