proteinSequenceEncoder.py
This class encodes a protein sequence into a feature vector. The protein sequence must be present in the input data set, the default column name is “sequence”. The default column name for the feature vector is “features”.
ProteinSequenceEncoder
(data=None, inputCol='sequence', outputCol='features')
Bases: object
Attributes
data | (DataFrame) input data to be encoded [None] |
inputCol | (str) name of the input column [sequence] |
outputCol | (str) name of the output column [features] |
Methods
blosum62_encode ([data, inputCol, outputCol]) |
Encodes a protein sequence using the Blosum62 substitution matrix (20 values per residue) |
get_word2vec_model () |
Returns a Word2VecModel created by overlapping_ngram_word2vec_encode() |
one_hot_encode ([data, inputCol, outputCol]) |
One-hot encodes a protein sequence. |
overlapping_ngram_word2vec_encode ([data, …]) |
Encodes a protein sequence by converting it into n-grams and then transforming it into a Word2Vec feature vector. |
property_encode ([data, inputCol, outputCol]) |
Encodes a protein sequence by 7 physicochemical properties |
shifted_3gram_word2vec_encode ([data, …]) |
Encodes a protein sequence as three shifted sets of non-overlapping 3-grams, trains a Word2Vec model on the 3-grams, and then averages the three resulting feature vectors. |
AMINO_ACIDS21
= ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y']
blosum62
= {'A': [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0], 'C': [0, -3, -3, -3, 9, -3, -4, -3, -3, -1, -1, -3, -1, -2, -3, -1, -1, -2, -2, -1], 'D': [-2, -2, 1, 6, -3, 0, 2, -1, -1, -3, -4, -1, -3, -3, -1, 0, -1, -4, -3, -3], 'E': [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3, -3, 1, -2, -3, -1, 0, -1, -3, -2, -2], 'F': [-2, -3, -3, -3, -2, -3, -3, -3, -1, 0, 0, -3, 0, 6, -4, -2, -2, 1, 3, -1], 'G': [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3], 'H': [-2, 0, 1, -1, -3, 0, 0, -2, 8, -3, -3, -1, -2, -1, -2, -1, -2, -2, 2, -3], 'I': [-1, -3, -3, -3, -1, -3, -3, -4, -3, 4, 2, -3, 1, 0, -3, -2, -1, -3, -1, 3], 'K': [-1, 2, 0, -1, -3, 1, 1, -2, -1, -3, -2, 5, -1, -3, -1, 0, -1, -3, -2, -2], 'L': [-1, -2, -3, -4, -1, -2, -3, -4, -3, 2, 4, -2, 2, 0, -3, -2, -1, -2, -1, 1], 'M': [-1, -1, -2, -3, -1, 0, -2, -3, -2, 1, 2, -1, 5, 0, -2, -1, -1, -1, -1, 1], 'N': [-2, 0, 6, 1, -3, 0, 0, 0, 1, -3, -3, 0, -2, -3, -2, 1, 0, -4, -2, -3], 'P': [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2], 'Q': [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2], 'R': [-1, 5, 0, -2, -3, 1, 0, -2, 0, -3, -2, 2, -1, -3, -2, -1, -1, -3, -2, -3], 'S': [1, -1, 1, 0, -1, 0, 0, 0, -1, -2, -2, 0, -1, -2, -1, 4, 1, -3, -2, -2], 'T': [0, -1, 0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1, 1, 5, -2, -2, 0], 'V': [0, -3, -3, -3, -1, -2, -2, -3, -3, 3, 1, -2, 1, -1, -2, -2, 0, -3, -1, 4], 'W': [-3, -3, -4, -4, -2, -2, -3, -2, -2, -3, -2, -3, -1, 1, -4, -3, -2, 11, 2, -3], 'X': [-4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4], 'Y': [-2, -2, -2, -3, -2, -1, -2, -3, 2, -1, -1, -2, -1, 3, -3, -2, -2, 2, 7, -1]}
blosum62_encode
(data=None, inputCol=None, outputCol=None)
Encodes a protein sequence using the Blosum62 substitution matrix (20 values per residue).
Parameters
data | DataFrame |
inputCol | str |
outputCol | str |
Returns
dataset
References
Blosum Matrix https://ftp.ncbi.nih.gov/repository/blocks/unix/blosum/BLOSUM/blosum62.blast.new
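The table lookup behind this encoding can be illustrated in plain Python. This is only a sketch under stated assumptions, not the library's implementation: `blosum62_sketch` and `BLOSUM62_SUBSET` are hypothetical names, the per-residue rows are assumed to be concatenated into one flat vector, and residues missing from the table are assumed to fall back to the 'X' row. The real method operates on a Spark DataFrame column.

```python
# Two rows copied from the blosum62 table above, plus the 'X' fallback row.
# Only 'A' and 'G' are included here, so every other residue maps to 'X'
# in this sketch (the full table has a row for each of the 21 symbols).
BLOSUM62_SUBSET = {
    'A': [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    'G': [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    'X': [-4] * 20,
}


def blosum62_sketch(sequence, table=BLOSUM62_SUBSET):
    """Concatenate the 20-value row of each residue (assumed layout)."""
    values = []
    for residue in sequence:
        # Assumption: residues without a row use the 'X' row.
        values.extend(table.get(residue, table['X']))
    return values


encoded = blosum62_sketch("AGZ")
print(len(encoded))  # 60, i.e. 3 residues x 20 values
```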
get_word2vec_model
()
Returns a Word2VecModel created by overlapping_ngram_word2vec_encode().
Returns
model
model
= None
one_hot_encode
(data=None, inputCol=None, outputCol=None)
One-hot encodes a protein sequence. The encoding covers the 20 natural amino acids plus X for any other residue, for a total of 21 elements per residue.
Parameters
data | DataFrame |
inputCol | str |
outputCol | str |
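As an illustration of the 21-elements-per-residue layout described above, here is a minimal plain-Python sketch. It is not the library's code: `one_hot_sketch` is a hypothetical helper, and concatenating the per-residue rows into one flat list is an assumption. The alphabet is copied from AMINO_ACIDS21.

```python
AMINO_ACIDS21 = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M',
                 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y']


def one_hot_sketch(sequence):
    """Return a flat list with 21 values per residue, exactly one of them 1.0."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS21)}
    unknown = index['X']  # any residue outside the alphabet is treated as X
    values = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS21)
        row[index.get(residue, unknown)] = 1.0
        values.extend(row)
    return values


encoded = one_hot_sketch("ACB")  # 'B' is not in the alphabet, so it maps to X
print(len(encoded))  # 63, i.e. 3 residues x 21 elements
```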
overlapping_ngram_word2vec_encode
(data=None, inputCol=None, outputCol=None, n=None, windowSize=None, vectorSize=None, fileName=None, sc=None)
Encodes a protein sequence by converting it into n-grams and then transforming it into a Word2Vec feature vector.
If a Word2Vec file name is given, the n-grams are instead transformed with the pre-trained Word2Vec model read from that file.
Parameters
data | DataFrame |
inputCol | str |
outputCol | str |
n | int |
windowSize | int |
vectorSize | int |
fileName | str |
sc | SparkContext |
Returns
dataset
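The n-gram step that precedes the Word2Vec transformation can be sketched in plain Python. `overlapping_ngrams` is a hypothetical helper written for illustration; it assumes a sliding window with a step of one residue and does not cover the Word2Vec training or the Spark DataFrame handling.

```python
def overlapping_ngrams(sequence, n):
    """Slide a window of length n over the sequence, one residue at a time."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


print(overlapping_ngrams("MKTAY", 2))  # ['MK', 'KT', 'TA', 'AY']
```

A sequence shorter than n yields an empty list in this sketch.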
properties
= {'A': [1.28, 0.05, 1.0, 0.31, 6.11, 0.42, 0.23], 'C': [1.77, 0.13, 2.43, 1.54, 6.35, 0.17, 0.41], 'D': [1.6, 0.11, 2.78, -0.77, 2.95, 0.25, 0.2], 'E': [1.56, 0.15, 3.78, -0.64, 3.09, 0.42, 0.21], 'F': [2.94, 0.29, 5.89, 1.79, 5.67, 0.3, 0.38], 'G': [0.0, 0.0, 0.0, 0.0, 6.07, 0.13, 0.15], 'H': [2.99, 0.23, 4.66, 0.13, 7.69, 0.27, 0.3], 'I': [4.19, 0.19, 4.0, 1.8, 6.04, 0.3, 0.45], 'K': [1.89, 0.22, 4.77, -0.99, 9.99, 0.32, 0.27], 'L': [2.59, 0.19, 4.0, 1.7, 6.04, 0.39, 0.31], 'M': [2.35, 0.22, 4.43, 1.23, 5.71, 0.38, 0.32], 'N': [1.6, 0.13, 2.95, -0.6, 6.52, 0.21, 0.22], 'P': [2.67, 0.0, 2.72, 0.72, 6.8, 0.13, 0.34], 'Q': [1.56, 0.18, 3.95, -0.22, 5.65, 0.36, 0.25], 'R': [2.34, 0.29, 6.13, -1.01, 10.74, 0.36, 0.25], 'S': [1.31, 0.06, 1.6, -0.04, 5.7, 0.2, 0.28], 'T': [3.03, 0.11, 2.6, 0.26, 5.6, 0.21, 0.36], 'V': [3.67, 0.14, 3.0, 1.22, 6.02, 0.27, 0.49], 'W': [3.21, 0.41, 8.08, 2.25, 5.94, 0.32, 0.42], 'X': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 'Y': [2.94, 0.3, 6.47, 0.96, 5.66, 0.25, 0.41]}
property_encode
(data=None, inputCol=None, outputCol=None)
Encodes a protein sequence by 7 physicochemical properties.
Parameters
data | DataFrame |
inputCol | str |
outputCol | str |
Returns
dataset
References
Meiler, J., Müller, M., Zeidler, A. et al. J Mol Model (2001) https://link.springer.com/article/10.1007/s008940100038
shifted_3gram_word2vec_encode
(data=None, inputCol=None, outputCol=None, windowSize=None, vectorSize=None, fileName=None, sc=None)
Encodes a protein sequence as three shifted sets of non-overlapping 3-grams, trains a Word2Vec model on the 3-grams, and then averages the three resulting feature vectors.
Parameters
data | DataFrame |
inputCol | str |
outputCol | str |
windowSize | int |
vectorSize | int |
fileName | str |
sc | SparkContext |
Returns
dataset
References
Asgari E, Mofrad MRK (2015) Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLOS ONE 10(11): e0141287. doi: https://doi.org/10.1371/journal.pone.0141287
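The splitting and averaging steps of the shifted 3-gram encoding can be sketched in plain Python. Both helpers below are hypothetical and written for illustration only: `shifted_3grams` assumes the three sets start at offsets 0, 1 and 2 and drop any incomplete trailing fragment, and `average_vectors` assumes an element-wise mean. The Word2Vec training between the two steps is omitted.

```python
def shifted_3grams(sequence):
    """Split a sequence into three lists of non-overlapping 3-grams,
    starting at offsets 0, 1 and 2 (incomplete trailing fragments dropped)."""
    shifted = []
    for offset in range(3):
        tail = sequence[offset:]
        shifted.append([tail[i:i + 3] for i in range(0, len(tail) - 2, 3)])
    return shifted


def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


print(shifted_3grams("MKTAYIAKQ"))
# [['MKT', 'AYI', 'AKQ'], ['KTA', 'YIA'], ['TAY', 'IAK']]
print(average_vectors([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))  # [3.0, 4.0]
```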