proteinSequenceEncoder.py
This class encodes a protein sequence into a feature vector. The protein sequence must be present in the input data set, the default column name is “sequence”. The default column name for the feature vector is “features”.
ProteinSequenceEncoder(data=None, inputCol='sequence', outputCol='features')
Bases: object
Attributes
| data | (DataFrame) input data to be encoded [None] |
| inputCol | (str) name of the input column [sequence] |
| outputCol | (str) name of the output column [features] |
Methods
| blosum62_encode([data, inputCol, outputCol]) | Encodes a protein sequence using the BLOSUM62 substitution matrix. |
| get_word2vec_model() | Returns the Word2VecModel created by overlapping_ngram_word2vec_encode(). |
| one_hot_encode([data, inputCol, outputCol]) | One-hot encodes a protein sequence. |
| overlapping_ngram_word2vec_encode([data, …]) | Encodes a protein sequence by converting it into n-grams and then transforming it into a Word2Vec feature vector. |
| property_encode([data, inputCol, outputCol]) | Encodes a protein sequence by 7 physicochemical properties. |
| shifted_3gram_word2vec_encode([data, …]) | Encodes a protein sequence as three shifted sets of non-overlapping 3-grams, trains a Word2Vec model on the 3-grams, and then averages the three resulting feature vectors. |
AMINO_ACIDS21 = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y']
blosum62 = {'A': [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0], 'C': [0, -3, -3, -3, 9, -3, -4, -3, -3, -1, -1, -3, -1, -2, -3, -1, -1, -2, -2, -1], 'D': [-2, -2, 1, 6, -3, 0, 2, -1, -1, -3, -4, -1, -3, -3, -1, 0, -1, -4, -3, -3], 'E': [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3, -3, 1, -2, -3, -1, 0, -1, -3, -2, -2], 'F': [-2, -3, -3, -3, -2, -3, -3, -3, -1, 0, 0, -3, 0, 6, -4, -2, -2, 1, 3, -1], 'G': [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3], 'H': [-2, 0, 1, -1, -3, 0, 0, -2, 8, -3, -3, -1, -2, -1, -2, -1, -2, -2, 2, -3], 'I': [-1, -3, -3, -3, -1, -3, -3, -4, -3, 4, 2, -3, 1, 0, -3, -2, -1, -3, -1, 3], 'K': [-1, 2, 0, -1, -3, 1, 1, -2, -1, -3, -2, 5, -1, -3, -1, 0, -1, -3, -2, -2], 'L': [-1, -2, -3, -4, -1, -2, -3, -4, -3, 2, 4, -2, 2, 0, -3, -2, -1, -2, -1, 1], 'M': [-1, -1, -2, -3, -1, 0, -2, -3, -2, 1, 2, -1, 5, 0, -2, -1, -1, -1, -1, 1], 'N': [-2, 0, 6, 1, -3, 0, 0, 0, 1, -3, -3, 0, -2, -3, -2, 1, 0, -4, -2, -3], 'P': [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2], 'Q': [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2], 'R': [-1, 5, 0, -2, -3, 1, 0, -2, 0, -3, -2, 2, -1, -3, -2, -1, -1, -3, -2, -3], 'S': [1, -1, 1, 0, -1, 0, 0, 0, -1, -2, -2, 0, -1, -2, -1, 4, 1, -3, -2, -2], 'T': [0, -1, 0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1, 1, 5, -2, -2, 0], 'V': [0, -3, -3, -3, -1, -2, -2, -3, -3, 3, 1, -2, 1, -1, -2, -2, 0, -3, -1, 4], 'W': [-3, -3, -4, -4, -2, -2, -3, -2, -2, -3, -2, -3, -1, 1, -4, -3, -2, 11, 2, -3], 'X': [-4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4], 'Y': [-2, -2, -2, -3, -2, -1, -2, -3, 2, -1, -1, -2, -1, 3, -3, -2, -2, 2, 7, -1]}
blosum62_encode(data=None, inputCol=None, outputCol=None)
Encodes a protein sequence using the BLOSUM62 substitution matrix. Each residue is represented by its 20-element row of substitution scores from the blosum62 table.
| Parameters: | data : DataFrame, inputCol : str, outputCol : str |
|---|---|
| Returns: | dataset |
References
Blosum Matrix https://ftp.ncbi.nih.gov/repository/blocks/unix/blosum/BLOSUM/blosum62.blast.new
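The lookup behind this encoding can be sketched in plain Python. This is a minimal illustration, not the class's implementation: the helper name `blosum62_rows` and the table name `BLOSUM62_SUBSET` are hypothetical, only three rows are copied from the blosum62 table above, and both the concatenation of per-residue rows and the fallback to the 'X' row are assumptions.

```python
# Three rows copied from the blosum62 table above; the full table has 21 keys.
BLOSUM62_SUBSET = {
    'A': [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    'G': [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    'X': [-4] * 20,
}


def blosum62_rows(sequence, table=BLOSUM62_SUBSET):
    """Hypothetical helper: concatenate one 20-element score row per residue."""
    encoded = []
    for residue in sequence:
        # Assumption: residues missing from the table are treated as 'X'.
        encoded.extend(table.get(residue, table['X']))
    return encoded


encoded = blosum62_rows("AG")
print(len(encoded))  # 2 residues x 20 scores
```

Under these assumptions the feature length grows with the sequence length, so sequences of equal length are needed to obtain vectors of equal size.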
get_word2vec_model()
Returns the Word2VecModel created by overlapping_ngram_word2vec_encode().
| Returns: | model |
|---|---|
model = None
one_hot_encode(data=None, inputCol=None, outputCol=None)
One-hot encodes a protein sequence. The encoding covers the 20 natural amino acids plus X for any other residue, for a total of 21 elements per residue.
| Parameters: | data : DataFrame, inputCol : str, outputCol : str |
|---|---|
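The 21-elements-per-residue layout described above can be sketched in plain Python. This is an illustration under stated assumptions: the helper name `one_hot` is hypothetical, the element order is assumed to follow AMINO_ACIDS21, the per-residue vectors are assumed to be concatenated, and plain lists stand in for whatever vector type the class actually returns.

```python
AMINO_ACIDS21 = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M',
                 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y']


def one_hot(sequence):
    """Hypothetical helper: one 21-element indicator vector per residue."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS21)}
    unknown = index['X']  # any residue outside the 20 natural ones maps to X
    encoded = []
    for residue in sequence:
        vector = [0.0] * len(AMINO_ACIDS21)
        vector[index.get(residue, unknown)] = 1.0
        encoded.extend(vector)
    return encoded


features = one_hot("ACU")  # 'U' is not in the list, so it is encoded as X
print(len(features))       # 3 residues x 21 elements
```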
overlapping_ngram_word2vec_encode(data=None, inputCol=None, outputCol=None, n=None, windowSize=None, vectorSize=None, fileName=None, sc=None)
Encodes a protein sequence by converting it into n-grams and then transforming it into a Word2Vec feature vector.
If a Word2Vec file name is given, the n-grams are instead transformed using the pre-trained Word2Vec model read from that file.
| Parameters: | data : DataFrame, inputCol : str, outputCol : str, n : int, windowSize : int, vectorSize : int, fileName : str, sc : SparkContext |
|---|---|
| Returns: | dataset |
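The n-gram step can be illustrated on its own. The sketch below covers only the splitting of a sequence into overlapping n-grams; the helper name `overlapping_ngrams` is hypothetical, and the Word2Vec training and transformation are left out.

```python
def overlapping_ngrams(sequence, n):
    """Hypothetical helper: slide a window of width n one residue at a time."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


words = overlapping_ngrams("IDCGH", 2)
print(words)  # ['ID', 'DC', 'CG', 'GH']
```

Each n-gram then plays the role of a "word" for Word2Vec, with windowSize and vectorSize presumably controlling the context window and the embedding dimension.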
properties = {'A': [1.28, 0.05, 1.0, 0.31, 6.11, 0.42, 0.23], 'C': [1.77, 0.13, 2.43, 1.54, 6.35, 0.17, 0.41], 'D': [1.6, 0.11, 2.78, -0.77, 2.95, 0.25, 0.2], 'E': [1.56, 0.15, 3.78, -0.64, 3.09, 0.42, 0.21], 'F': [2.94, 0.29, 5.89, 1.79, 5.67, 0.3, 0.38], 'G': [0.0, 0.0, 0.0, 0.0, 6.07, 0.13, 0.15], 'H': [2.99, 0.23, 4.66, 0.13, 7.69, 0.27, 0.3], 'I': [4.19, 0.19, 4.0, 1.8, 6.04, 0.3, 0.45], 'K': [1.89, 0.22, 4.77, -0.99, 9.99, 0.32, 0.27], 'L': [2.59, 0.19, 4.0, 1.7, 6.04, 0.39, 0.31], 'M': [2.35, 0.22, 4.43, 1.23, 5.71, 0.38, 0.32], 'N': [1.6, 0.13, 2.95, -0.6, 6.52, 0.21, 0.22], 'P': [2.67, 0.0, 2.72, 0.72, 6.8, 0.13, 0.34], 'Q': [1.56, 0.18, 3.95, -0.22, 5.65, 0.36, 0.25], 'R': [2.34, 0.29, 6.13, -1.01, 10.74, 0.36, 0.25], 'S': [1.31, 0.06, 1.6, -0.04, 5.7, 0.2, 0.28], 'T': [3.03, 0.11, 2.6, 0.26, 5.6, 0.21, 0.36], 'V': [3.67, 0.14, 3.0, 1.22, 6.02, 0.27, 0.49], 'W': [3.21, 0.41, 8.08, 2.25, 5.94, 0.32, 0.42], 'X': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 'Y': [2.94, 0.3, 6.47, 0.96, 5.66, 0.25, 0.41]}
property_encode(data=None, inputCol=None, outputCol=None)
Encodes a protein sequence by 7 physicochemical properties.
| Parameters: | data : DataFrame, inputCol : str, outputCol : str |
|---|---|
| Returns: | dataset |
References
Meiler, J., Müller, M., Zeidler, A. et al. J Mol Model (2001) https://link.springer.com/article/10.1007/s008940100038
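As a rough illustration of the property lookup, the sketch below maps each residue to its 7 values and concatenates them. The names `PROPERTIES_SUBSET` and `property_rows` are hypothetical, only three rows are copied from the properties table above, and the concatenation and the all-zero 'X' fallback are assumptions about how the per-residue values are combined.

```python
# Three rows copied from the properties table above; the full table has 21 keys.
PROPERTIES_SUBSET = {
    'A': [1.28, 0.05, 1.0, 0.31, 6.11, 0.42, 0.23],
    'G': [0.0, 0.0, 0.0, 0.0, 6.07, 0.13, 0.15],
    'X': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}


def property_rows(sequence, table=PROPERTIES_SUBSET):
    """Hypothetical helper: concatenate 7 property values per residue."""
    encoded = []
    for residue in sequence:
        # Assumption: residues missing from the table are treated as 'X'.
        encoded.extend(table.get(residue, table['X']))
    return encoded


property_features = property_rows("AG")
print(len(property_features))  # 2 residues x 7 properties
```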
shifted_3gram_word2vec_encode(data=None, inputCol=None, outputCol=None, windowSize=None, vectorSize=None, fileName=None, sc=None)
Encodes a protein sequence as three shifted sets of non-overlapping 3-grams, trains a Word2Vec model on the 3-grams, and then averages the three resulting feature vectors.
| Parameters: | data : DataFrame, inputCol : str, outputCol : str, windowSize : int, vectorSize : int, fileName : str, sc : SparkContext |
|---|---|
| Returns: | dataset |
References
Asgari E, Mofrad MRK (2015) Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLOS ONE 10(11): e0141287. doi: https://doi.org/10.1371/journal.pone.0141287
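The shifted 3-gram scheme from the reference above can be sketched without Spark. This is a minimal illustration, assuming the three sets start at offsets 0, 1 and 2 and that incomplete trailing 3-grams are dropped; the helper names `shifted_3grams` and `average_vectors` are hypothetical, and the Word2Vec training between the two steps is omitted.

```python
def shifted_3grams(sequence):
    """Hypothetical helper: non-overlapping 3-grams for offsets 0, 1 and 2."""
    frames = []
    for shift in range(3):
        # Step by 3 so the 3-grams within one frame do not overlap.
        frame = [sequence[i:i + 3] for i in range(shift, len(sequence) - 2, 3)]
        frames.append(frame)
    return frames


def average_vectors(vectors):
    """Hypothetical helper: element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


frames = shifted_3grams("IDCGHVDSL")
print(frames)
# [['IDC', 'GHV', 'DSL'], ['DCG', 'HVD'], ['CGH', 'VDS']]

# Stand-in vectors for the three frames; a trained model would supply these.
print(average_vectors([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))  # [3.0, 4.0]
```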