sequenceNgrammer.py
This class contains methods for creating overlapping and non-overlapping n-grams of one-letter code sequences (e.g., protein sequences).
ngram(data, n, outputCol)
Splits a one-letter sequence column (e.g., protein sequence) into an array of overlapping n-grams.
Parameters:
    data : dataset
    n : int
    outputCol : str
Returns:
    dataset
Examples
2-gram: IDCGH … => [ID, DC, CG, GH, …]
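The overlapping split in the example above can be sketched on a single string in plain Python. This is only an illustration of the windowing, not the library's implementation: `ngram` itself operates on a dataset column, and the helper name `overlapping_ngrams` is hypothetical.

```python
def overlapping_ngrams(sequence, n):
    """Return all overlapping n-grams of a one-letter code sequence.

    Hypothetical helper for illustration; it is not part of
    sequenceNgrammer.py, which works on dataset columns instead.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Slide a window of width n one position at a time.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


print(overlapping_ngrams("IDCGH", 2))  # ['ID', 'DC', 'CG', 'GH']
```

A sequence shorter than n yields an empty list in this sketch.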
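shifted_ngram(data, n, shift, outputCol)
Splits a one-letter sequence column (e.g., protein sequence) into an array of non-overlapping n-grams, starting at the position given by shift.
Parameters:
    data : dataset
    n : int
    shift : int
    outputCol : str
Returns:
    dataset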
References
For an application of shifted n-grams, see: E Asgari, MRK Mofrad, PLoS One. 2015; 10(11): e0141287, doi: https://dx.doi.org/10.1371/journal.pone.0141287
Examples
3-gram (shift=0): IDCGHTVEDQR … => [IDC, GHT, VED, …]
3-gram (shift=1): IDCGHTVEDQR … => [DCG, HTV, EDQ, …]
3-gram (shift=2): IDCGHTVEDQR … => [CGH, TVE, DQR, …]
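The shifted examples can be reproduced on a single string with a short plain-Python sketch. The helper name `shifted_ngrams` is hypothetical, and dropping a trailing fragment shorter than n is an assumption made here; the library function works on a dataset column and may treat the sequence end differently.

```python
def shifted_ngrams(sequence, n, shift):
    """Return non-overlapping n-grams, starting at offset `shift`.

    Hypothetical helper for illustration; it is not part of
    sequenceNgrammer.py. Assumption: a trailing fragment shorter
    than n is dropped.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    if shift < 0:
        raise ValueError("shift must be non-negative")
    # Step through the sequence n characters at a time, beginning at `shift`.
    return [sequence[i:i + n] for i in range(shift, len(sequence) - n + 1, n)]


seq = "IDCGHTVEDQR"
for shift in range(3):
    print(shift, shifted_ngrams(seq, 3, shift))
# 0 ['IDC', 'GHT', 'VED']
# 1 ['DCG', 'HTV', 'EDQ']
# 2 ['CGH', 'TVE', 'DQR']
```

Taken together, the shifts 0 through n-1 cover every overlapping n-gram of the sequence, split across n separate reading frames.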