mmtfPyspark.ml.sequenceNgrammer module

sequenceNgrammer.py

This class contains methods for creating overlapping and non-overlapping n-grams of one-letter-code sequences (e.g., protein sequences).

ngram(data, n, outputCol)

Splits a one-letter sequence column (e.g., protein sequence) into an array of overlapping n-grams.

Parameters:

data : dataset

input dataset with column “sequence”

n : int

size of the n-gram

outputCol : str

name of the output column

Returns:

dataset

output dataset with appended ngram column

Examples

2-gram: IDCGH … => [ID, DC, CG, GH, …]
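The sliding-window behaviour in the example above can be illustrated on a single string in plain Python. This is only a sketch of the idea, not the library's implementation: `overlapping_ngrams` is a hypothetical helper, and the real `ngram` operates on a dataset column rather than a Python string.

```python
def overlapping_ngrams(sequence, n):
    """Return all overlapping n-grams of a one-letter-code sequence.

    Hypothetical helper for illustration only; not part of mmtfPyspark.
    """
    # Slide a window of width n along the sequence, one position at a time.
    # If the sequence is shorter than n, the range is empty and so is the result.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


print(overlapping_ngrams("IDCGH", 2))  # ['ID', 'DC', 'CG', 'GH']
```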

shifted_ngram(data, n, shift, outputCol)

Splits a one-letter sequence column (e.g., protein sequence) into an array of non-overlapping n-grams. To generate all possible n-grams, this method needs to be called n times with shift parameters {0, …, n-1}.

Parameters:

data : dataset

input dataset with column “sequence”

n : int

size of the n-gram

shift : int

start index for the n-gram

outputCol : str

name of the output column

Returns:

dataset

output dataset with appended ngram column

References

For an application of shifted n-grams see: E Asgari, MRK Mofrad, PLoS One. 2015; 10(11): e0141287, doi: https://dx.doi.org/10.1371/journal.pone.0141287

Examples

3-gram(shift=0) : IDCGHTVEDQR … => [IDC, GHT, VED, …]
3-gram(shift=1) : IDCGHTVEDQR … => [DCG, HTV, EDQ, …]
3-gram(shift=2) : IDCGHTVEDQR … => [CGH, TVE, DQR, …]
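The shifted examples can be reproduced on a plain string as follows. This is a minimal sketch, not the library's code: `shifted_ngrams` is a hypothetical helper, it treats the 11-residue string as the whole sequence, and it assumes that a trailing fragment shorter than n is dropped (the documentation above does not say how the real method handles it). The last lines check that the shifts 0 to n-1 together yield every overlapping n-gram.

```python
def shifted_ngrams(sequence, n, shift):
    """Return non-overlapping n-grams of `sequence`, starting at index `shift`.

    Hypothetical helper for illustration only; not part of mmtfPyspark.
    Assumption: an incomplete n-gram at the end of the sequence is dropped.
    """
    # Start at `shift` and advance n positions at a time, so windows never overlap.
    return [sequence[i:i + n] for i in range(shift, len(sequence) - n + 1, n)]


seq = "IDCGHTVEDQR"
n = 3

for shift in range(n):
    print(shift, shifted_ngrams(seq, n, shift))
# 0 ['IDC', 'GHT', 'VED']
# 1 ['DCG', 'HTV', 'EDQ']
# 2 ['CGH', 'TVE', 'DQR']

# Combining the results for shifts 0..n-1 covers every overlapping n-gram.
combined = [g for s in range(n) for g in shifted_ngrams(seq, n, s)]
all_overlapping = [seq[i:i + n] for i in range(len(seq) - n + 1)]
print(sorted(combined) == sorted(all_overlapping))  # True
```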