mmtfPyspark.webfilters.sequenceSimilarity module

sequenceSimilarity.py

This filter returns entries that pass the sequence similarity search criteria. It searches protein and nucleic acid sequences using BLAST. PSI-BLAST is used to find more distantly related protein sequences.

The E value, or Expect value, is a parameter that describes the number of hits one can expect to see just by chance when searching a database of a particular size. For example, an E value of 1 means that one hit with a similar score is expected simply by chance. The scoring takes chain length into consideration, so shorter sequences can have identical matches with a high E value.
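The dependence of the E value on query length and database size can be illustrated with the standard Karlin-Altschul relation E = m * n * 2^(-S'), where m is the query length, n is the total database length, and S' is the bit score. The following is a minimal sketch, not the code used by this filter or by the search service: the function name, the database size, and the figure of roughly 2 bits per identical residue are assumptions chosen for illustration, and the length corrections that BLAST applies are ignored.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value from E = m * n * 2**(-S').

    Simplified: ignores the edge-effect length corrections used by BLAST.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Assumed for illustration only: a database of 1e8 residues and
# roughly 2 bits of score per identically matched residue.
DATABASE_LENGTH = 10**8
BITS_PER_IDENTICAL_RESIDUE = 2.0

for length in (12, 30, 100):
    score = BITS_PER_IDENTICAL_RESIDUE * length
    e_value = expect_value(score, length, DATABASE_LENGTH)
    print(f"identical match of {length:3d} residues: E = {e_value:.3g}")
```

Under these assumptions, the identical 12-residue match has an E value above the default eValueCutoff of 10.0, while the identical 100-residue match has an E value far below 1.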

The Low Complexity filter masks low complexity regions in a sequence to avoid spurious alignments.

The Sequence Identity Cutoff (%) filter removes entries of low sequence similarity. The cutoff value is a percentage between 0 and 100.
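As an illustration of how a percentage cutoff of this kind can be applied, the sketch below computes the percent identity of two already-aligned sequences and compares it with a cutoff. It is a simplified stand-in, not the calculation performed by the search service: the function names are hypothetical, identity is assumed to be identical positions divided by alignment length, and a hit is assumed to pass when its identity is greater than or equal to the cutoff.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent of alignment columns with identical residues ('-' marks a gap)."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        return 0.0
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


def passes_identity_cutoff(aligned_query, aligned_subject, cutoff_percent):
    """Assumed rule: keep the hit when identity >= cutoff (0 keeps everything)."""
    if not 0 <= cutoff_percent <= 100:
        raise ValueError("cutoff must be between 0 and 100")
    return percent_identity(aligned_query, aligned_subject) >= cutoff_percent


# Nine of ten aligned positions are identical in this made-up pair.
print(percent_identity("ACDEFGHIKL", "ACDEFGHIKV"))
print(passes_identity_cutoff("ACDEFGHIKL", "ACDEFGHIKV", 95))
```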

Note: sequences must be at least 12 residues long. For shorter sequences try the Sequence Motif Search.

References

  • BLAST: Sequence searching using NCBI’s BLAST (Basic Local Alignment Search Tool) program. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 215: 403-410 (1990).
  • PSI-BLAST: Sequence searching to detect distantly related evolutionary relationships using NCBI’s PSI-BLAST (Position-Specific Iterated BLAST).
class SequenceSimilarity(sequence, searchTool='blast', eValueCutoff=10.0, sequenceIdentityCutoff=0, maskLowComplexity=True)

Bases: object

Methods

__call__(t) Call self as a function.
BLAST = 'blast'
PSI_BLAST = 'psi-blast'
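
Because the class defines __call__(t), an instance can be passed wherever a predicate function is expected, presumably to the filter method of a Spark RDD. The sketch below shows only that callable-filter pattern. PrecomputedHitFilter, the (structure_id, structure) tuple layout, and the IDs are hypothetical; the real class obtains its hits from a sequence search rather than from a fixed set, and Python's built-in filter stands in for an RDD filter here.

```python
class PrecomputedHitFilter:
    """Hypothetical stand-in: keeps entries whose ID is in a fixed set of hits."""

    def __init__(self, hit_ids):
        # Normalize case so that '1abc' and '1ABC' are treated as the same ID.
        self.hit_ids = {hit_id.upper() for hit_id in hit_ids}

    def __call__(self, t):
        # t is assumed to be a (structure_id, structure) tuple.
        structure_id = t[0]
        return structure_id.upper() in self.hit_ids


# Made-up entries; the second element stands in for a structure object.
entries = [("1ABC", "structure-1"), ("2XYZ", "structure-2"), ("3DEF", "structure-3")]

keep = PrecomputedHitFilter(["1abc", "3DEF"])
kept_ids = [structure_id for structure_id, _ in filter(keep, entries)]
print(kept_ids)
```

With the actual class, the equivalent call would construct SequenceSimilarity with a query sequence and the parameters listed in the signature above, for example searchTool=SequenceSimilarity.PSI_BLAST, and pass the resulting object as the filter.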