Source code for mmtfPyspark.filters.containsSequenceRegex

#!/usr/bin/env python
'''containsSequenceRegex.py:

This filter returns True if a polymer sequence in the structure contains a match for the specified regular expression.
Sequence motifs support the following one-letter codes:
- 20 standard amino acids,
- O for Pyrrolysine,
- U for Selenocysteine,
- X for non-standard amino acid

References
----------
- Sequence motif: https://en.wikipedia.org/wiki/Sequence_motif

Examples
--------
Short sequence fragment -- NPPTP:
    The motif search supports wildcard queries by placing a '.' at the
    variable residue position. A query for an SH3 domain using the
    consensus sequence -X-P-P-X-P (where X is a variable residue and P is
    Proline) can be expressed as: .PP.P

Runs of variable residues are specified by the {n} notation, where n is
the number of variable residues. To query a motif with seven variable
residues between W and G and twenty variable residues between G and L, use
the following notation:
    W.{7}G.{20}L

Variable ranges are expressed by the {n,m} notation, where n is the minimum
and m the maximum number of repetitions. For example, the zinc finger motif
that binds Zn in a DNA-binding domain can be expressed as:
    C.{2,4}C.{12}H.{3,5}H

The '^' operator searches for sequence motifs at the beginning of a protein
sequence. The following two queries find sequences with N-terminal Histidine
tags:
    ^HHHHHH or ^H{6}

Square brackets specify alternative residues at a particular position.
The Walker (P loop) motif that binds ATP or GTP can be expressed as:
    [AG].{4}GK[ST]
    A or G is followed by 4 variable residues, then G and K, and finally
    S or T

'''
__author__ = "Mars (Shih-Cheng) Huang"
__maintainer__ = "Mars (Shih-Cheng) Huang"
__email__ = "marshuang80@gmail.com"
__version__ = "0.2.0"
__status__ = "Done"

import re


class ContainsSequenceRegex(object):
    '''This filter returns True if a polymer sequence in the structure
    contains a match for the specified regular expression.

    Attributes
    ----------
    regularExpression : str
       A regular expression describing the protein sequence motif
       (stored as self.regex)
    '''

    def __init__(self, regularExpression):
        self.regex = regularExpression

    def __call__(self, t):
        structure = t[1]
        entity_list = [b['sequence'] for b in structure.entity_list]

        # Pass the structure if any non-empty entity sequence contains a match
        for entity in entity_list:
            if len(entity) > 0:
                if re.search(self.regex, entity) is not None:
                    return True
        return False
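

# Illustrative sketch, not part of the original module: the motif patterns
# from the module docstring, checked with re.search against made-up toy
# sequences. The sequences and the helper name `_example_motif_hits` are
# hypothetical and were chosen only so that each pattern has an obvious
# outcome; they are not taken from real PDB entries.
import re


def _example_motif_hits():
    '''Return a list of (pattern, sequence, matched) tuples for toy inputs.'''
    examples = [
        # Wildcard positions: '.' stands for any single residue (NPPTP)
        ('.PP.P', 'AANPPTPAA'),
        # Fixed-length runs of variable residues: .{7} and .{20}
        ('W.{7}G.{20}L', 'W' + 'A' * 7 + 'G' + 'A' * 20 + 'L'),
        # Variable-length runs: .{2,4} and .{3,5} (zinc finger pattern)
        ('C.{2,4}C.{12}H.{3,5}H',
         'C' + 'A' * 3 + 'C' + 'A' * 12 + 'H' + 'A' * 4 + 'H'),
        # '^' anchors the motif at the start of the sequence
        ('^H{6}', 'HHHHHHMKT'),
        # Same His tag at the C-terminus: the anchored pattern must not match
        ('^H{6}', 'MKTHHHHHH'),
        # Square brackets list alternative residues (Walker / P-loop pattern)
        ('[AG].{4}GK[ST]', 'GAAAAGKS'),
    ]
    return [(pattern, sequence, re.search(pattern, sequence) is not None)
            for pattern, sequence in examples]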