Package schrodinger :: Package protein :: Module sequence

Module sequence

This module provides iteration over all sequences in a given protein CT, and over the residues of each sequence (in sequence order).

Classes
  Sequence
Class representing a sequence of protein residues.
Functions
 
get_structure_sequences(st)
Iterates over all sequences in the given structure.
 
find_generalized_pattern(sequence_list, pattern, validate_pattern=False)
Finds a generalized sequence pattern within specified sequences.
 
convert_sequence_for_pattern_search(seq, sasa_by_atom=None)
Converts a Sequence object to the dictionary required by the find_generalized_pattern function.

find_pattern(seq, pattern)
Finds pattern matches in a specified Sequence object. Returns a list of lists of integer tuples, or None.
Function Details

find_generalized_pattern(sequence_list, pattern, validate_pattern=False)

 

Finds a generalized sequence pattern within specified sequences.
NOTE: The search is performed in the forward direction only.

@type sequence_list: list of dictionaries
@param sequence_list: List of sequences to search.

       Each sequence is a dictionary including amino acid strings
       and associated data. The amino acid string should be ungapped.
       The dictionary should include 'amino_acids' key and optionally
       'secondary_structure', 'solvent_accessibility' and 'flexibility'
       keys:

       - sequence is expressed as uppercase 1-letter code:

           sequence['amino_acids'] = 'ACDEFGHIKLMNPQRSTVWY'
               (sequence in 1-letter code, mandatory)

       - secondary structure uses 'h' (helical), 'e' (extended) and ' ' symbols:

           sequence['secondary_structure'] = 'hhhhh     eeeee     '

       - solvent accessibility uses 'e' (exposed) and ' ' (buried)
           symbols:

           sequence['solvent_accessibility'] = 'eeeee     eeeee     '

       - flexibility uses 'f' (flexible) and ' ' (not flexible)
           symbols:

           sequence['flexibility'] = '     fffff     fffff'
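
A minimal sketch of assembling such a dictionary by hand, using the example strings above. The check that every annotation string has the same length as 'amino_acids' is an assumption inferred from those examples, not a documented requirement.

```python
# Illustration data only, copied from the examples above.
sequence = {
    'amino_acids': 'ACDEFGHIKLMNPQRSTVWY',
    'secondary_structure': 'h' * 5 + ' ' * 5 + 'e' * 5 + ' ' * 5,
    'solvent_accessibility': 'e' * 5 + ' ' * 5 + 'e' * 5 + ' ' * 5,
    'flexibility': ' ' * 5 + 'f' * 5 + ' ' * 5 + 'f' * 5,
}

# Assumption: annotations are aligned position-by-position with the
# amino acid string, so all values should have the same length.
lengths = {key: len(value) for key, value in sequence.items()}
if len(set(lengths.values())) != 1:
    raise ValueError('annotation lengths differ: %r' % lengths)

# find_generalized_pattern expects a list of such dictionaries.
sequence_list = [sequence]
```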

@type pattern: string
@param pattern: Pattern defined using extended PROSITE syntax.

       - standard IUPAC one-letter codes are used for all amino acids

       - each element in a pattern is separated using '-' symbol

       - symbol 'x' is used for position where any amino acid is accepted

       - ambiguities are listed using the acceptable amino acids between
         square brackets, e.g. [ACT] means Ala, Cys or Thr

       - amino acids not accepted for a given position are indicated
         by listing them between curly brackets, e.g. {GP} means 'not Gly
         and not Pro'

       - repetition is indicated using parentheses, e.g. A(3) means
         Ala-Ala-Ala, x(2,4) means 2 to 4 residues of any type

       - the following lowercase characters can be used as additional
         flags:

           - 'x' means any amino acid
           - 'a' means acidic residue: [DE]
           - 'b' means basic residue: [KR]
           - 'o' means hydrophobic residue: [ACFILPWVY]
           - 'p' means aromatic residue: [WYF]
           - 's' means solvent exposed residue
           - 'h' means helical residue
           - 'e' means extended residue
           - 'f' means flexible residue

       - Each position can optionally be followed by an @<res_num> expression
         that restricts the match at that position to the given residue number.

       - Entire pattern can be followed by :<index> expression that defines
         a 'hotspot' in the pattern. When the hotspot is defined, only
         a single residue corresponding to (pattern_match_start+index-1)
         will be returned as a match. The index is 1-based and can be used
         to place the hotspot outside of the pattern (can also be
         a negative number).
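
The hotspot arithmetic can be worked through with a small sketch. The helper name hotspot_position is hypothetical; it only restates the (pattern_match_start+index-1) formula quoted above and is not part of this module.

```python
def hotspot_position(match_start, index):
    """Return match_start + index - 1, the formula quoted above."""
    return match_start + index - 1

# For a match that starts at position 10:
first = hotspot_position(10, 1)    # index 1 selects the first pattern residue
third = hotspot_position(10, 3)    # index 3 selects the third pattern residue
outside = hotspot_position(10, -1) # a negative index lands before the match
print(first, third, outside)       # prints: 10 12 8
```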

    Pattern examples:

        - N-{P}-[ST] : Asn-X-Ser or Thr (X != Pro)
        - N[sf]-{P}[sf]-[ST][sf] : as above, but all residues flexible
          OR solvent exposed
        - Nsf-{P}sf-[ST]sf : as above, but all residues flexible
          AND solvent exposed
        - Ns{f} : Asn solvent exposed AND not in flexible region
        - N[s{f}] : Asn solvent exposed OR not in flexible region
        - [ab]{K}{s}f : acidic OR basic, with exception of Lys,
          flexible AND not solvent exposed
        - Ahe : Ala helical AND extended - no match possible
        - A[he] : Ala helical OR extended
        - A{he} : Ala coiled chain conformation (neither helical nor extended)
        - [ST] : Ser OR Thr
        - ST : Ser AND Thr - no match possible
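
To make the amino acid part of the syntax concrete, the sketch below translates a restricted subset of it into a Python regular expression. This is an illustration only and not how find_generalized_pattern is implemented: the names prosite_to_regex and find_matches are hypothetical, the structural flags (s, h, e, f), @<res_num> and :<index> are not supported, and matches are reported as non-overlapping 0-based, end-exclusive spans, which may differ from the position convention of this module.

```python
import re

# Lowercase flags that are defined purely by amino acid type
# (classes taken from the list above).
RESIDUE_FLAGS = {'x': '.', 'a': '[DE]', 'b': '[KR]',
                 'o': '[ACFILPWVY]', 'p': '[WYF]'}

# One pattern element: a residue, flag, [..] or {..} group,
# optionally followed by (n) or (n,m).
ELEMENT = re.compile(
    r'([A-Z]|[xabop]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?')


def prosite_to_regex(pattern):
    """Translate a restricted PROSITE-style pattern to a regex string."""
    parts = []
    for element in pattern.split('-'):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError('unsupported pattern element: %r' % element)
        core, low, high = m.groups()
        if core.startswith('{'):
            core = '[^' + core[1:-1] + ']'      # excluded residues
        elif core in RESIDUE_FLAGS:
            core = RESIDUE_FLAGS[core]          # residue class flag
        if low is not None:
            core += '{%s}' % (low if high is None else low + ',' + high)
        parts.append(core)
    return ''.join(parts)


def find_matches(amino_acids, pattern):
    """Return non-overlapping (start, end) spans, 0-based, end-exclusive."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.span() for m in regex.finditer(amino_acids)]


print(prosite_to_regex('N-{P}-[ST]'))               # N[^P][ST]
print(find_matches('AANGSAANPTA', 'N-{P}-[ST]'))    # [(2, 5)]
```

In the second call only Asn-Gly-Ser is reported; Asn-Pro-Thr is rejected by the {P} element.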

@type validate_pattern: boolean
@param validate_pattern: If True, the function will validate the pattern
         without performing the search (the sequence_list parameter will be
         ignored) and return True if the pattern is valid, or False
         otherwise. The default is False.

@rtype: list of lists of integer tuples or False if the pattern is invalid
@return: None if the specified input pattern was incorrect.
         Otherwise, it returns a list of lists of matches for each
         input sequence. Each match is a (start, end) tuple where
         start and end are matching sequence positions.
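
A minimal sketch of consuming the return value. The helper name summarize_matches and the sample data are hypothetical. Because the fields above mention both None and False for an invalid pattern, the sketch treats either value as "no results".

```python
def summarize_matches(results):
    """Flatten per-sequence match lists into (sequence_index, start, end)."""
    if results is None or results is False:
        # Invalid pattern: nothing to report.
        return []
    return [(seq_index, start, end)
            for seq_index, matches in enumerate(results)
            for start, end in matches]


# Made-up result for three input sequences; the second has no matches.
example = [[(3, 5), (10, 12)], [], [(1, 3)]]
for seq_index, start, end in summarize_matches(example):
    print('sequence %d: match %d-%d' % (seq_index, start, end))
```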

convert_sequence_for_pattern_search(seq, sasa_by_atom=None)

 

Converts a Sequence object to the dictionary required by the find_generalized_pattern function. Because the conversion can be time-consuming, it should be done once per sequence.

Optionally, a list of SASA values for each atom in the CT can be specified. If it is not specified, it is calculated by calling analyze.calculate_sasa_by_atom().
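
A sketch of how the functions in this module might be combined, under several assumptions: that the module is importable as schrodinger.protein.sequence, that structures are read with schrodinger.structure.StructureReader, and that "analyze" refers to schrodinger.structutils.analyze. The function name find_sites and the default pattern are hypothetical. The imports are placed inside the function so the sketch can be loaded without a Schrodinger installation.

```python
def find_sites(structure_file, pattern='N-{P}-[ST]'):
    """Search every sequence of every structure in a file for a pattern.

    Returns one find_generalized_pattern result per structure.
    """
    # Assumed import paths; see the lead-in above.
    from schrodinger import structure
    from schrodinger.protein import sequence
    from schrodinger.structutils import analyze

    results = []
    for st in structure.StructureReader(structure_file):
        # Compute per-atom SASA once per CT and reuse it for every
        # sequence, instead of letting each conversion recompute it.
        sasa = analyze.calculate_sasa_by_atom(st)
        # Convert each sequence once; the conversion can be slow.
        converted = [
            sequence.convert_sequence_for_pattern_search(
                seq, sasa_by_atom=sasa)
            for seq in sequence.get_structure_sequences(st)
        ]
        results.append(sequence.find_generalized_pattern(converted, pattern))
    return results
```

Keeping the converted dictionaries around allows several patterns to be searched without repeating the conversion.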

find_pattern(seq, pattern)

 

Finds pattern matches in a specified Sequence object. Returns a list of matching positions.

Parameters:
  • seq (Sequence) - Sequence object to search.
  • pattern (string) - Sequence pattern. The syntax is described in find_generalized_pattern.
Returns: list of lists of integer tuples or None
None if the specified input pattern was incorrect. Otherwise, it returns a list of lists of matches for each residue position in the input structure. Each match is a (start, end) tuple where start and end are matching sequence positions. If 'hotspot' is specified then start = end.