find_generalized_pattern(sequence_list,
pattern,
validate_pattern=False)
|
|
Finds a generalized sequence pattern within specified sequences.
NOTE: The search is performed in the forward direction only.
@type sequence_list: list of dict
@param sequence_list: Sequences to search.
Each sequence is a dictionary including amino acid strings
and associated data. The amino acid string should be ungapped.
The dictionary should include 'amino_acids' key and optionally
'secondary_structure', 'solvent_accessibility' and 'flexibility'
keys:
- sequence is expressed as uppercase 1-letter code:
sequence['amino_acids'] = 'ACDEFGHIKLMNPQRSTVWY'
(sequence in 1-letter code, mandatory)
- secondary structure uses 'h', 'e' and ' ' symbols:
sequence['secondary_structure'] = 'hhhhh eeeee '
- solvent accessibility uses 'e' (exposed) and ' ' (buried)
symbols:
sequence['solvent_accessibility'] = 'eeeee eeeee '
- flexibility uses 'f' (flexible) and ' ' (not flexible)
symbols:
sequence['flexibility'] = ' fffff fffff'
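A minimal sketch of building one entry of sequence_list. The helper name
make_sequence is hypothetical, and the check that every annotation string has
the same length as the amino acid string is an assumption made for this
example, not something the function is documented to enforce.

```python
def make_sequence(amino_acids, secondary_structure=None,
                  solvent_accessibility=None, flexibility=None):
    """Build one sequence dictionary (hypothetical helper, not part of the API)."""
    # The amino acid string is mandatory, ungapped, uppercase 1-letter code.
    if not (amino_acids.isalpha() and amino_acids.isupper()):
        raise ValueError("amino_acids must be ungapped uppercase 1-letter code")

    sequence = {'amino_acids': amino_acids}
    optional = {
        'secondary_structure': secondary_structure,      # 'h', 'e' or ' '
        'solvent_accessibility': solvent_accessibility,  # 'e' or ' '
        'flexibility': flexibility,                      # 'f' or ' '
    }
    for key, value in optional.items():
        if value is None:
            continue
        # Assumption: one annotation symbol per residue.
        if len(value) != len(amino_acids):
            raise ValueError(f"{key} must have one symbol per residue")
        sequence[key] = value
    return sequence


seq = make_sequence('NGSACDEF',
                    secondary_structure='hhh  eee',
                    flexibility=' ff     ')
sequence_list = [seq]
```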
@type pattern: str
@param pattern: Pattern defined using extended PROSITE syntax.
- standard IUPAC one-letter codes are used for all amino acids
- each element in a pattern is separated using '-' symbol
- symbol 'x' is used for position where any amino acid is accepted
- ambiguities are listed using the acceptable amino acids between
square brackets, e.g. [ACT] means Ala, Cys or Thr
- amino acids not accepted for a given position are indicated
by listing them between curly brackets, e.g. {GP} means 'not Gly
and not Pro'
- repetition is indicated using parentheses, e.g. A(3) means
Ala-Ala-Ala, x(2,4) means 2 to 4 residues of any type
- the following lowercase characters can be used as additional
flags:
- 'x' means any amino acid
- 'a' means acidic residue: [DE]
- 'b' means basic residue: [KR]
- 'o' means hydrophobic residue: [ACFILPWVY]
- 'p' means aromatic residue: [WYF]
- 's' means solvent exposed residue
- 'h' means helical residue
- 'e' means extended residue
- 'f' means flexible residue
- Each position can optionally be followed by an @<res_num> expression
that restricts the match at that position to the given residue number.
- Entire pattern can be followed by :<index> expression that defines
a 'hotspot' in the pattern. When the hotspot is defined, only
a single residue corresponding to (pattern_match_start+index-1)
will be returned as a match. The index is 1-based and can be used
to place the hotspot outside of the pattern (can also be
a negative number).
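The hotspot arithmetic can be sketched as follows. The helper name
hotspot_position is hypothetical; it only restates the expression
(pattern_match_start+index-1) given above and makes no assumption about
whether sequence positions themselves are counted from 0 or 1.

```python
def hotspot_position(pattern_match_start, index):
    """Return the single residue position selected by a ':<index>' hotspot."""
    # index is 1-based: index 1 selects the first residue of the match.
    return pattern_match_start + index - 1


# For a match starting at position 10:
first = hotspot_position(10, 1)    # the first matched residue
third = hotspot_position(10, 3)    # two residues after the match start
before = hotspot_position(10, -1)  # a negative index lands before the match
```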
Pattern examples:
- N-{P}-[ST] : Asn-X-Ser or Thr (X != Pro)
- N[sf]-{P}[sf]-[ST][sf] : as above, but all residues flexible
OR solvent exposed
- Nsf-{P}sf-[ST]sf : as above, but all residues flexible
AND solvent exposed
- Ns{f} : Asn solvent exposed AND not in flexible region
- N[s{f}] : Asn solvent exposed OR not in flexible region
- [ab]{K}{s}f : acidic OR basic, with exception of Lys,
flexible AND not solvent exposed
- Ahe : Ala helical AND extended - no match possible
- A[he] : Ala helical OR extended
- A{he} : Ala coiled chain conformation (neither helical nor extended)
- [ST] : Ser OR Thr
- ST : Ser AND Thr - no match possible
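To illustrate how the standard PROSITE elements read, the sketch below
translates a small subset of the syntax into a Python regular expression.
This is not how find_generalized_pattern is implemented. The helper
prosite_to_regex is hypothetical and handles only single residues, x, [...],
{...} and repetition separated by '-'. It does not support the lowercase
flags, @<res_num> or :<index> extensions, and the spans reported by the re
module follow Python's conventions rather than this function's.

```python
import re

# One pattern element: a core symbol plus an optional (n) or (n,m) repetition.
_ELEMENT = re.compile(
    r'(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?')


def prosite_to_regex(pattern):
    """Translate a plain PROSITE pattern into a regex (illustrative subset only)."""
    parts = []
    for element in pattern.split('-'):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == 'x':
            regex = '.'                      # any amino acid
        elif core.startswith('{'):
            regex = '[^' + core[1:-1] + ']'  # excluded amino acids
        else:
            regex = core                     # single residue or [...] set
        if low is not None:
            regex += '{' + low + (',' + high if high else '') + '}'
        parts.append(regex)
    return ''.join(parts)


glyco = prosite_to_regex('N-{P}-[ST]')
hit = re.search(glyco, 'ANGSANPT')  # 'NGS' matches, 'NPT' is rejected
```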
@type validate_pattern: boolean
@param validate_pattern: If True, the function will validate the pattern
without performing the search (the sequence_list parameter will be
ignored) and return True if the pattern is valid, or False
otherwise. The default is False.
@rtype: list of lists of integer tuples or False if the pattern is invalid
@return: None if the specified input pattern was incorrect.
Otherwise, it returns a list of lists of matches for each
input sequence. Each match is a (start, end) tuple where
start and end are matching sequence positions.
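A sketch of how the returned structure could be consumed. The helper
matches_by_name and the sample result are hypothetical and only illustrate
the shape described above (one inner list per input sequence, in input
order). Because an invalid pattern is described both as returning False and
as returning None, the sketch checks for either.

```python
def matches_by_name(result, names):
    """Pair each input sequence name with its list of (start, end) matches."""
    # An invalid pattern is reported as False or None; an empty list is valid.
    if result is None or result is False:
        raise ValueError("the pattern was rejected as invalid")
    if len(result) != len(names):
        raise ValueError("expected one list of matches per input sequence")
    return {name: list(matches) for name, matches in zip(names, result)}


# Hypothetical result for two input sequences: one match in the first,
# no matches in the second.
sample_result = [[(2, 4)], []]
summary = matches_by_name(sample_result, ['chain_A', 'chain_B'])
```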
|