schrodinger.protein.alignment module

Classes for working with sequences containing alignment information (gaps) and collections thereof.

Copyright Schrodinger, LLC. All rights reserved.

class schrodinger.protein.alignment.AlignmentSignals

Bases: PyQt5.QtCore.QObject

A collection of signals that can be emitted by an alignment

Variables:
  • sequencesAboutToBeInserted (QtCore.pyqtSignal) – A signal emitted before sequences are inserted into the alignment. Emitted with: (The index of the first sequence to be inserted, The index of the last sequence to be inserted)
  • sequencesInserted (QtCore.pyqtSignal) – A signal emitted after sequences are inserted into the alignment. Emitted with: (The index of the first sequence inserted, The index of the last sequence inserted)
  • sequencesAboutToBeRemoved (QtCore.pyqtSignal) – A signal emitted before sequences are removed from the alignment. Emitted with: (The index of the first sequence to be removed, The index of the last sequence to be removed)
  • sequencesRemoved (QtCore.pyqtSignal) – A signal emitted after sequences are removed from the alignment. Emitted with: (The index of the first sequence removed, The index of the last sequence removed)
  • sequenceResiduesChanged (QtCore.pyqtSignal) – A signal emitted after the contents of a sequence have changed. Note that this signal may also be emitted in response to a sequence changing length, as positions in the alignment may switch from blank to occupied or vice versa. Emitted with: (The modified sequence, The position of the first modified residue, The position of the last modified residue)
  • sequencesAboutToBeReordered (QtCore.pyqtSignal) – A signal emitted before sequences are reordered.
  • sequencesReordered (QtCore.pyqtSignal) – A signal emitted after sequences have been reordered.
  • sequenceNameChanged (QtCore.pyqtSignal) – A signal emitted after a sequence has changed names. Emitted with: (The modified sequence)
  • annotationTitleChanged (QtCore.pyqtSignal) – A signal emitted after a sequence’s annotation has changed titles. Emitted with: (The sequence whose annotation title has been modified)
  • alignmentLengthAboutToChange (QtCore.pyqtSignal) – A signal emitted before the alignment changes length. Emitted with: (The current length of the alignment, The new length of the alignment)
  • alignmentLengthChanged (QtCore.pyqtSignal) – A signal emitted after the alignment changes length. Emitted with: (The old length of the alignment, The current length of the alignment)
  • residuesRemoved (QtCore.pyqtSignal) – A signal emitted with a residue selection of removed residues. Note that this signal will only be emitted once even if residues are removed from multiple sequences. In addition, each individual sequence will emit a lengthChanged signal.
  • residuesAdded (QtCore.pyqtSignal) – A signal emitted with a residue selection of added residues. Note that this signal will only be emitted once even if residues are added to multiple sequences. In addition, each individual sequence will emit a lengthChanged signal.
  • sequenceVisibilityChanged (QtCore.pyqtSignal) – A signal emitted when visibility of a sequence changes. Emitted with: (the sequence whose visibility is changing, the index of the sequence)
  • sequenceStructureChanged (QtCore.pyqtSignal) – A signal emitted when structure of a sequence changes. Emitted with: (the sequence whose structure is changing, the index of the sequence)
  • alignmentAboutToBeCleared (QtCore.pyqtSignal) – A signal emitted just before all sequences are removed from the alignment.
  • alignmentCleared (QtCore.pyqtSignal) – A signal emitted just after all sequences have been removed from the alignment.

alignmentAboutToBeCleared
alignmentCleared
alignmentLengthAboutToChange
alignmentLengthChanged
annotationTitleChanged
emitAnnTitleChanged()
emitSeqNameChanged()
emitSeqResChanged(first_res, last_res)
residuesAdded
residuesRemoved
sequenceNameChanged
sequenceResiduesChanged
sequenceStructureChanged
sequenceVisibilityChanged
sequencesAboutToBeInserted
sequencesAboutToBeRemoved
sequencesAboutToBeReordered
sequencesInserted
sequencesRemoved
sequencesReordered
class schrodinger.protein.alignment.BaseAlignment(sequences=None)

Bases: PyQt5.QtCore.QObject

Abstract base class for classes which handle alignment of various sequences and corresponding annotations.

This is a pure domain object intended to make it easy to work with aligned collections of sequences.

Some methods are decorated with @msv_utils.const in order to make it easy to write a wrapper for this class that supports undo/redo operations.

addGaps(gap_indices)

Adds gaps to the alignment

Note:the length of the gap_indices list must match the number of sequences in the alignment.
Parameters:gap_indices – A list of lists of gap indices, one for each sequence in the alignment.
addOrReplaceSeqs(seqs, identifier_func)

Given seqs and an identifier_func, replaces seqs in the alignment matching the identifier_func and appends any additional seqs to the alignment

Parameters:
  • seqs (iterable of schrodinger.protein.sequence.Sequence) – The sequences to add to the alignment
  • identifier_func (callable) – A key function to uniquely identify sequences
addResidues(selection)

Adds the specified residues to the alignment

Parameters:selection (ResidueSelection) – A selection of residues
addSeq(seq, index=None)
Parameters:
  • seq (sequence.Sequence) – The sequence to add
  • index (int) – The index at which to insert; if None, seq is appended
addSeqs(sequences, start=None)

Add multiple sequences to the alignment

Parameters:
  • sequences (list of sequence.Sequence) – Sequences to add
  • start (int) – The index at which to insert; if None, the sequences are appended
addSeqsByIndices(seq_index_map)

Insert sequences at the specified indices in the alignment. The sequences are added from lowest to highest index, so an index may be out of range of the current alignment as long as it comes into range once the lower-indexed sequences have been added. Any sequence whose index is still out of range at that point is simply appended to the end of the alignment.

Parameters:seq_index_map – Map of insertion indices to sequences to be added.
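
The insertion order described above can be illustrated with a plain Python list standing in for the alignment. This is only a sketch of the documented behavior, not the library's implementation; the function name add_seqs_by_indices and the use of strings as sequences are assumptions made for the example.

```python
def add_seqs_by_indices(alignment, seq_index_map):
    """Insert sequences into a list at the requested indices, lowest index first.

    Hypothetical stand-in for BaseAlignment.addSeqsByIndices; `alignment` is a
    plain list here, not a schrodinger alignment object.
    """
    for index in sorted(seq_index_map):
        # list.insert clamps indices past the end, so a sequence whose index
        # is still out of range is appended.
        alignment.insert(index, seq_index_map[index])
    return alignment


rows = add_seqs_by_indices(["a", "b"], {3: "d", 2: "c", 10: "z"})
```

Index 3 is out of range for the initial two-row list, but becomes valid once the sequence at index 2 has been inserted; index 10 never becomes valid, so its sequence ends up last.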
alignmentLocked()

Whether every column in the alignment is locked

Return type:bool
Returns:Whether the alignment is locked
all_annotations

Return a list of all annotation types in this alignment

appendSubalignment(aln)

Append an alignment to this one

Parameters:aln (BaseAlignment or list of Sequence) – The alignment to append
calculateMatrix()

Calculates a substitution matrix based on the current alignment.

clear()

Remove all sequences and locked columns from the alignment.

columnHasAllSameResidues(index)

Return whether or not the column at a specified index has all the same residues (excluding gaps).

Note that if any unknown residues are present, the column will not be considered to be of all the same residue type.

Parameters:index (int) – Index to check for uniformity
Returns:True if the column is of uniform identity, False otherwise.
Return type:bool
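
A minimal sketch of the uniformity rule, using strings as sequences. The gap character "~", the unknown-residue code "X", and the choice to return False for a column with no residues are assumptions for illustration; the library works on residue objects and may differ in these details.

```python
GAP = "~"      # assumed gap character
UNKNOWN = "X"  # assumed code for an unknown residue


def column_has_all_same_residues(rows, index):
    """Return True if every non-gap residue in the column is the same known type."""
    column = [row[index] for row in rows if index < len(row)]
    residues = [ch for ch in column if ch != GAP]
    # An unknown residue (or an empty column, by assumption) is never uniform.
    if not residues or UNKNOWN in residues:
        return False
    return len(set(residues)) == 1


rows = ["ACD", "A~C", "AXD"]
```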
columns(omit_gaps=False)

Returns a range of alignment columns or all columns if indices are not specified.

Parameters:omit_gaps (bool) – Whether to omit gaps
getAlignedBlocks()

Returns the indices of aligned blocks (regions without gaps).

getAlignmentQualityByColumn(col_index)

Retrieve the alignment quality at a given column and update the cache if necessary.

Parameters:col_index (int) – Column of the residue
getColumn(index, omit_gaps=False)

Returns single alignment column at index position. Optionally, filters out gaps if omit_gaps is True.

Parameters:
  • index (int) – The index in the alignment
  • omit_gaps (bool) – Whether to omit the gaps
Return type:

list

Returns:

Single alignment column at index position.

getDiscontinuousSubalignment(indices)

Given a list of indices, return a new alignment of sequences made up of the residues at those specified indices within this alignment.

Parameters:indices (list of (int, int)) – List of (seq index, residue index) tuples
Returns:A new subalignment
Return type:BaseAlignment
getEntropy(frequencies)

Returns an alignment length array of residue entropy scores

getFrequencies(exclude=None, consider_gaps=False)

Returns a dict mapping residue types to their frequency in the alignment

Parameters:
  • exclude (list) – A list of sequences to exclude
  • consider_gaps (bool) – Whether to consider gaps in calculating frequencies
getGapIndicesByKeyFunc(gap_info, key_func)

Converts a gap_info list and key func into a list of gap indices

Gap information consists of (key for residue, number of gaps preceding it)

Parameters:
  • gap_info (list) – list of list of tuples
  • key_func (function) – callable that takes a residue and returns a key
Return type:

list of lists of int

Returns:

A list of gaps for each sequence in the alignment

getGapOnlyColumns()

Returns a list of lists of indices for unlocked columns that contain only gaps

Return type:list
Returns:List of list of indices
getGaps()

Returns a list of gap indices lists

Return type:list
Returns:A list of lists of ints
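
The shape of the return value (one list of gap indices per sequence) can be shown with strings standing in for sequences. This is an illustrative sketch only; the gap character "~" and the function name are assumptions.

```python
GAP = "~"  # assumed gap character


def get_gaps(rows):
    """Return one list of gap indices per row."""
    return [[i for i, ch in enumerate(row) if ch == GAP] for row in rows]


gaps = get_gaps(["A~C~", "~~AC", "ACDE"])
```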
getGapsByKeyFunc(key_func)

Given a key function to uniquely identify residues, build a list of lists with gap information for each sequence in the alignment

Gap information consists of (key for residue, number of gaps preceding it)

Parameters:key_func (function) – callable that takes a residue and returns a key
Return type:list
Returns:A list of lists with gaps information for each sequence in the alignment
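
As a rough sketch of the (key, number of preceding gaps) representation, the following uses lists in which None marks a gap and any other value is a residue. Recording only residues that are preceded by at least one gap, and ignoring trailing gaps, are assumptions made for the example; the library's exact conventions are not stated here.

```python
def gaps_by_key_func(rows, key_func):
    """Build per-row lists of (residue key, number of gaps preceding it)."""
    gap_info = []
    for row in rows:
        seq_info = []
        run = 0  # gaps seen since the previous residue
        for res in row:
            if res is None:
                run += 1
                continue
            if run:
                seq_info.append((key_func(res), run))
            run = 0
        gap_info.append(seq_info)
    return gap_info


rows = [[None, None, "A1", "C2", None, "D3"], ["A1", "C2", "D3", None]]
gap_info = gaps_by_key_func(rows, key_func=lambda res: res)
```

Because the gaps are tied to residue keys rather than to column indices, the same information can be reapplied to another copy of the sequences in which the residues sit at different positions.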
getGlobalAnnotationData(index, annotation)

Returns column-level annotation data at an index in the alignment

Parameters:
  • index (int) – The index in the alignment
  • annotation (enum.Enum) – An enum representing the requested annotation, if any
getHiddenSeqCount()

Return the number of sequences in the alignment that have an associated PT entry ID but are not currently visible in the Workspace.

Returns:number of hidden sequences
Return type:int
getIdentities(omit_gaps=True)

Returns an alignment-length list of bools indicating which columns have identical residues

Parameters:omit_gaps (bool) – Whether gaps should be excluded from a column.
getRedundantSequences(value)

Returns the indices of sequences below a specified identity threshold value.

Returns:The indices of sequences in the alignment below specified identity threshold
Return type:list of int
getReferenceSeq()

Returns the sequence that has been set as reference sequence or None if there is no reference sequence.

Returns:The reference sequence or None
Return type:Sequence or None
getResidueData(seqnum, index, annotation=None)

Returns residue-level data for the specified sequence at the specified index in the alignment, or None if no data is available.

If annotation is specified, the residue-level information for the residue is returned. If not, the residue object itself is returned.

Parameters:
  • seqnum (int) – The index of the sequence in the alignment
  • index (int) – The index of the residue in the sequence
  • annotation (enum.Enum) – An enum representing the requested annotation, if any
getResidueIndices(residues)

Returns the indices (in the alignment) of the specified residues

Parameters:residues
Return type:list of (sequence index, residue index) tuples
Returns:A list of (int, int)
static getReversedSequenceOrdering(seq_indices)

Given a new ordering for sequences in an alignment, return an ordering that will restore the original order of sequences.

Given an alignment [a, b, c, d, e], an ordering of [3, 1, 4, 2, 0] will rearrange the sequences into [d, b, e, c, a]. We need an ordering of [4, 1, 3, 0, 2] to restore the original arrangement of [a, b, c, d, e]. This method is used in undo operations.

Parameters:seq_indices (list of int) – A list with the new indices for sequences
Return type:list of int
Returns:An ordering list that will restore the original arrangement of sequences in the alignment
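
The restoring order is the inverse permutation of seq_indices. The sketch below reproduces the worked example with plain lists; the helper names are hypothetical and this is not the library's code.

```python
def reorder(rows, seq_indices):
    """Apply an ordering: position i of the result holds rows[seq_indices[i]]."""
    return [rows[i] for i in seq_indices]


def reversed_ordering(seq_indices):
    """Return the inverse permutation, which undoes `reorder`."""
    inverse = [0] * len(seq_indices)
    for new_pos, old_pos in enumerate(seq_indices):
        inverse[old_pos] = new_pos
    return inverse


original = ["a", "b", "c", "d", "e"]
shuffled = reorder(original, [3, 1, 4, 2, 0])
undo = reversed_ordering([3, 1, 4, 2, 0])
restored = reorder(shuffled, undo)
```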
getSeqIndex(seq)
Parameters:seq (sequence.Sequence) – The requested sequence
Return type:int
Returns:The index of the requested sequence
getSimilarityScore(seq)

Returns a sequence length array of similarity scores against the reference sequence

Gaps in the sequences are coded as None values.

getSubalignment(start, end)

Return another alignment containing the elements within the specified start and end indices

Parameters:
  • start (int) – The index at which the subalignment should start
  • end (int) – The index at which the subalignment should end
Return type:

BaseAlignment

Returns:

An alignment corresponding to the start and end point specified

getTerminalGaps()

Returns the indices of terminal gaps in all the sequences

Return type:list
Returns:A list of lists of ints
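
A small sketch of the return shape, assuming that "terminal" refers to the trailing run of gaps at the end of each sequence (the text does not say whether leading gaps are included). Strings stand in for sequences and "~" is an assumed gap character.

```python
GAP = "~"  # assumed gap character


def terminal_gap_indices(rows):
    """Return, per row, the indices of the gaps trailing the last residue."""
    result = []
    for row in rows:
        first_terminal = len(row.rstrip(GAP))
        result.append(list(range(first_terminal, len(row))))
    return result


terminal = terminal_gap_indices(["AC~~", "~AC", "A~C~"])
```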
getVisibleSeqCount()

Return the number of visible sequences in the alignment.

Returns:number of visible sequences
Return type:int
global_annotations

Returns the alignment-level annotations available for the alignment

insertSubalignment(aln, start)

Insert an alignment into the current alignment at the specified index

Parameters:
  • aln (BaseAlignment) – The alignment to insert
  • start (int) – The index at which to insert the alignment
isReferenceSeq(seq)

Return whether or not a sequence is the reference sequence.

Parameters:seq (Sequence) – Sequence to check
Returns:True if the sequence is the reference sequence, False otherwise.
Return type:bool
iterResidues()

Yields a sequence of schrodinger.protein.residue.Residue objects in the alignment, omitting gaps.

lockedColumns()

Returns a set with indices of locked columns.

Return type:set
Returns:A set of indices

The set is a copy of our internal set, so modifying it has no effect on our private attribute

makeResidueSelection(residues)

Returns a residue selection object matching the specified residues

Parameters:residues (list) – A list of residues
Return type:ResidueSelection
Returns:An object containing selection information
max_length
classmethod mergePairwiseAlignments(sequence_pairs)

Merges several pairwise alignments into one flat alignment while preserving relative residue positions. The original sequences are modified. After executing this function, all reference sequences (first pair members) will be identical.

Example. Let’s assume we have three pairwise query/template alignments:

Q1: ACDEFGHI
T1: ~~DEF~~~

Q2: ~~~ACDEFGHI
T2: TTT~~DE~~H~

Q3: ACDEF~~GHI~
T3: ACD~~PPGH~Y

Note the reference sequence is identical in all cases, but it has gaps in different positions. After running mergePairwiseAlignments, the result is:

Q1: ~~~ACDEF~~GHI
T1: ~~~~~DEF~~~~~

Q2: ~~~ACDEF~~GHI
T2: TTT~~DE~~~~H~

Q3: ~~~ACDEF~~GHI~
T3: ~~~ACD~~PPGH~Y

Now the queries have gaps in identical positions, and aligned residues are in positions equivalent to those in the original alignments.

Parameters:sequence_pairs (list of list of sequences) – List of [query, template] pairs.
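
One way to obtain the merged result in the example is to find, for each query residue, the longest gap run that precedes it in any pair, and then pad every pair up to that run. The sketch below does this on plain strings and returns new strings rather than modifying sequence objects in place. It is an illustration under those assumptions, not the library's algorithm; the function names and the decision to place added gaps before existing ones are choices made for the example.

```python
GAP = "~"  # gap character used in the example


def gaps_before_residues(query):
    """Return, for each residue in query, the number of gaps directly before it."""
    counts, run = [], 0
    for ch in query:
        if ch == GAP:
            run += 1
        else:
            counts.append(run)
            run = 0
    return counts


def merge_pairwise(pairs):
    """Give every query the same gap pattern; returns new (query, template) strings."""
    all_counts = [gaps_before_residues(query) for query, _ in pairs]
    # Largest gap run required before each query residue across all pairs.
    needed = [max(column) for column in zip(*all_counts)]
    merged = []
    for (query, template), counts in zip(pairs, all_counts):
        out_q = out_t = pending_q = pending_t = ""
        k = 0  # index of the next query residue
        for q_ch, t_ch in zip(query, template):
            pending_q += q_ch
            pending_t += t_ch
            if q_ch != GAP:
                # Pad both rows with the gap columns this pair is missing.
                extra = GAP * (needed[k] - counts[k])
                out_q += extra + pending_q
                out_t += extra + pending_t
                pending_q = pending_t = ""
                k += 1
        # Trailing gap columns are copied through unchanged.
        merged.append((out_q + pending_q, out_t + pending_t))
    return merged


pairs = [
    ("ACDEFGHI", "~~DEF~~~"),
    ("~~~ACDEFGHI", "TTT~~DE~~H~"),
    ("ACDEF~~GHI~", "ACD~~PPGH~Y"),
]
merged = merge_pairwise(pairs)
```

Because each template receives gap columns at exactly the same positions as its query, residues that were aligned in a pairwise alignment stay in the same column after merging.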
minimizeAlignment()

Minimizes the alignment, i.e. removes all gaps from the gap-only columns.
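
The effect of minimizing can be sketched on plain strings: a column is dropped when no sequence has a residue in it. Column locks are ignored here, and the gap character "~" and function name are assumptions for the example.

```python
GAP = "~"  # assumed gap character


def minimize(rows):
    """Drop every column in which all rows hold a gap (or are too short)."""
    width = max((len(row) for row in rows), default=0)
    keep = [
        i for i in range(width)
        if any(i < len(row) and row[i] != GAP for row in rows)
    ]
    return ["".join(row[i] for i in keep if i < len(row)) for row in rows]


minimized = minimize(["A~~C", "A~~~", "~~~G"])
```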

mutateResidues(mutations)

Mutate the residues at the specified locations in the alignment

Note that the individual sequences will emit a signal announcing the mutation

Parameters:mutations (list of tuples (seq_i, res_i, replacement)) – The mutations to apply, each given as a sequence index, a residue index, and the replacement
static padAlignment(aln)

Insert gaps into an alignment so that it forms a rectangular block

Parameters:aln (schrodinger.protein.Alignment) – An alignment to pad
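
Padding to a rectangular block amounts to extending every sequence with gaps until it is as long as the longest one. A minimal sketch on strings, with "~" as an assumed gap character and a hypothetical function name:

```python
GAP = "~"  # assumed gap character


def pad_rows(rows):
    """Append gaps so that all rows have the length of the longest row."""
    width = max((len(row) for row in rows), default=0)
    return [row + GAP * (width - len(row)) for row in rows]


padded = pad_rows(["ACDEF", "AC", "~~D"])
```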
removeAllGaps()

Removes all the gaps of the sequences in the alignment. This also unlocks all columns

removeAllSeqs()

Clears the entire alignment of sequences

removeGaps(gap_indices)
Parameters:gap_indices (list of list of ints) – Indices of gaps to remove
removeResidues(residues)

Removes the specified residues from the alignment and emits the signals.residuesRemoved signal with the selection

Parameters:residues (list) – The residues to remove
removeSeq(seq)

Remove a sequence from the alignment

Parameters:seq (sequence.Sequence) – The sequence to remove
removeSeqByIndex(index)

Remove a Sequence from the alignment

Parameters:index (int) – The index of the sequence to remove
removeSeqs(seqs)

Remove multiple sequences from the alignment

removeSubalignment(start, end)

Remove a block of the subalignment from the start to end points, including column locks in that region

Parameters:
  • start (int) – The start index of the columns to remove
  • end (int) – The end index of the columns to remove
removeTerminalGaps()

Removes the gaps from the ends of every sequence in the alignment

reorderSequences(seq_indices)

Reorder the sequences in the alignment using the specified list of indices.

In the undoable version of this class, a private function is needed so that the reordering can be performed as a single undoable operation.

Parameters:seq_indices (list of int) – A list with the new indices for sequences
Raises:ValueError – In the event that the list of indices does not match the length of the alignment
replaceResiduesWithGaps(residues)

Replaces the specified residues with gaps

Parameters:residues (list) – A list of residues to replace with gaps
replaceSeq(seq, index)

Replace the sequence at the specified index with the elements in the specified sequence

Note that this leaves the original sequence itself intact so that it continues to be monitored

Parameters:
  • seq (iterable of schrodinger.protein.residue.Residue) – The sequence whose elements we use
  • index (int) – The index of the sequence to replace
replaceSubalignment(aln, start, end)

Replace a subsection of the alignment indicated by start and end indices with the specified alignment

Parameters:
  • aln (BaseAlignment) – The alignment to insert
  • start (int) – The start index of the subsection to replace
  • end (int) – The end index of the subsection to replace
resMatchesReferenceRes(row_index, col_index)

Return True if the residue of a sequence at a column in the alignment matches the reference residue.

Parameters:
  • row_index (int) – Index of the sequence containing the residue to check
  • col_index (int) – Column of the residue to check
Returns:

True if the residue at the specified index matches the reference, False otherwise.

Return type:

bool

seq_annotations

Returns the sequence-level annotations available for sequences held in the alignment

setAllLocks(lock=True)

Convenience method to set all the locks to the specified lock state at once

Parameters:lock (bool) – Whether to lock or unlock the specified columns
setGaps(gap_indices)

Sets gaps on the alignment

Parameters:gap_indices – A list of lists of gap indices, one for each sequence in the alignment.
setLockedColumns(columns, lock=True, reset=False)

Sets the columns to the specified lock state

Parameters:
  • columns (iterable) – an iterable of columns to set, specified by index
  • lock (bool) – Whether to lock or unlock columns
  • reset (bool) – Whether to reset the locks or add to existing ones
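
The interaction of lock and reset can be pictured with ordinary set operations. This sketch assumes that reset=True discards the existing locks before the new state is applied; that reading, and the function name, are assumptions rather than documented behavior.

```python
def set_locked_columns(locked, columns, lock=True, reset=False):
    """Return the new set of locked column indices."""
    # Assumption: reset starts from an empty set instead of the current locks.
    current = set() if reset else set(locked)
    if lock:
        current |= set(columns)
    else:
        current -= set(columns)
    return current


locked = {1, 2}
```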
setReferenceSeq(seq)

Set the specified sequence as the reference sequence.

Parameters:seq (sequence) – Sequence to set as reference sequence
sort(key, reverse=False)

Sort the alignment by the specified criteria.

NOTE: Query sequence is not included in the sort.

Parameters:
  • key (function) – A function that takes a sequence and returns a value to sort by for each sequence.
  • reverse – Whether to sort in reverse (descending) order.
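
A sketch of sorting while leaving one sequence out of the sort. It assumes that the excluded sequence is the first row and that it stays in first position; both are assumptions for illustration, and strings stand in for sequence objects.

```python
def sort_keeping_first(rows, key, reverse=False):
    """Sort all rows except the first, which keeps its position."""
    if not rows:
        return []
    reference, others = rows[0], rows[1:]
    return [reference] + sorted(others, key=key, reverse=reverse)


rows = ["MKV", "AC", "ACDEF", "ACD"]
by_length = sort_keeping_first(rows, key=len)
by_length_desc = sort_keeping_first(rows, key=len, reverse=True)
```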
class schrodinger.protein.alignment.NucleicAcidAlignment(sequences=None)

Bases: schrodinger.protein.alignment.BaseAlignment

class schrodinger.protein.alignment.ProteinAlignment(*args, **kwargs)

Bases: schrodinger.protein.alignment.BaseAlignment

addDisulfideBond(res1, res2)

Add a disulfide bond if both residues’ sequences are in the alignment

Raises:ValueError – if either sequence is not in the alignment

disulfide_bonds
findPattern(pattern)

Finds a specified PROSITE pattern in all sequences.

Parameters:pattern (str) – PROSITE pattern to search in sequences. See protein.sequence.find_generalized_pattern for documentation.
Returns:List of matching residues
Return type:list of protein.residue.Residue
static fromClustalFile(file_name)

Returns alignment read from file in Clustal .aln format preserving order of sequences.

Parameters:file_name (str) – Source file name.
Raises:IOError – If the input file cannot be read.
Return type:ProteinAlignment
Returns:An alignment
Note:The alignment can be empty if no sequence was present in the input file.
static fromFastaFile(file_name)

Returns alignment read from file in FASTA format, preserving order of sequences.

Raises:IOError – If the input file cannot be read.
Return type:ProteinAlignment
Returns:Read alignment. The alignment can be empty if no sequence was present in the input file.
static fromFastaString(lines)

Reads sequences from FASTA-formatted text, creates sequence objects, and appends them to the alignment. Splits the sequence name from the FASTA header.

Parameters:lines (list of str) – list of strings representing FASTA file
Return type:ProteinAlignment
Returns:The alignment
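
The general shape of FASTA parsing can be sketched in plain Python. The version below returns (name, sequence) tuples instead of an alignment, and it takes the first whitespace-delimited token of the header as the name; the library's actual rule for splitting the name from the header is not documented here, so treat that rule as an assumption.

```python
def parse_fasta(lines):
    """Parse FASTA-formatted lines into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].strip()
            # Assumption: the name is the first token of the header.
            name = header.split()[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


records = parse_fasta([">sp|P1 first protein", "ACDE", "FGH", "", ">seq2", "MK~V"])
```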
static fromFastaStringList(strings)

Return an alignment object created from an iterable of sequence strings

Parameters:strings (Iterable of strings) – Sequences as iterable of strings (1D codes)
global_annotations
removeDisulfideBond(res1, res2)

Remove a disulfide bond if both residues’ sequences are in the alignment

Raises:ValueError – if either sequence is not in the alignment

seq_annotations
setHydrophobicityWindowPadding(window_padding)

Sets hydrophobicity window padding value for each sequence in the protein alignment.

Parameters:window_padding (int) – number of values to pad each average with
setIsoelectricPointWindowPadding(window_padding)

Sets isoelectric point window padding value for each sequence in the protein alignment.

Parameters:window_padding (int) – number of values to pad each average with
toClustalFile(file_name, use_unique_names=True)

Writes the alignment to a Clustal alignment file.

Raises:

IOError – If output file cannot be written.

Parameters:
  • file_name (str) – Destination file name.
  • use_unique_names (bool) – If True, write unique name for each sequence.
toFastaFile(file_name, use_unique_names=True, maxl=50)

Write self to specified FASTA file

Raises:IOError – If output file cannot be written.
toFastaString(use_unique_names=True, maxl=50)

Convert ProteinAlignment object to list of sequence strings

Parameters:aln (ProteinAlignment) – Alignment data
toFastaStringList()

Convert self to list of fasta sequence strings

Return type:list
Returns:list of str
class schrodinger.protein.alignment.ResidueSelection(residues, indices)

Bases: tuple

indices

Alias for field number 1

residues

Alias for field number 0