schrodinger.structutils.sort module

A module for sorting structure files by Structure-level property values. The module supports multi-key sorting, ‘block’ sorting, and file merging.

‘sort_criteria’ and ‘intra_block_sort_criteria’ are lists of tuples, where each tuple pairs a ct-level property dataname with an ascending/descending directive for that dataname. If a structure does not have a particular property, it is sorted last (even when sorting in ascending order). This is consistent with Excel and Maestro’s Project Table.
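The missing-last rule can be illustrated without the module itself. The following is a minimal sketch on plain property dictionaries, not the module's implementation. The local ASCENDING and DESCENDING constants, the helper names, and the sample records are assumptions made for illustration (the value -1 for descending in particular is a guess; the real constants live in the sort module).

```python
from functools import cmp_to_key

ASCENDING = 1    # assumed stand-in for sort.ASCENDING
DESCENDING = -1  # assumed stand-in for sort.DESCENDING (value is a guess)


def compare_by_criteria(criteria):
    """Build a comparison function over property dictionaries."""
    def compare(props_a, props_b):
        for dataname, direction in criteria:
            a = props_a.get(dataname)
            b = props_b.get(dataname)
            if a is None and b is None:
                continue          # tie on this key, try the next one
            if a is None:
                return 1          # missing sorts last in either direction
            if b is None:
                return -1
            if a != b:
                result = -1 if a < b else 1
                return -result if direction == DESCENDING else result
        return 0
    return compare


def multi_key_sort(records, criteria):
    """Return records ordered by the (dataname, direction) criteria."""
    return sorted(records, key=cmp_to_key(compare_by_criteria(criteria)))


# Hypothetical records; one has no docking score.
records = [
    {"s_m_title": "lig_b", "r_i_docking_score": -7.2},
    {"s_m_title": "lig_a"},
    {"s_m_title": "lig_a", "r_i_docking_score": -9.1},
    {"s_m_title": "lig_c", "r_i_docking_score": -8.0},
]
ascending = multi_key_sort(records, [("r_i_docking_score", ASCENDING)])
descending = multi_key_sort(records, [("r_i_docking_score", DESCENDING)])
```

In both orderings the record without a docking score ends up at the end of the list.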

‘Block sorting’ is possible by using the auxiliary ‘intra_block_sort_criteria’ sort keys. Block sorting organizes structures into groups by the ‘intra_block_sort_criteria’ set of keys, then orders those groups by their leading member’s ‘sort_criteria’. Put another way, ‘intra_block_sort_criteria’ specifies how to organize structures within a block, and ‘sort_criteria’ specifies how to organize the blocks. If ‘intra_block_sort_criteria’ is None, then a simple multi-key sort is performed using the ‘sort_criteria’. For example, if you have a pose file with multiple poses for each ligand-title, a useful global order is to have all poses with the same title in a contiguous block ordered by Emodel values, and title-blocks ordered by the Glide score of the first member in each title-block.
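The pose-file example above can be sketched on plain property dictionaries. This is an illustration of the idea rather than the module's code, under two stated assumptions: the first intra-block key (the title) is what defines block membership, and every key is sorted ascending. The function names, the property names, and the sample poses are hypothetical.

```python
from itertools import groupby

ASCENDING = 1  # assumed stand-in for sort.ASCENDING


def ascending_key(criteria):
    """Key function for ascending-only criteria; missing values sort last."""
    names = [name for name, _direction in criteria]

    def key(props):
        return tuple((props.get(n) is None, props.get(n)) for n in names)
    return key


def block_sort(records, sort_criteria, intra_block_sort_criteria):
    # Assumption: the first intra-block key defines which block a record is in.
    block_name = intra_block_sort_criteria[0][0]
    # Order records inside blocks; equal block values become contiguous.
    within = sorted(records, key=ascending_key(intra_block_sort_criteria))
    blocks = [list(members)
              for _, members in groupby(within, key=lambda p: p.get(block_name))]
    # Order the blocks by the properties of each block's leading member.
    leader_key = ascending_key(sort_criteria)
    blocks.sort(key=lambda members: leader_key(members[0]))
    return [props for members in blocks for props in members]


# Hypothetical poses: two ligand titles, two poses each.
poses = [
    {"s_m_title": "ligA", "r_i_glide_emodel": -40.0, "r_i_glide_gscore": -6.0},
    {"s_m_title": "ligB", "r_i_glide_emodel": -55.0, "r_i_glide_gscore": -8.5},
    {"s_m_title": "ligA", "r_i_glide_emodel": -60.0, "r_i_glide_gscore": -7.0},
    {"s_m_title": "ligB", "r_i_glide_emodel": -30.0, "r_i_glide_gscore": -9.0},
]
ordered = block_sort(
    poses,
    sort_criteria=[("r_i_glide_gscore", ASCENDING)],
    intra_block_sort_criteria=[("s_m_title", ASCENDING),
                               ("r_i_glide_emodel", ASCENDING)],
)
```

Here the ligB block comes first because its leading member (the pose with the lowest Emodel) has a lower score than the leading member of the ligA block, even though another ligB pose has the lowest score overall.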

Copyright Schrodinger, LLC. All rights reserved

schrodinger.structutils.sort.sort_file(file_name, sort_criteria, out_file_name=None, intra_block_sort_criteria=None, no_split=False)

Sort structure file by the values of ct-level properties within the file.

This is the central API. It chooses a trade-off between disk I/O and memory use based on the size of the file.

Parameters:
  • file_name (str) – Path to file upon which to operate.
  • sort_criteria (list(tuple)) – List of (m2io dataname, module constant) tuples. These are the primary, secondary, …, keys for sorting the structures, or blocks if intra_block_sort_criteria is defined, and optional ascending/descending constants. e.g.: [('s_m_title', sort.ASCENDING), ('r_i_glide_docking_score', sort.ASCENDING)]
  • out_file_name (str) – Output structure file containing the sorted structures. If None (the default), the input file is overwritten with the sorted results.
  • intra_block_sort_criteria (list(tuple)) – Optional list of (m2io dataname, module constant) tuples for block sorting. These are the primary, secondary, …, keys for sorting the structures within blocks, and optional ascending/descending order constants. Default is None, which disables block sorting.
schrodinger.structutils.sort.split_file(file_name, max_count=10000, dir=None)

Splits the structures in file_name into smaller files and returns a list of the generated file names.

Parameters:
  • file_name (str) – Path to the structure file upon which to operate.
  • max_count (int) – Maximum number of structures per sub-file.
  • dir (str) – Path to the directory where the sub-files are written. The default is the runtime current working directory. There needs to be enough space to store effectively a copy of file_name. For really large files, /tmp is not a good location for most hosts.
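The effect of max_count can be pictured with a small chunking sketch. It operates on any iterable rather than on structure files, and the function name is hypothetical; it only shows how a stream is cut into batches of at most max_count items, not how the module writes its sub-files.

```python
from itertools import islice


def split_into_chunks(items, max_count=10000):
    """Yield lists of at most max_count items, preserving input order."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, max_count))
        if not chunk:
            return
        yield chunk


# 25 stand-in items split into batches of at most 10.
chunks = list(split_into_chunks(range(25), max_count=10))
chunk_sizes = [len(chunk) for chunk in chunks]
```

Only the last batch can be smaller than max_count.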
schrodinger.structutils.sort.merge_files(file_list, sort_criteria, out_file_name, remove_file_list=True, sort_file_list=False, dir=None)

Combines pre-ordered structure files by their property values. Input files are assumed to be sorted by default. Optionally the files can be sorted by the sort_criteria prior to merging by setting sort_file_list=True.

Note:

This function is not suited for handling pose viewer files because all receptors will be included in the output. See merge_pv_files.

Parameters:
  • file_list (list) – List of paths for the structure files that will be merged.
  • sort_criteria (list) – List of (m2io dataname, module constant) tuples, which are the primary keys for sorting the structures.
  • out_file_name (string) – Path to the structure output file containing all the merged structures.
  • remove_file_list (boolean) – If True, then the files named in file_list are removed from disk.
  • sort_file_list (boolean) – If True, then prior to merging, sort the files by ‘sort_criteria’. Default is False, assume the file_list members are already sorted.
  • dir – Unused parameter.
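Merging inputs that are already ordered can be done in a single pass, which is why the pre-sorted assumption matters. The sketch below uses the standard library's heapq.merge on lists of property dictionaries. It is an illustration of the technique under simplifying assumptions (one ascending key, no missing values), not the module's implementation, and the function name and sample data are hypothetical.

```python
import heapq


def merge_presorted(record_iters, dataname, output_handle):
    """Append records from pre-sorted iterables to output_handle in order."""
    def key(props):
        return props[dataname]

    # heapq.merge only looks at the head of each input, so every input
    # must already be sorted ascending by the same key.
    for props in heapq.merge(*record_iters, key=key):
        output_handle.append(props)


# Two hypothetical inputs, each already sorted by docking score.
file_a = [
    {"s_m_title": "a1", "r_i_docking_score": -9.5},
    {"s_m_title": "a2", "r_i_docking_score": -6.0},
]
file_b = [
    {"s_m_title": "b1", "r_i_docking_score": -8.0},
    {"s_m_title": "b2", "r_i_docking_score": -7.0},
]
merged = []  # any object with an append() method would do
merge_presorted([iter(file_a), iter(file_b)], "r_i_docking_score", merged)
```

If an input is not actually sorted, a merge of this kind will not raise an error; it will simply produce output that is out of order, which is the situation sort_file_list=True is meant to avoid.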
schrodinger.structutils.sort.merge_pv_files(file_list, sort_criteria, out_file_name)

Combines pre-ordered pose viewer structure files by their property values. Input files are assumed to be ordered. Only the receptor from the first pose viewer file is retained.

Parameters:
  • file_list (list) – List of paths for the pose viewer files that will be merged.
  • sort_criteria (list of tuples) – List of (m2io dataname, module constant) tuples, which are the primary keys for sorting the ligand structures.
  • out_file_name (string) – Path to the structure output file containing all the merged structures.
schrodinger.structutils.sort.merge_st_iters(structure_iters, sort_criteria, output_handle)

Combines pre-ordered structure iterators by their property values.

Parameters:
  • structure_iters (list) – List of iterables that emit structures. Emitted structures can be a full structure, a MaestroText structure, or some other object with a property dictionary.
  • sort_criteria (list) – List of (m2io dataname, module constant) tuples, which are the primary keys for sorting the structures.
  • output_handle (an object with an append() method) – Output stream to which the sorted structures are appended.

schrodinger.structutils.sort.sort_file_in_memory(file_name, sort_criteria, out_file_name=None, intra_block_sort_criteria=None)

Orders the structures in file_name, keeping structures in memory during the sort operation.

Parameters:
  • file_name (str) – Path to file upon which to operate.
  • sort_criteria (List of tuples) – List of (m2io dataname, module constant) tuples, which are the primary keys for sorting the structures and optional sort order constants.
  • out_file_name (str or None) – Output structure file containing the sorted structures. If out_file_name is None then the input file_name is clobbered with the sorted results.
  • intra_block_sort_criteria (List of tuples or None) – List of (m2io dataname, module constant) tuples, which are the properties for sorting the structures within groups, and optional sort order constants.
class schrodinger.structutils.sort.StructureFileSorter(file_name=None, file_index=1, sort_criteria=[('b_glide_receptor', 1), ('r_i_docking_score', 1)], intra_block_sort_criteria=None, keep_structures=False)

Bases: object

A class to sort structure files by ct-level property values.

API Example:

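from schrodinger.structutils import sort

# Sort a pose viewer file with the default sort_criteria,
# starting at file position 2.
glide_sp_pv_sorter = sort.StructureFileSorter(
    file_name='foo_pv.mae',
    file_index=2,
)
glide_sp_pv_sorter.sort()
# Write the first 2 structures from each block.
glide_sp_pv_sorter.writeTopNFromBlock('bar_lib.mae', 2)

# Sort by r_prop_one ascending, then by i_prop_two descending.
st_sorter = sort.StructureFileSorter(
    file_name='baz.mae',
    sort_criteria=[
        ('r_prop_one', sort.ASCENDING),
        ('i_prop_two', sort.DESCENDING),
    ],
)
st_sorter.sort()
# Write all structures in sorted order.
st_sorter.write('baz-sorted.mae')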
Variables:
  • structure_index_order (list) – Sorted structure index order. A list of the original file indexes, in the order they appear when sorted by sort_criteria and intra_block_sort_criteria.
  • structure_dict (dict) – Maps each file index to the ct-level property dictionary of that structure.
  • structure_block_order (list) – Block_ids sorted by ‘sort_criteria’ keys.
  • structure_count (int) – The number of structures in the file.
  • read_forward_quota (int) – Sort in batches, with this chunk size, instead of with random-access. If the value evaluates as True, the input file is read, forward-only, in small chunks that are sorted in memory. Default is 0, use random-access.

An instance is primarily a data structure where the original file positions are keys for the dictionary of properties. It has auxiliary data structures for tracking the sorted order of the original file positions, and methods to write output files with that order.

Using random-access to re-read the structures in the proper order is typically faster than re-reading in batches. However, the read_forward_quota attribute can be set to a positive integer to force batch re-reading and writing.
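The relationship between the property dictionary and the sorted index order can be sketched with plain Python containers. This is an illustration of the data layout described above, not the class itself. The helper name and sample values are hypothetical, and only a single ascending key is handled.

```python
# File position -> property dictionary, as in structure_dict.
structure_dict = {
    1: {"r_i_docking_score": -6.5},
    2: {"r_i_docking_score": -9.0},
    3: {},                            # property missing: sorts last
    4: {"r_i_docking_score": -7.25},
}


def index_order(props_by_index, dataname):
    """Return file indexes ordered ascending by one property."""
    def key(file_index):
        value = props_by_index[file_index].get(dataname)
        return (value is None, value)
    return sorted(props_by_index, key=key)


# Analogous to structure_index_order: original positions in sorted order.
structure_index_order = index_order(structure_dict, "r_i_docking_score")
```

Only the small property dictionaries are reordered; the structures themselves stay on disk until the index order is used to write them out.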

__init__(file_name=None, file_index=1, sort_criteria=[('b_glide_receptor', 1), ('r_i_docking_score', 1)], intra_block_sort_criteria=None, keep_structures=False)

Loads only the structure properties used to sort the file into a dictionary (keyed by file index), but does not do any sorting.

Parameters:
  • file_name (str) – Path to the structure file upon which to operate.
  • file_index (int) – File position at which to start reading file_name.
  • sort_criteria (list(tuple)) – List of m2io datanames and module constant tuples that identify the values for sorting and the sort order.
  • intra_block_sort_criteria (list(tuple) or None) – List of m2io datanames and module constant tuples that identify the values for group sorting, and the sort order. If None, then a simple multi-key sort is performed using the ‘sort_criteria’.
  • keep_structures (bool) – If true then a reference to each structure is kept, keyed by ‘_structure’. The default is False, don’t keep references to the structures.
sort()

Organizes the data structure by self.sort_criteria, and self.intra_block_sort_criteria if it is not None. Assigns attributes for the correct sorted order of the original file positions.

write(out_file_name, index_list=None, dir=None)

Writes structures to disk, no return value.

Parameters:
  • out_file_name (str) – Path to the output structure file.
  • index_list (list) – List of file indexes to write, in the order that they should appear in the output file (typically a slice of self.structure_index_order). If None, then all of self.structure_index_order is written.
  • dir (str) – Path to the directory where the intermediate file is written. The default is the runtime current working directory. There needs to be enough space to store effectively a copy of file_name. For really large files, /tmp is not a good location for most hosts.
writeTopNFromBlock(out_file_name='', max_per_block=1, max_num_block=None)

Write the first max_per_block structures from each block to the output file.

Parameters:
  • out_file_name (str) – Name of the structure file to write.
  • max_per_block (int) – Number of leading members, from each block, to write to out_file_name. Default is 1.
  • max_num_block (int or None) – Number of blocks from which to draw leading members. If None, the top max_per_block structures are written from every block. Otherwise, the top max_per_block structures are written from only the first max_num_block blocks.
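The interaction between max_per_block and max_num_block can be sketched on a list of already-sorted blocks. This is a hedged illustration of the selection rule only; the function name and the sample blocks are hypothetical, and nothing is written to disk.

```python
def top_n_from_blocks(blocks, max_per_block=1, max_num_block=None):
    """Pick leading members from sorted blocks.

    blocks is a list of lists, already in block order, with each
    inner list already in intra-block order.
    """
    selected = []
    # Slicing with None keeps every block.
    for block in blocks[:max_num_block]:
        selected.extend(block[:max_per_block])
    return selected


# Hypothetical blocks of pose labels.
blocks = [
    ["ligB_pose1", "ligB_pose2", "ligB_pose3"],
    ["ligA_pose1", "ligA_pose2"],
    ["ligC_pose1"],
]
best_per_ligand = top_n_from_blocks(blocks)
top_two_of_first_two = top_n_from_blocks(blocks, max_per_block=2, max_num_block=2)
```

A block with fewer than max_per_block members contributes all of its members.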
schrodinger.structutils.sort.main()
schrodinger.structutils.sort.parse_arguments()