schrodinger.application.combinatorial_diversity.driver_utils module

Provides miscellaneous functionality for combinatorial_diversity_driver.py.

Copyright Schrodinger LLC, All Rights Reserved.

schrodinger.application.combinatorial_diversity.driver_utils.add_property_biasing_options(parser)

Adds property biasing options to the provided parser.

Parameters:parser (argparse.ArgumentParser) – Argument parser object.
schrodinger.application.combinatorial_diversity.driver_utils.adjust_min_pop(min_pop, ndiverse, min_diverse_per_chunk, pool_size)

Adjusts the minimum population per chunk, if necessary, to ensure a minimum number of diverse structures per chunk.

Parameters:
  • min_pop (int) – Requested minimum population per chunk.
  • ndiverse (int) – Total number of diverse structures to select.
  • min_diverse_per_chunk (int) – Minimum allowed number of diverse structures per chunk.
  • pool_size (int) – Total number of structures in the pool.
Returns:

The appropriate minimum population.

Return type:

int
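
The adjustment rule itself is not documented here. As a rough illustration only, the sketch below assumes that chunks hold about min_pop structures each and that diverse structures are allocated to chunks in proportion to their population; under those assumptions the minimum population is raised just enough to give every chunk the required number of diverse structures. The function name adjust_min_pop_sketch is hypothetical and is not part of driver_utils.

```python
import math


def adjust_min_pop_sketch(min_pop, ndiverse, min_diverse_per_chunk, pool_size):
    """Hypothetical stand-in for adjust_min_pop; not the library code.

    Assumes ndiverse and pool_size are positive, and that a chunk of
    min_pop structures receives ndiverse * min_pop / pool_size diverse
    structures (proportional allocation).
    """
    expected_per_chunk = ndiverse * min_pop / pool_size
    if expected_per_chunk >= min_diverse_per_chunk:
        # The requested population already satisfies the constraint.
        return min_pop
    # Smallest population that yields min_diverse_per_chunk per chunk,
    # capped at the size of the whole pool.
    needed = math.ceil(min_diverse_per_chunk * pool_size / ndiverse)
    return min(needed, pool_size)
```

With a pool of 1,000,000 structures and 100 diverse structures requested, a chunk of 10,000 would receive only one diverse structure under these assumptions, so a minimum of two per chunk would push the population up to 20,000.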

schrodinger.application.combinatorial_diversity.driver_utils.combine_diverse_structures(subjob_names, outfile)

Combines diverse structures from subjobs to the indicated output file.

Parameters:
  • subjob_names (list(str)) – Subjob names.
  • outfile (str) – Output Maestro, SD, CSV or SMILES file. Diverse structures from subjobs must be in the same format.
schrodinger.application.combinatorial_diversity.driver_utils.detect_property_types(infile, max_rows=1000, sticky_missing=False)

Given a .json, .fp, .csv or .smi input file, this function returns a dictionary mapping property names to property types. It covers every property, excluding SMILES and title, that is present in the file (.fp, .csv) or automatically calculated (.json, .smi). In the case of .fp and .csv, the first max_rows rows are examined to deduce property types.

Parameters:
  • infile (str) – Input file (.json, .fp, .csv or .smi).
  • max_rows (int) – The maximum number of rows to examine.
  • sticky_missing (bool) – If True, a property with any missing values will be assigned a type of PropertyType.MISSING. If False, the property type will be deduced from non-missing values.
Returns:

Dictionary of property name to PropertyType.

Return type:

dict{str: diversity_fingerprinter.PropertyType}
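
The effect of sticky_missing can be illustrated with a small self-contained sketch. It does not use the Schrodinger API: SketchType is a hypothetical stand-in for diversity_fingerprinter.PropertyType (only the MISSING member is named in this documentation; NUMERIC and TEXT are assumptions), and the rule that an empty string means "missing" is also an assumption.

```python
from enum import Enum


class SketchType(Enum):
    """Hypothetical stand-in for diversity_fingerprinter.PropertyType."""
    MISSING = 0
    NUMERIC = 1
    TEXT = 2


def value_type(value):
    """Deduce the apparent type of a single string value."""
    if value.strip() == "":
        return SketchType.MISSING  # assumption: empty means missing
    try:
        float(value)
        return SketchType.NUMERIC
    except ValueError:
        return SketchType.TEXT


def column_type(values, sticky_missing=False):
    """Deduce one type for a whole column of string values."""
    types = [value_type(v) for v in values]
    present = [t for t in types if t is not SketchType.MISSING]
    # With sticky_missing, a single missing value marks the whole column.
    if not present or (sticky_missing and len(present) < len(types)):
        return SketchType.MISSING
    # Any non-numeric value makes the whole column text.
    if SketchType.TEXT in present:
        return SketchType.TEXT
    return SketchType.NUMERIC
```

For the column ["1.5", "", "3"], this sketch yields NUMERIC with sticky_missing=False and MISSING with sticky_missing=True.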

schrodinger.application.combinatorial_diversity.driver_utils.extract_subjob_chunks(subjob_name, infile)

Extracts chunk files from the archive <subjob_name>.zip and returns lists of the fingerprint files, numbers of diverse structures, and fingerprint domains that should be supplied to the DiversitySelector object that will operate on each chunk. One of two behaviors will occur:

  1. If the archive contains .csv files, then each fingerprint file will be infile, and the row numbers in each .csv file will be returned as the fingerprint domains.
  2. If the archive contains .fp files, then those fingerprint file names will be returned, and each fingerprint domain will be None.
Parameters:
  • subjob_name (str) – The subjob name.
  • infile (str) – Input file with source of structures. Ignored unless a fingerprint file was supplied as the input file to the parent job and the -nocopy option was specified.
Returns:

Lists of fingerprint file names, numbers of diverse structures and fingerprint domains.

Return type:

list(str), list(int), list(list(int))

schrodinger.application.combinatorial_diversity.driver_utils.generate_fingerprints(infile, outfile, fptype, want_props=False, hba_file=None, hbd_file=None, logger=None)

Generates Canvas fingerprints and, optionally, a default set of physicochemical properties for the structures in a SMILES or CSV file.

Parameters:
  • infile (str) – Input SMILES or CSV file.
  • outfile (str) – Output fingerprint file.
  • fptype (str) – Fingerprint type (see LEGAL_FP_TYPES).
  • want_props (bool or NoneType) – Whether to generate properties. Should be True only for SMILES input.
  • hba_file (str or NoneType) – File with customized hydrogen bond acceptor rules. Ignored if want_props is False.
  • hbd_file (str or NoneType) – File with customized hydrogen bond donor rules. Ignored if want_props is False.
  • logger (logging.Logger or NoneType) – Logger for warning and info messages.
Raises:

ValueError – If properties are requested for CSV input.

schrodinger.application.combinatorial_diversity.driver_utils.generate_fingerprints_from_csv(csv_file, div_fp, fpout, logger)

Generates fingerprints for the SMILES in a .csv file, and writes the fingerprints, titles, properties (from columns 2 and beyond) and SMILES to an open fingerprint file. Returns the total number of input rows and the total number of fingerprint rows written.

Parameters:
  • csv_file (str) – CSV file name.
  • div_fp (diversity_fingerprinter.DiversityFingerprinter) – Diversity fingerprinter configured to generate only fingerprints.
  • fpout (canvas.ChmCustomOut32) – 32-bit custom fingerprint connection.
  • logger (logging.Logger or NoneType) – Logger for warning and info messages.
Returns:

Tuple of the number of input rows and the number of fingerprints successfully generated and written.

Return type:

int, int

schrodinger.application.combinatorial_diversity.driver_utils.generate_fingerprints_from_smi(smi_file, want_props, div_fp, fpout, logger)

Generates fingerprints and properties for the SMILES in a .smi file, and writes the fingerprints, titles, properties and SMILES to an open fingerprint file. Returns the total number of input rows and the total number of fingerprint rows written.

Parameters:
  • smi_file (str) – SMILES file name.
  • want_props (bool) – Whether properties are being generated.
  • div_fp (diversity_fingerprinter.DiversityFingerprinter) – Diversity fingerprinter configured to generate fingerprints and, if want_props is True, properties.
  • fpout (canvas.ChmCustomOut32) – 32-bit custom fingerprint connection.
  • logger (logging.Logger or NoneType) – Logger for warning and info messages.
Returns:

Tuple of the number of input rows and the number of fingerprints successfully generated and written.

Return type:

int, int

schrodinger.application.combinatorial_diversity.driver_utils.get_available_properties(infile, descriptions=False)

Returns a list of the available properties in the provided input file. If .json or .smi, the properties that are calculated automatically are returned. If .csv, properties in columns 3 and beyond are returned. If .fp, extra data columns other than SMILES are returned.

Parameters:
  • infile (str) – Input file with source of structures.
  • descriptions (bool) – Whether to include descriptions for automatically calculated properties.
Returns:

Property names.

Return type:

list(str)

Raises:

KeyError – If any required columns are missing.

schrodinger.application.combinatorial_diversity.driver_utils.get_distributed_fp_generation_commands(args, nsub)

Returns lists of subjob commands for running distributed fingerprint and property generation.

Parameters:
  • args (argparse.Namespace) – Command line arguments.
  • nsub (int) – Number of subjobs.
Returns:

list of subjob commands.

Return type:

list(list(str))

schrodinger.application.combinatorial_diversity.driver_utils.get_distributed_selection_commands(args, nsub)

Returns lists of subjob commands for running distributed diverse structure selection.

Parameters:
  • args (argparse.Namespace) – Command line arguments.
  • nsub (int) – Number of subjobs.
Returns:

list of subjob commands.

Return type:

list(list(str))

schrodinger.application.combinatorial_diversity.driver_utils.get_infile_type(infile)

Returns the input file type (JSON, FP, CSV, SMI) based on extension, or an empty string if the extension isn’t recognized.

Parameters:infile (str) – Input file with source of structures.
Returns:Input file type or empty string.
Return type:str
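
A minimal sketch of extension-based type detection follows. It is not the library code: the lower-casing of the extension and the exact spelling of the returned labels are assumptions, and infile_type_sketch is a hypothetical name.

```python
import os

# Assumption: recognized extensions map to these upper-case labels.
_EXT_TO_TYPE = {".json": "JSON", ".fp": "FP", ".csv": "CSV", ".smi": "SMI"}


def infile_type_sketch(infile):
    """Return a file type label, or an empty string if unrecognized."""
    ext = os.path.splitext(infile)[1].lower()
    return _EXT_TO_TYPE.get(ext, "")
```

Returning an empty string rather than raising lets the caller test the result for truthiness and produce its own error message.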
schrodinger.application.combinatorial_diversity.driver_utils.get_jobname(args)

Returns an appropriate job name based on args.fsubjob, args.dsubjob, SCHRODINGER_JOBNAME, the job control backend, or the base name of args.infile.

Parameters:args (argparse.Namespace) – Command line arguments
Returns:job name
Return type:str
schrodinger.application.combinatorial_diversity.driver_utils.get_parser()

Creates argparse.ArgumentParser with supported command line options.

Returns:Argument parser object
Return type:argparse.ArgumentParser
schrodinger.application.combinatorial_diversity.driver_utils.get_property_type(value)

Returns the apparent PropertyType of the supplied value.

Parameters:value (str) – The value whose type is to be deduced.
Returns:The apparent type of value.
Return type:diversity_fingerprinter.PropertyType
schrodinger.application.combinatorial_diversity.driver_utils.read_properties(infile, max_rows=1000)

Given a .fp or .csv file, this function returns the list of property names, excluding SMILES and title, followed by the property values for the first max_rows rows.

Parameters:
  • infile (str) – Input .fp or .csv file.
  • max_rows (int) – The maximum number of rows to read.
Returns:

list of property names followed by lists of property values

Return type:

list(str), list(list(str))

Raises:
  • ValueError – If infile is of the wrong type.
  • RuntimeError – If .csv file has inconsistent numbers of values.
schrodinger.application.combinatorial_diversity.driver_utils.read_properties_from_csv_file(infile, max_rows=1000)

Given a .csv file, this function returns the list of property names from columns 2 and beyond, which excludes SMILES and title, followed by the property values for the first max_rows rows.

Parameters:
  • infile (str) – Input .csv file.
  • max_rows (int) – The maximum number of rows to read.
Returns:

list of property names followed by lists of property values

Return type:

list(str), list(list(str))

Raises:

RuntimeError – If .csv file has inconsistent numbers of values.

schrodinger.application.combinatorial_diversity.driver_utils.read_properties_from_fp_file(infile, max_rows=1000)

Given a .fp file, this function returns the list of property names, excluding SMILES and title, followed by the property values for the first max_rows rows.

Parameters:
  • infile (str) – Input .fp file.
  • max_rows (int) – The maximum number of rows to read.
Returns:

list of property names followed by lists of property values

Return type:

list(str), list(list(str))

schrodinger.application.combinatorial_diversity.driver_utils.read_property_filters(filter_file)

Reads property filters from the provided CSV file. The format of each line is: prop_name,min_value,max_value

Parameters:

filter_file (str) – CSV file containing property filters.

Returns:

List of property filters.

Return type:

list(diversity_selector.PropertyFilter)

Raises:
  • RuntimeError – If filter_file is incorrectly formatted.
  • ValueError – If limits are invalid.
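
The prop_name,min_value,max_value format can be parsed along the following lines. This is a sketch under stated assumptions, not the library implementation: FilterSketch is a hypothetical stand-in for diversity_selector.PropertyFilter, the file is assumed to have no header row, blank lines are assumed to be skipped, and "invalid limits" is taken to mean non-numeric values or a minimum that exceeds the maximum. The property names in the usage lines are illustrative only.

```python
import csv
import io
from typing import NamedTuple


class FilterSketch(NamedTuple):
    """Hypothetical stand-in for diversity_selector.PropertyFilter."""
    prop_name: str
    min_value: float
    max_value: float


def read_filters_sketch(lines):
    """Parse an iterable of 'prop_name,min_value,max_value' lines."""
    filters = []
    for row_num, row in enumerate(csv.reader(lines), start=1):
        if not row:
            continue  # assumption: blank lines are ignored
        if len(row) != 3:
            raise RuntimeError(
                f"Line {row_num}: expected 3 fields, found {len(row)}")
        name, low, high = (field.strip() for field in row)
        # float() raises ValueError for non-numeric limits.
        min_value, max_value = float(low), float(high)
        if min_value > max_value:
            raise ValueError(f"Line {row_num}: min exceeds max for {name}")
        filters.append(FilterSketch(name, min_value, max_value))
    return filters


# Illustrative input; the property names are made up for this example.
example = io.StringIO("MW,200,500\nAlogP,-1,5\n")
parsed = read_filters_sketch(example)
```

Separating the format check (RuntimeError) from the limit check (ValueError) mirrors the two exception types documented above.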
schrodinger.application.combinatorial_diversity.driver_utils.split_fingerprints(fp_file, ndiverse, nsub, jobname, inplace=False, min_pop=10000, num_probes=10)

Splits a fingerprint file literally or figuratively into chunks using DiversitySplitter, and places the chunks into a series of zip archives named <jobname>_select_sub_i.zip, where i = 1, 2,…,nsub. Each archive contains one or more chunks to be processed by the associated subjob. Chunk j consists of exactly one of the following two files:

  • <jobname>_chunk_j.fp - Fingerprints in the chunk (if inplace=False)
  • <jobname>_chunk_j.csv - Row numbers in the chunk (if inplace=True)

The value of inplace determines whether fp_file is literally split into smaller fingerprint files, or figuratively split by way of reporting the 0-based row numbers in each chunk.

In addition to the chunk files, <jobname>_select_sub_i.zip contains the file <jobname>_select_sub_i_manifest.csv, which contains an ordered list of the chunk file names and the number of diverse structures to select from each chunk.

Parameters:
  • fp_file (str) – 32-bit Canvas fingerprint file containing SMILES and any properties to be biased.
  • ndiverse (int) – The total number of diverse structures to select. Must be at least twice as large as the number of chunks.
  • nsub (int) – The desired number of subjobs. This would normally be the number of CPUs over which the job is to be distributed, since finer grained processing is already achieved by assigning one or more chunks to each subjob. The actual number of subjobs run may end up being smaller than this value.
  • jobname (str) – Job name. Determines the names of the archives and chunk files that will be created.
  • inplace (bool) – Controls whether to split fp_file into smaller files (inplace=False), or simply write the row numbers of each chunk (inplace=True).
  • min_pop (int) – Suggested minimum number of structures in each chunk. An adjustment is made, as necessary, to ensure the number of diverse structures per chunk is at least MIN_DIVERSE_PER_CHUNK.
  • num_probes (int) – The number of diverse probe structures used to construct the similarity space from which chunks are defined.
Returns:

tuple of the actual number of subjobs and the number of chunks

Return type:

int, int
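
To make the archive layout concrete, the sketch below reads a manifest back out of a subjob archive using only the standard library. The column layout of the manifest (chunk file name, then number of diverse structures, with no header row) is an assumption, as is the helper name read_manifest_sketch; the demonstration archive is built in memory with made-up names and counts.

```python
import csv
import io
import zipfile


def read_manifest_sketch(archive, subjob_name):
    """Return (chunk file names, diverse counts) from a subjob archive.

    archive may be a path or a file-like object. Assumes the manifest is
    named <subjob_name>_manifest.csv and has two columns per row.
    """
    manifest_name = f"{subjob_name}_manifest.csv"
    with zipfile.ZipFile(archive) as zf:
        text = zf.read(manifest_name).decode("utf-8")
    chunk_files, ndiverse = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        chunk_files.append(row[0])
        ndiverse.append(int(row[1]))
    return chunk_files, ndiverse


# Build a small in-memory archive to exercise the reader.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("demo_select_sub_1_manifest.csv",
                "demo_chunk_1.fp,40\ndemo_chunk_2.fp,60\n")
buf.seek(0)
files, counts = read_manifest_sketch(buf, "demo_select_sub_1")
```

Because the manifest is ordered, the two returned lists stay aligned: counts[k] is the number of diverse structures to select from files[k].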

schrodinger.application.combinatorial_diversity.driver_utils.split_structures(struct_file, nsub, jobname)

Splits a SMILES or CSV file into nsub chunks, creating the files <jobname>_fpgen_sub_i.<ext>, where i=1,2,…,nsub and <ext> is “smi” or “csv”. Each chunk will contain a minimum of MIN_FP_PER_SUBJOB structures, so the number of chunks actually created may be less than nsub.

Parameters:
  • struct_file (str) – SMILES or CSV file to be split.
  • nsub (int) – The desired number of subjobs.
  • jobname (str) – Job name. Determines the names of the files that will be created.
Returns:

The actual number of files created. Will be <= nsub.

Return type:

int
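
The relationship between nsub, the per-subjob minimum, and the number of files actually created can be sketched as follows. The value of MIN_FP_PER_SUBJOB is not given here, so it is passed in as the hypothetical parameter min_per_subjob; the even distribution of the remainder across the first chunks is also an assumption.

```python
def plan_structure_split(nstruct, nsub, min_per_subjob):
    """Return the number of structures to place in each chunk.

    Hypothetical helper: uses at most nsub chunks, and fewer when that
    is needed to keep at least min_per_subjob structures per chunk.
    """
    # Largest chunk count that still honors the per-chunk minimum,
    # but never fewer than one chunk.
    nchunks = max(1, min(nsub, nstruct // min_per_subjob))
    base, extra = divmod(nstruct, nchunks)
    # The first `extra` chunks take one additional structure each.
    return [base + (1 if i < extra else 0) for i in range(nchunks)]
```

For example, 10 structures with nsub=4 and a minimum of 3 per chunk gives three chunks of sizes 4, 3 and 3, so the returned count would be 3 rather than 4.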

schrodinger.application.combinatorial_diversity.driver_utils.summarize_property_filters(filter_file)

Generates a string with a summary of the property filters in the provided file.

Parameters:filter_file (str) – CSV file with property filters.
Returns:Summary of property filters.
Return type:str
schrodinger.application.combinatorial_diversity.driver_utils.validate_args(args, startup=False)

Checks the validity of command line arguments.

Parameters:
  • args (argparse.Namespace) – Command line arguments.
  • startup (bool) – Set to True if validating at startup time.
Returns:

tuple of validity and non-empty error message if not valid

Return type:

bool, str

schrodinger.application.combinatorial_diversity.driver_utils.validate_properties(infile, filter_file=None)

Validates the input file and the property filter file to ensure that the required properties are present and numeric.

Parameters:
  • infile (str) – Input file with source of structures.
  • filter_file (str or NoneType) – Property filter file, if any.
Returns:

tuple of validity and non-empty error message if not valid

Return type:

bool, str

schrodinger.application.combinatorial_diversity.driver_utils.write_random_smi_subset(infile, outfile, nsub, rand_seed=1)

Selects a random subset of rows from a .smi file and writes them to another .smi file.

Parameters:
  • infile (str) – Input .smi file.
  • outfile (str) – Output .smi file.
  • nsub (int) – Random subset size.
  • rand_seed (int) – Seed to initialize random number generator.
Raises:

ValueError – If nsub exceeds the number of rows in infile.
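
A seeded random subset of rows can be drawn as in the sketch below, which works on an in-memory list rather than on files. It is not the library implementation: random_subset_sketch is a hypothetical name, and preserving the original row order in the output is an assumption.

```python
import random


def random_subset_sketch(rows, nsub, rand_seed=1):
    """Return nsub rows chosen at random, reproducibly for a given seed."""
    if nsub > len(rows):
        raise ValueError(
            f"Subset size {nsub} exceeds number of rows {len(rows)}")
    # A private generator avoids disturbing the global random state.
    rng = random.Random(rand_seed)
    chosen = sorted(rng.sample(range(len(rows)), nsub))
    # Assumption: selected rows keep their original relative order.
    return [rows[i] for i in chosen]
```

Using a fixed default seed means repeated runs on the same input select the same rows, which is convenient for reproducing a job.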

schrodinger.application.combinatorial_diversity.driver_utils.write_subjob_selections(fp_files, diverse_subset_rows, outfile, gen_coords=False, v3000=False, logger=None)

Reads diverse structures and properties from the supplied fingerprint files and writes them to the indicated output Maestro, SD, CSV or SMILES file.

Parameters:
  • fp_files (list(str)) – Fingerprint file names.
  • diverse_subset_rows (list(list(int))) – Zero-based lists of row numbers for diverse structures in each fingerprint file.
  • outfile (str) – Output file for diverse structures and properties.
  • gen_coords (bool) – Whether to generate 3D coordinates for Maestro or SD output.
  • v3000 (bool) – Whether to write SD file structures in V3000 format.
  • logger (logging.Logger or NoneType) – Logger for warning messages.