schrodinger.application.combinatorial_diversity.diversity_selector module

This module contains the DiversitySelector class, which combines greedy and stochastic approaches in an optimization algorithm that chooses a diverse subset of reactants from a larger pool of reactants.

The objective function to be minimized is the average nearest neighbor similarity within the subset. Minimization is achieved by starting with a random subset of reactants and then repeatedly attempting to replace the subset member exhibiting the highest nearest neighbor similarity with a randomly chosen member of the pool. Replacements that decrease the average nearest neighbor similarity are always made, whereas replacements that increase it are accepted or rejected in accordance with a Monte Carlo test whose probability of being satisfied decreases from 50% to 1% over the course of the optimization.
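
The procedure described above can be illustrated with a small, self-contained sketch. This is not the Schrodinger implementation: the set-based fingerprints, the Tanimoto similarity, the Metropolis-style acceptance rule, and the geometric schedule that moves the acceptance probability from 50% to 1% are all assumptions made for illustration, as are the names select_diverse, p_first and p_last.

```python
import math
import random


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nn_similarities(fps, subset):
    """Highest similarity of each subset member to any other subset member."""
    return [
        max(tanimoto(fps[i], fps[j]) for j in subset if j != i)
        for i in subset
    ]


def average(values):
    return sum(values) / len(values)


def select_diverse(fps, num_diverse, opt_cycles=10, mc_tol=0.001,
                   rand_seed=1, p_first=0.5, p_last=0.01):
    """Toy version of the greedy, stochastic subset optimization."""
    rng = random.Random(rand_seed)
    subset = rng.sample(range(len(fps)), num_diverse)  # random starting subset
    current = average(nn_similarities(fps, subset))
    for cycle in range(opt_cycles):
        # Assumed schedule: the probability of accepting an increase of
        # exactly mc_tol falls geometrically from p_first to p_last.
        frac = cycle / max(opt_cycles - 1, 1)
        p_accept = p_first * (p_last / p_first) ** frac
        temperature = -mc_tol / math.log(p_accept)
        for _ in range(num_diverse):  # one cycle = N replacement attempts
            nn = nn_similarities(fps, subset)
            worst = nn.index(max(nn))  # member with highest NN similarity
            outside = [i for i in range(len(fps)) if i not in subset]
            trial = subset.copy()
            trial[worst] = rng.choice(outside)  # random pool member
            new = average(nn_similarities(fps, trial))
            delta = new - current
            # Always accept improvements; otherwise apply the Monte Carlo test.
            if delta < 0 or rng.random() < math.exp(-delta / temperature):
                subset, current = trial, new
    return sorted(subset), current


# Hypothetical pool: 40 random fingerprints with 12 on-bits out of 64.
pool_rng = random.Random(0)
pool = [frozenset(pool_rng.sample(range(64), 12)) for _ in range(40)]
rows, avg_sim = select_diverse(pool, 8)
```

Because uphill moves are sometimes accepted, the final average is not guaranteed to be lower than every intermediate value; the sketch simply returns the state reached after the last cycle.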

For further details, see: Bioorg. Med. Chem. 2012, 20, 5379–5387.

Copyright Schrodinger LLC, All Rights Reserved.

class schrodinger.application.combinatorial_diversity.diversity_selector.DiversitySelector(fp_file, opt_cycles=10, convrg_tol=0.001, convrg_cycles=3, mc_tol=0.001, rand_seed=1, logger=None)

Bases: object

Given a file of dendritic fingerprints for a set of reactants, this class employs a greedy, stochastic algorithm to select a diverse subset of reactants.

__init__(fp_file, opt_cycles=10, convrg_tol=0.001, convrg_cycles=3, mc_tol=0.001, rand_seed=1, logger=None)

Constructor taking the name of a reactant fingerprint file and options for minimizing nearest neighbor similarities.

Parameters:
  • fp_file (str) – Input file of reactant fingerprints.
  • opt_cycles (int) – Maximum number of optimization cycles. For a subset of N reactants, an optimization cycle consists of N passes, each of which involves an attempt to replace the reactant with the highest nearest neighbor similarity.
  • convrg_tol (float) – Convergence tolerance on the absolute change in average nearest neighbor similarity. If the change is less than this value, the convergence tolerance is satisfied.
  • convrg_cycles (int) – Number of consecutive cycles over which the convergence tolerance must be satisfied in order to halt the optimization.
  • mc_tol (float) – Monte Carlo criterion. An increase of mc_tol in average nearest neighbor similarity will be accepted with a probability of 50% in the first cycle and 1% in the last cycle.
  • rand_seed (int) – Random seed for initial subset selection and Monte Carlo tests.
  • logger (logging.Logger) – Logger for output of INFO level progress messages. Feedback can be helpful when large subsets are selected, as a given optimization cycle may take minutes or longer if the subset is significantly larger than 1000.
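
One way to read convrg_tol and convrg_cycles together is as a stopping rule applied to the per-cycle history of the average nearest neighbor similarity. The sketch below is an interpretation of the parameter descriptions, not the class's actual code; the function name cycles_until_converged and the sample histories are hypothetical.

```python
def cycles_until_converged(avg_sims, convrg_tol=0.001, convrg_cycles=3):
    """Return the cycle number at which optimization would halt, or None.

    avg_sims[0] is the average nearest neighbor similarity of the starting
    subset and avg_sims[k] is the value after optimization cycle k.
    """
    streak = 0  # consecutive cycles that satisfied the tolerance
    for cycle in range(1, len(avg_sims)):
        change = abs(avg_sims[cycle] - avg_sims[cycle - 1])
        streak = streak + 1 if change < convrg_tol else 0
        if streak >= convrg_cycles:
            return cycle
    return None


# Changes of 0.05, 0.0005, 0.0004 and 0.0001: the last three satisfy the tolerance.
steady = cycles_until_converged([0.300, 0.250, 0.2495, 0.2491, 0.2490])

# A change of 0.0095 in the third cycle resets the streak.
interrupted = cycles_until_converged([0.300, 0.250, 0.2495, 0.2400, 0.2399])
```

Under this reading, a single cycle with a larger change resets the count, so the tolerance has to hold for convrg_cycles cycles in a row.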

select(num_diverse)

Selects the indicated number of diverse reactants and stores the subset data in the following member variables:

  • self.subset_rows – 0-based reactant row numbers
  • self.subset_titles – Reactant titles
  • self.subset_smiles – Reactant SMILES

Parameters:
  • num_diverse (int) – Desired number of diverse reactants. Note that computational effort scales quadratically with this number, and values significantly larger than 1000 may lead to very long run times.
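
A minimal usage sketch, based only on the constructor and select() signatures documented above. The helper names select_reactants and subset_records are hypothetical, and the sketch assumes that select() returns nothing useful and that results are read from the three member variables.

```python
def subset_records(selector):
    """Pair up the row number, title and SMILES of each selected reactant."""
    return list(zip(selector.subset_rows,
                    selector.subset_titles,
                    selector.subset_smiles))


def select_reactants(fp_file, num_diverse, **options):
    """Run a selection and return (row, title, SMILES) tuples.

    Keyword options (opt_cycles, convrg_tol, rand_seed, logger, ...) are
    passed through to the DiversitySelector constructor unchanged.
    """
    # Imported here so the helper above can be used without a Schrodinger
    # installation.
    from schrodinger.application.combinatorial_diversity import diversity_selector

    selector = diversity_selector.DiversitySelector(fp_file, **options)
    selector.select(num_diverse)
    return subset_records(selector)
```

With a Schrodinger installation available, a call such as select_reactants("reactants.fp", 100, rand_seed=7) would return 100 tuples; the file name here is a placeholder for a fingerprint file of your own.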