schrodinger.application.combinatorial_diversity.diversity_selector module
This module contains the DiversitySelector class, which combines greedy and stochastic approaches in an optimization algorithm that chooses a diverse subset of reactants from a larger pool of reactants.
The objective function to be minimized is the average nearest neighbor similarity within the subset. Minimization is achieved by starting with a random subset of reactants and then repeatedly attempting to replace the subset member exhibiting the highest nearest neighbor similarity with a randomly chosen member of the pool. Replacements that decrease the average nearest neighbor similarity are always made, whereas replacements that increase it are accepted or rejected in accordance with a Monte Carlo test whose probability of being satisfied decreases from 50% to 1% over the course of the optimization.
For further details, see: Bioorg. Med. Chem. 2012 20, 5379–5387.
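The optimization loop described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the module's implementation: fingerprints are modeled as sets of on-bits, similarity is taken to be Tanimoto, and the Monte Carlo test is assumed to be a Metropolis-style criterion whose acceptance probability for an increase of exactly mc_tol falls geometrically from 50% to 1% across cycles. The function names (tanimoto, nn_similarities, select_diverse) are hypothetical; the keyword names mirror the constructor options documented below.

```python
import math
import random


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nn_similarities(fps, subset):
    """Nearest neighbor similarity of each subset member within the subset."""
    return [max(tanimoto(fps[i], fps[j]) for j in subset if j != i)
            for i in subset]


def select_diverse(fps, num_diverse, opt_cycles=10, mc_tol=0.001, rand_seed=1):
    """Return (subset_rows, average_nn_similarity) for a pool of fingerprints."""
    if not 2 <= num_diverse <= len(fps):
        raise ValueError("num_diverse must be between 2 and the pool size")
    rng = random.Random(rand_seed)
    subset = rng.sample(range(len(fps)), num_diverse)  # random starting subset
    score = sum(nn_similarities(fps, subset)) / num_diverse

    for cycle in range(opt_cycles):
        # Assumed schedule: the probability of accepting an increase of exactly
        # mc_tol falls geometrically from 0.5 (first cycle) to 0.01 (last).
        frac = cycle / max(opt_cycles - 1, 1)
        p_ref = 0.5 * (0.01 / 0.5) ** frac
        temperature = -mc_tol / math.log(p_ref)

        for _ in range(num_diverse):  # N passes per cycle
            chosen = set(subset)
            outside = [i for i in range(len(fps)) if i not in chosen]
            if not outside:
                return subset, score  # nothing left to swap in
            nn = nn_similarities(fps, subset)
            worst = nn.index(max(nn))  # greedy: most redundant member
            trial = list(subset)
            trial[worst] = rng.choice(outside)  # stochastic: random replacement
            trial_score = sum(nn_similarities(fps, trial)) / num_diverse
            delta = trial_score - score
            # Decreases are always accepted; increases must pass the
            # (assumed) Metropolis-style test.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                subset, score = trial, trial_score
    return subset, score


# Synthetic pool of 50 fingerprints, each with 12 of 64 bits set.
pool = [set(random.Random(k).sample(range(64), 12)) for k in range(50)]
rows, score = select_diverse(pool, num_diverse=8)
print(sorted(rows), round(score, 3))
```

The sketch recomputes every pairwise similarity on each pass, which keeps the code short but is the naive approach; it is meant to show the accept/reject logic rather than to be efficient.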
Copyright Schrodinger LLC, All Rights Reserved.
class schrodinger.application.combinatorial_diversity.diversity_selector.DiversitySelector(fp_file, opt_cycles=10, convrg_tol=0.001, convrg_cycles=3, mc_tol=0.001, rand_seed=1, logger=None)
Bases: object
Given a file of dendritic fingerprints for a set of reactants, this class employs a greedy, stochastic algorithm to select a diverse subset of reactants.
__init__(fp_file, opt_cycles=10, convrg_tol=0.001, convrg_cycles=3, mc_tol=0.001, rand_seed=1, logger=None)
Constructor taking the name of a reactant fingerprint file and options for minimizing nearest neighbor similarities.
Parameters:
- fp_file (str) – Input file of reactant fingerprints.
- opt_cycles (int) – Maximum number of optimization cycles. For a subset of N reactants, an optimization cycle consists of N passes, each of which involves an attempt to replace the reactant with the highest nearest neighbor similarity.
- convrg_tol (float) – Convergence tolerance on the absolute change in average nearest neighbor similarity. If the change is less than this value, the convergence tolerance is satisfied.
- convrg_cycles (int) – Number of consecutive cycles over which the convergence tolerance must be satisfied in order to halt the optimization.
- mc_tol (float) – Monte Carlo criterion. An increase of mc_tol in average nearest neighbor similarity will be accepted with a probability of 50% in the first cycle and 1% in the last cycle.
- rand_seed (int) – Random seed for initial subset selection and Monte Carlo tests.
- logger (logging.Logger) – Logger for output of INFO level progress messages. Feedback can be helpful when large subsets are selected, as a single optimization cycle may take minutes or longer if the subset contains significantly more than 1000 reactants.
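The interplay of convrg_tol and convrg_cycles can be illustrated with a small helper. This is a sketch of the stopping rule as described in the parameter list, not code from the module; the function name halting_cycle and the convention that scores[0] holds the average nearest neighbor similarity of the initial random subset are assumptions made for the example.

```python
def halting_cycle(scores, convrg_tol=0.001, convrg_cycles=3):
    """Return the cycle number at which optimization would halt, or None.

    scores[0] is the average nearest neighbor similarity of the starting
    subset and scores[k] is the value after optimization cycle k.
    """
    consecutive = 0
    for k in range(1, len(scores)):
        if abs(scores[k] - scores[k - 1]) < convrg_tol:
            consecutive += 1  # tolerance satisfied for this cycle
            if consecutive >= convrg_cycles:
                return k
        else:
            consecutive = 0  # streak broken; start counting again
    return None


# Changes per cycle: 0.10, 0.05, 0.0005, 0.0003, 0.0002
history = [0.60, 0.50, 0.45, 0.4495, 0.4492, 0.4490]
print(halting_cycle(history))  # 5
```

With the default settings, a single quiet cycle is not enough to stop: the change must stay below the tolerance for three cycles in a row, and any larger change resets the count.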
select(num_diverse)
Selects the indicated number of diverse reactants and stores the subset data in the following member variables:
- self.subset_rows - 0-based reactant row numbers
- self.subset_titles - Reactant titles
- self.subset_smiles - Reactant SMILES
Parameters:
- num_diverse (int) – Desired number of diverse reactants. Note that computational effort scales quadratically with this number, and values significantly larger than 1000 may lead to very long run times.
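A typical call sequence, based only on the signatures documented here, might look like the following. The helper pick_diverse_reactants, the logger name, and the file name "reactants.fp" are hypothetical, and the sketch assumes the three subset_* member variables are parallel sequences of equal length. The selector class is passed in as an argument so that the helper itself does not require a Schrodinger installation to import.

```python
import logging


def pick_diverse_reactants(selector_cls, fp_file, num_diverse, rand_seed=1):
    """Run a selection and return (row, title, SMILES) tuples.

    selector_cls is expected to behave like DiversitySelector: the same
    constructor keywords, a select() method, and the subset_* attributes.
    """
    logger = logging.getLogger("diversity_example")  # hypothetical logger name
    selector = selector_cls(fp_file, rand_seed=rand_seed, logger=logger)
    selector.select(num_diverse)
    # Assumes the three member variables are parallel sequences.
    return list(zip(selector.subset_rows,
                    selector.subset_titles,
                    selector.subset_smiles))


# Intended use inside a Schrodinger Python environment
# ("reactants.fp" is a placeholder file name):
#
# from schrodinger.application.combinatorial_diversity.diversity_selector import DiversitySelector
# logging.basicConfig(level=logging.INFO)
# for row, title, smiles in pick_diverse_reactants(DiversitySelector, "reactants.fp", 100):
#     print(row, title, smiles)
```

Enabling INFO level logging is worthwhile for large values of num_diverse, since the progress messages are the only feedback during long optimization cycles.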