schrodinger.application.combinatorial_diversity.splitter_utils module

This module provides functionality for splitting a large data set into smaller chunks for scalable diversity selection via DiversitySelector.

Copyright Schrodinger LLC, All Rights Reserved.

schrodinger.application.combinatorial_diversity.splitter_utils.compute_factor_scores(xcols, evectors)

Given N columns of autoscaled X variables and the N eigenvectors obtained from PCA of those X variables, this function computes the score on each eigenvector for each row of X values.

Parameters
  • evectors (numpy.ndarray) – Eigenvectors from a PCA analysis. The jth eigenvector is stored in evectors[:, j].

  • xcols (numpy.ndarray) – Columns of autoscaled X variables. The jth column is stored in xcols[j].

Returns

N PCA scores for each row in xcols. The shape of the returned array is (xcols.shape[1], xcols.shape[0]), i.e., the shape of the transpose of xcols.

Return type

numpy.ndarray
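The projection described above can be illustrated with plain NumPy. This is a minimal sketch, not the module's implementation: the name factor_scores_sketch is hypothetical, and it assumes that a score is the dot product of a row of X values with an eigenvector.

```python
import numpy as np


def factor_scores_sketch(xcols, evectors):
    """Project each row of X values onto each eigenvector (hypothetical helper)."""
    x = np.asarray(xcols, dtype=float)     # shape (N, n_rows); xcols[j] is column j
    v = np.asarray(evectors, dtype=float)  # shape (N, N); evectors[:, j] is eigenvector j
    return x.T @ v                         # shape (n_rows, N)


# Two X variables observed on three rows.
xcols = np.array([[1.0, -1.0, 0.0],
                  [0.5, 0.5, -1.0]])
# With identity eigenvectors the scores are simply the transposed X values.
evectors = np.eye(2)
scores = factor_scores_sketch(xcols, evectors)
```

The result has one row per row of X values and one column per eigenvector, which matches the transposed shape stated above.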

schrodinger.application.combinatorial_diversity.splitter_utils.compute_sim_to_probes(fp_file, probe_rows)

Given a 32-bit fingerprint file and the 0-based row numbers for N diverse probe structures, this function computes columns of autoscaled Tanimoto similarities between the probes and all fingerprints in the file.

Parameters
  • fp_file (str) – Input file of 32-bit fingerprints.

  • probe_rows (list(int)) – List of 0-based fingerprint row numbers for N diverse probe structures.

Returns

N columns of autoscaled similarities.

Return type

numpy.ndarray
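To make the terms concrete, the sketch below computes Tanimoto similarities to probes and autoscales each column. It is illustrative only and makes several assumptions: fingerprints are held in memory as Python sets of "on" bits rather than read from a fingerprint file, autoscaling means subtracting the column mean and dividing by the population standard deviation, and the names tanimoto and autoscaled_sim_columns are hypothetical.

```python
import numpy as np


def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def autoscaled_sim_columns(fps, probe_rows):
    """Return one autoscaled similarity column per probe; result[j] is column j."""
    cols = np.array([[tanimoto(fps[p], fp) for fp in fps] for p in probe_rows])
    mean = cols.mean(axis=1, keepdims=True)
    std = cols.std(axis=1, keepdims=True)  # population std; an assumption
    return (cols - mean) / std


# Four toy fingerprints; rows 0 and 2 act as the probes.
fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {1, 7}]
sim_cols = autoscaled_sim_columns(fps, probe_rows=[0, 2])
```

Each resulting column has zero mean and unit variance. A column with constant similarities would have zero variance and would need special handling, which this sketch omits.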

schrodinger.application.combinatorial_diversity.splitter_utils.create_sim_cormat(sim_cols)

Given N columns of autoscaled similarities, this function creates an NxN matrix of Pearson correlations among those columns.

Parameters

sim_cols (numpy.ndarray) – N columns of autoscaled similarities.

Returns

NxN correlation matrix.

Return type

numpy.ndarray
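A Pearson correlation matrix of this kind can be reproduced with numpy.corrcoef, which treats each row of its input as one variable. The sketch below is not the module's implementation: it assumes sim_cols[j] holds the jth column, and the name cormat_sketch is hypothetical.

```python
import numpy as np


def cormat_sketch(sim_cols):
    """NxN Pearson correlations among N columns stored as sim_cols[j]."""
    return np.corrcoef(np.asarray(sim_cols, dtype=float))


sim_cols = np.array([[1.0, 2.0, 3.0, 4.0],
                     [2.0, 4.0, 6.0, 8.5],
                     [4.0, 3.0, 2.0, 1.0]])
cormat = cormat_sketch(sim_cols)
```

The matrix is symmetric with a unit diagonal. In this toy input the first and third columns are perfectly anticorrelated, so cormat[0, 2] is -1.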

schrodinger.application.combinatorial_diversity.splitter_utils.diagonalize_symmat(symmat)

Diagonalizes a real, symmetric matrix and returns the eigenvalues and eigenvectors sorted by decreasing eigenvalue.

Parameters

symmat (numpy.ndarray) – Real, symmetric matrix. Not modified.

Returns

Reverse-sorted eigenvalues, followed by eigenvectors. The jth eigenvector is stored in the column slice [:, j] of the returned numpy.ndarray.

Return type

numpy.float64, numpy.ndarray
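The same ordering can be obtained with numpy.linalg.eigh, which returns eigenvalues in ascending order, followed by a reversal. This is a sketch under that assumption; the document does not state which routine the module uses, and the name diagonalize_sketch is hypothetical.

```python
import numpy as np


def diagonalize_sketch(symmat):
    """Eigenvalues in decreasing order, with eigenvectors as matching columns."""
    evalues, evectors = np.linalg.eigh(symmat)  # ascending eigenvalues
    order = np.argsort(evalues)[::-1]           # indices for decreasing order
    return evalues[order], evectors[:, order]


symmat = np.array([[2.0, 1.0],
                   [1.0, 2.0]])
evalues, evectors = diagonalize_sketch(symmat)
```

For this matrix the eigenvalues are 3 and 1, and evectors[:, 0] is the eigenvector belonging to the larger one. The input matrix is left unmodified.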

schrodinger.application.combinatorial_diversity.splitter_utils.get_all_orthant_strings(ndim)

Yields all possible orthant strings for the given number of dimensions. For example, if ndim = 2, this function would yield the 2-dimensional orthant strings '++', '+-', '-+', '--'. These correspond to the usual 4 quadrants in xy space.

Parameters

ndim (int) – Number of dimensions.

Yield

All possible orthant strings of length ndim.

Ytype

str
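An equivalent enumeration can be written with itertools.product. This is a sketch rather than the module's code: the name all_orthant_strings_sketch is hypothetical, and the order in which the real function yields its strings is not stated beyond the ndim = 2 example.

```python
from itertools import product


def all_orthant_strings_sketch(ndim):
    """Yield every string of length ndim over the characters '+' and '-'."""
    for signs in product("+-", repeat=ndim):
        yield "".join(signs)


quadrants = list(all_orthant_strings_sketch(2))
octants = list(all_orthant_strings_sketch(3))
```

There are 2**ndim orthants, so the number of strings doubles with each added dimension.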

schrodinger.application.combinatorial_diversity.splitter_utils.get_orthant_strings(scores, ndim)

Given PCA factor scores over the full set of eigenvectors and a desired number of dimensions in that factor space, this function yields strings of '+' and '-' characters that indicate the orthant in which each row of scores resides. A '+' is assigned if the score is >= 0, and a '-' is assigned if the score is < 0.

For example, if a given row consists of the following scores on 8 factors:

[1.3289, -0.2439, -2.1774, 0.8391, 1.4632, -0.6268, 1.2238, -1.7802]

and ndim = 4, the orthant string would be '+--+'.

Parameters
  • scores (numpy.ndarray) – PCA factor scores (see compute_factor_scores).

  • ndim (int) – Number of factors to consider. This determines the number of characters in each orthant string.

Yield

Orthant string for each row in scores.

Ytype

str
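The sign rule above can be sketched in a few lines. The name orthant_strings_sketch is hypothetical, and the sketch assumes each row of scores is an ordinary sequence of floats.

```python
def orthant_strings_sketch(scores, ndim):
    """Yield one orthant string per row, using only the first ndim scores."""
    for row in scores:
        yield "".join("+" if s >= 0 else "-" for s in row[:ndim])


row = [1.3289, -0.2439, -2.1774, 0.8391, 1.4632, -0.6268, 1.2238, -1.7802]
orthants = list(orthant_strings_sketch([row], ndim=4))
zero_case = list(orthant_strings_sketch([[0.0, -0.0001]], ndim=2))
```

The first call reproduces the example row above. The second shows that a score of exactly 0 is assigned '+'.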

schrodinger.application.combinatorial_diversity.splitter_utils.partition_scores(scores, min_pop)

Given PCA factor scores over the full set of eigenvectors and a minimum required population, this function partitions the scores into distinct orthant pairs of nearly equal population, where the smallest population is guaranteed to be at least min_pop. This is achieved by making a series of calls to get_orthant_strings with progressively larger values of ndim, grouping the scores by orthant string, sorting by population size and then combining the highest and lowest populations, the 2nd highest and 2nd lowest populations, etc. These combined populations decrease as ndim is increased, and the largest value of ndim which allows min_pop to be satisfied is used.

For example:

  1. Suppose ndim = 4 is the largest dimension that satisfies min_pop.
  2. Suppose a given row of scores yields the orthant string '-+-+'.
  3. Suppose orthant '-+-+' is combined with orthant '+--+'.
  4. That row of scores would be assigned to orthant pair '+--+|-+-+'.

Parameters
  • scores (numpy.ndarray) – PCA factor scores (see compute_factor_scores).

  • min_pop (int) – Minimum required population of any orthant pair.

Returns

Dictionary of orthant pair -> list of 0-based row numbers.

Return type

dict{str: list(int)}
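The pairing procedure can be sketched as follows. This is one reading of the description, not the module's implementation, and it rests on several assumptions: empty orthants take part in the pairing, the larger orthant is written first in the pair key, ties in population keep enumeration order, and None is returned when even ndim = 1 cannot satisfy min_pop. All function names are hypothetical.

```python
from itertools import product


def orthant_of(row, ndim):
    """Orthant string for the first ndim scores of one row."""
    return "".join("+" if s >= 0 else "-" for s in row[:ndim])


def pair_orthants(scores, ndim):
    """Group rows by orthant, then pair the largest with the smallest, and so on."""
    groups = {"".join(signs): [] for signs in product("+-", repeat=ndim)}
    for i, row in enumerate(scores):
        groups[orthant_of(row, ndim)].append(i)
    ranked = sorted(groups, key=lambda k: len(groups[k]), reverse=True)
    half = len(ranked) // 2
    pairs = {}
    for big, small in zip(ranked[:half], reversed(ranked[half:])):
        pairs[f"{big}|{small}"] = sorted(groups[big] + groups[small])
    return pairs


def partition_sketch(scores, min_pop):
    """Use the largest ndim whose smallest orthant pair still has min_pop rows."""
    best = None
    for ndim in range(1, len(scores[0]) + 1):
        pairs = pair_orthants(scores, ndim)
        if min(len(rows) for rows in pairs.values()) < min_pop:
            break  # populations only shrink as ndim grows
        best = pairs
    return best


scores = [[1, 1], [1, -1], [-1, 1], [-1, -1],
          [2, 2], [2, -2], [-2, 2], [-2, -2]]
fine = partition_sketch(scores, min_pop=4)
coarse = partition_sketch(scores, min_pop=5)
```

With min_pop = 4 the sketch reaches ndim = 2 and produces two orthant pairs of four rows each. With min_pop = 5 it falls back to ndim = 1, where all eight rows share a single pair.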

schrodinger.application.combinatorial_diversity.splitter_utils.select_probes(fp_file, num_probes, rand_seed)

Selects the requested number of diverse probe structures from the provided 32-bit fingerprint file and returns the corresponding 0-based fingerprint row numbers.

Parameters
  • fp_file (str) – Input file of 32-bit fingerprints and SMILES.

  • num_probes (int) – Number of diverse probe structures.

  • rand_seed (int) – Random seed for underlying diversity algorithm.

Returns

List of 0-based row numbers for diverse probe structures.

Return type

list(int)