schrodinger.analysis.enrichment module

A module for calculating Enrichment Factors and reporting on the effectiveness of a ligand database screen seeded with known actives.

The module takes information about a virtual screen outcome and calculates metrics that are commonly used to judge a screen’s ability to rank known actives. The metrics include the Receiver Operator Characteristic area under the curve (ROC), Enrichment Factors, and Robust Initial Enhancement. By default a report containing a suite of metrics is written to standard output. The module can also create basic Sensitivity vs. 1-Specificity plots. See cmdline_doc for details on which metrics are calculated. See the class documentation and __main__ for API examples.

For most screen result input formats, titles are used to identify the ligands and the input is expected to be correctly ordered. If the file contains duplicate titles then only the first occurrence of a unique title is ranked. Careful consideration must be given when Glide results contain multiple entries with the same title to ensure they are properly ordered; ordering by glide score is typically not sufficient. For example, after saving multiple poses per input ligand from a Glide SP and HTVS docking experiment, the poses of the same chemical species should be ordered by emodel, and different species by gscore. schrodinger.structutils.sort.py has tools to facilitate proper ordering of Glide structure files.
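The ordering rule above can be illustrated with a minimal, pure-Python sketch. It assumes each pose is a hypothetical (title, gscore, emodel) tuple and that lower (more negative) values are better for both scores. The helper order_poses is illustrative only and is not part of schrodinger.structutils.sort.

```python
from collections import OrderedDict


def order_poses(poses):
    """Order poses so the first occurrence of each title is its best pose.

    poses: iterable of hypothetical (title, gscore, emodel) tuples.
    Assumes lower (more negative) gscore and emodel values are better.
    """
    by_title = OrderedDict()
    for pose in poses:
        by_title.setdefault(pose[0], []).append(pose)

    # Within one chemical species, order the poses by emodel.
    for group in by_title.values():
        group.sort(key=lambda p: p[2])

    # Across species, order by the gscore of each species' best-emodel pose.
    groups = sorted(by_title.values(), key=lambda g: g[0][1])
    return [pose for group in groups for pose in group]


poses = [
    ("lig_a", -7.1, -40.0),
    ("lig_b", -8.3, -55.0),
    ("lig_a", -7.5, -35.0),
    ("lig_b", -8.0, -60.0),
]
ordered = order_poses(poses)
```

Because only the first occurrence of a title is ranked, the pose that leads each group is the one that determines the ligand's rank.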

suite2011 changes: enrichment.Calculator When parsing text active files, leading whitespace is now honored such that ‘ligand1’ and ‘ ligand1’ are two distinct identifiers. However, trailing whitespace is stripped when reading titles from a structure file, which may cause problems matching values and recognizing actives.

enrichment.Calculator Canvas linear fingerprints are now calculated with the appropriate default atom/bond type. As a consequence, some DEF metrics may change slightly.

Copyright Schrodinger, LLC. All rights reserved.

class schrodinger.analysis.enrichment.ActiveDecoyFingerprintAnalyzer(results, input_active_ids, fp_gen=None, fp_sim=None, id_prop='s_m_title')

Convert a structure file of screen results (or inputs) into active and decoy fingerprints and report similarity metrics between the fingerprint sets.

API example:

actdecoyfpanalyzer = ActiveDecoyFingerprintAnalyzer(
    'docking_results_pv.mae',
    ['active1', 'active2', 'active3']
)
actdecoyfpanalyzer.analyzeFingerprints()
actdecoyfpanalyzer.writeCsv('analysis.csv')
Variables:
  • decoy_ids – All the unique non-active ids (titles) observed in the results.
  • active_ids – All the unique active ids (titles) observed in the results.
  • input_active_ids – The full set of active ids, regardless of whether they appeared in the result file or not.
  • active_fps – Sequence of active fingerprints.
  • decoy_fps – Sequence of decoy fingerprints.
analyzeFingerprints()

Create fingerprints from the self.results structure files, classifying each structure as a member of the actives or the decoys group based on its id.

getMaxSimDecoyByActiveId(active_id)
Returns:(similarity, id) tuple for the decoy that is most similar to the active_id
Return type:tuple
Parameters:active_id (string) – Identifier of the active to search against.
Raise:ValueError if active_id is not in the list of known actives.
getMeanMaxActiveActiveSim()
Returns:mean of the maximum Active-Active similarity scores.
getMeanMaxActiveDecoySim()
Returns:mean of the maximum Active-Decoy similarity scores.
getMeanMaxDecoyDecoySim()
Returns:mean of the maximum Decoy-Decoy similarity scores.
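As a rough illustration of a "mean of the maximum similarity" metric, the sketch below takes, for each active, the highest similarity to any decoy and then averages those maxima. The dictionary layout and the helper name mean_max_similarity are assumptions made for this example; the scores are placeholders, not real data.

```python
def mean_max_similarity(sim_rows):
    """Mean of the per-row maximum similarity.

    sim_rows: hypothetical dict mapping each id to a list of similarity
    scores against every member of the other group.
    """
    maxima = [max(scores) for scores in sim_rows.values() if scores]
    if not maxima:
        return None
    return sum(maxima) / float(len(maxima))


# Placeholder active-vs-decoy similarity scores.
active_vs_decoy = {
    "active1": [0.10, 0.35, 0.20],
    "active2": [0.50, 0.05, 0.15],
}
mean_max = mean_max_similarity(active_vs_decoy)
```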
getMeanSimDecoyByActiveId(active_id)
Returns:The mean similarity value for the decoys to the active_id
Return type:float
Parameters:active_id (string) – Identifier of the active to search against.
Raise:ValueError if active_id is not in the list of known actives.
getMinSimDecoyByActiveId(active_id)
Returns:(similarity, id) tuple for the decoy that is least similar to the active_id
Return type:tuple
Parameters:active_id (string) – Identifier of the active to search against.
Raise:ValueError if active_id is not in the list of known actives.
getSimHistogramByActiveId(active_id)
Returns:(histogram, bin_edges) tuple of numpy arrays for the best decoy similarities to the active_id
Return type:tuple
Parameters:active_id (string) – Identifier of the active to search against.
Raise:ValueError if active_id is not in the list of known actives.
getStdSimDecoyByActiveId(active_id)
Returns:The standard deviation of the mean similarity value for the decoys to the active_id
Return type:float
Parameters:active_id (string) – Identifier of the active to search against.
Raise:ValueError if active_id is not in the list of known actives.
writeCsv(csv_filename='active_decoy_fp_analysis.csv')
Parameters:csv_filename (string) – Path of the file to write.

Write a CSV file that summarizes the fingerprint analysis. The headers are:

  • Active Id
  • Max. Decoy Similarity Score
  • Max. Decoy Similarity Id
  • Min. Decoy Similarity Score
  • Min. Decoy Similarity Id
  • Mean Decoy Similarity
  • Mean Max. Active-Decoy Similarity
  • Mean Max. Decoy-Decoy Similarity
  • Num. Obs. Decoy Ids
  • Num. Obs. Active Ids
class schrodinger.analysis.enrichment.Calculator(actives_file_name, results, total_decoys, legend_label=None)

Bases: object

A class to calculate enrichment terms for a screen.

API examples:

from schrodinger import structure
from schrodinger.analysis import enrichment

# Ex. 1)  Reading screen result data from file.
efcalc = enrichment.Calculator(
    actives_file_name="my_actives.txt",  # Active titles, one per line.
    results="screen_results.rept",  # Glide report file.
    total_decoys=1000
)
efcalc.calculateMetrics()  # Calculate a default suite of terms.
efcalc.report()  # Print default report to standard out.
efcalc.savePlot()  # Create default graph png.
print(efcalc.calcBEDROC(alpha=20))  # Print the (BEDROC, alpha*Ra) tuple.

# Ex. 2) Using a structure sequence as screen result data.
results = []
for st in structure.StructureReader('iglur_dock_pv.maegz', 2):
    results.append(st)

efcalc = enrichment.Calculator(
    actives_file_name="my_actives.txt",  # Active titles, one per line.
    results=results,  # Iterable sequence of structure.Structure objects.
    total_decoys=1000
)
efcalc.calculateMetrics()
efcalc.report()
Variables:
  • table_sep (str) – Token used to separate column fields. Default is None, i.e. whitespace.
  • rept_file_ext (list(str)) – List of parsable Glide report file extensions.
  • csv_file_ext (list(str)) – List of parsable csv file extensions.
  • table_file_ext (list(str)) – List of parsable table file extensions.
  • ef_precision (int) – Number of decimals when reporting EF values. Default = 2
  • efp_precision (int) – Number of decimals when reporting EF’ values. Default = 2
  • efs_precision (int) – Number of decimals when reporting EF* values. Default = 2
  • eff_precision (int) – Number of decimals when reporting Eff values. Default = 3
  • fod_precision (int) – Number of decimals when reporting FOD values. Default = 1
  • total_actives (int) – The number of all active ligands in the screen, ranked and unranked.
  • total_ligands (int) – The total number of ligands (actives and unknowns/decoys) used in the screen.
  • active_ranks (list(int)) – List of unadjusted integer ranks for the actives found in the screen. For example, a screen result that placed three actives as the first three ranks has an active_ranks list of [1, 2, 3].
  • adjusted_active_ranks (list(int)) – Modified active ranks; each rank is improved by the number of preceding actives. For example, a screen result that placed three actives as the first three ranks, [1, 2, 3], has adjusted ranks of [1, 1, 1]. In this way, actives are not penalized by being outranked by other actives.
  • active_titles (list(str)) – The titles of the known actives in the screen.
  • missing_active_titles (list(str)) – The titles of ligands not discovered in the screen results.
  • title_ranks (dict(str, int)) – Maps each title to its unadjusted integer rank. Not available for table inputs, or other screen results that don’t list the title.
  • active_fingerprint (dict) – Maps each title to its fingerprint. Not available for screen results that don’t include title and structure information.
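The relationship between active_ranks and adjusted_active_ranks can be sketched in a few lines. The helper adjust_active_ranks is a hypothetical name used only for illustration; it simply subtracts the number of preceding actives from each rank, as described above.

```python
def adjust_active_ranks(active_ranks):
    """Improve each active's rank by the number of actives ranked before it."""
    ordered = sorted(active_ranks)
    return [rank - n_preceding for n_preceding, rank in enumerate(ordered)]


adjusted = adjust_active_ranks([1, 2, 3])
```

With this adjustment an active at rank 9 that is preceded by two other actives is treated as if it were at rank 7.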
calcAUAC()
Returns:A float representation of the Area Under the Accumulation Curve.
Return type:float
calcActivesInN(n_sampled_set)
Returns:the number of the known active ligands found in a given sample size.
Parameters:n_sampled_set (integer) – The number of rank results for which to calculate the metric. Every active with a rank less than or equal to this value will be counted as found in the set.
calcActivesInNStar(n_sampled_set)
Returns:the number of the known active ligands found in a given sample size.
Parameters:n_sampled_set (integer) – The number of rank results for which to calculate the metric. Every active with a rank less than or equal to this value will be counted as found in the set.
calcAveNumberOutrankingDecoys()
Returns:the average number of decoys that outranked the actives.

The rank of each active is adjusted by the number of outranking actives. The number of outranking decoys is then defined as the adjusted rank of that active minus one. The number of outranking decoys is calculated for each docked active and averaged.
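A minimal sketch of this average, assuming the input is the list of unadjusted ranks of the docked actives. The function name is hypothetical and the handling of an empty list (returning None) is an assumption.

```python
def average_outranking_decoys(active_ranks):
    """Average number of decoys ranked ahead of each docked active."""
    ordered = sorted(active_ranks)
    if not ordered:
        return None
    # Adjusted rank = rank - number of preceding actives;
    # outranking decoys = adjusted rank - 1.
    outranking = [rank - n_preceding - 1
                  for n_preceding, rank in enumerate(ordered)]
    return sum(outranking) / float(len(outranking))


perfect = average_outranking_decoys([1, 2, 3])
mixed = average_outranking_decoys([2, 5, 9])
```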

calcBEDROC(alpha=20.0)
Returns:a tuple of two floats, the first represents the area under the curve for the Boltzmann-enhanced discrimination of ROC (BEDROC) analysis, the second is the alpha*Ra term.
Parameters:alpha (float) – Exponential prefactor for adjusting early enrichment emphasis. Larger values more heavily weight the early ranks. alpha=20 weights the first ~8% of the screen, alpha=10 weights the first ~16% of the screen, alpha=50 weights the first ~3% of the screen results.
Raise:ValueError if alpha <= 0.
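For orientation, the sketch below computes BEDROC as RIE rescaled between its minimum and maximum possible values, following the published Truchon and Bayly definition. It is an independent illustration, not this module's implementation: the function names are hypothetical, every active is assumed to be ranked, and total_ligands is assumed to be the number of actives plus decoys.

```python
import math


def rie(active_ranks, total_ligands, alpha=20.0):
    """Observed mean exponential weight divided by its random expectation."""
    n_actives = len(active_ranks)
    n_total = float(total_ligands)
    observed = sum(math.exp(-alpha * r / n_total)
                   for r in active_ranks) / n_actives
    random_expectation = ((1.0 - math.exp(-alpha))
                          / (n_total * (math.exp(alpha / n_total) - 1.0)))
    return observed / random_expectation


def bedroc(active_ranks, total_ligands, alpha=20.0):
    """Return (BEDROC, alpha * Ra); raises ValueError if alpha <= 0."""
    if alpha <= 0:
        raise ValueError("alpha must be greater than 0")
    ra = len(active_ranks) / float(total_ligands)
    # RIE when all actives are ranked last (min) or first (max).
    rie_min = (1.0 - math.exp(alpha * ra)) / (ra * (1.0 - math.exp(alpha)))
    rie_max = (1.0 - math.exp(-alpha * ra)) / (ra * (1.0 - math.exp(-alpha)))
    value = ((rie(active_ranks, total_ligands, alpha) - rie_min)
             / (rie_max - rie_min))
    return value, alpha * ra


best, alpha_ra = bedroc(range(1, 11), 1000)
worst, _ = bedroc(range(991, 1001), 1000)
```

Ten actives in the first ten of 1000 ranks give a value of essentially 1, and the same actives in the last ten ranks give essentially 0.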
calcDEF(n_sampled_set, min_actives=None)
Returns:

Diverse Enrichment factor (DEF) for the given sample size of the screen results. If fewer than min_actives are found in the set, or the calculation raises a ZeroDivisionError, the returned value is None.

Parameters:
  • n_sampled_set (integer) – The number of ranked results for which to calculate the enrichment factor.
  • min_actives (integer) – The number of actives that must be within the n_sampled_set, otherwise the returned EF value is None.

DEF is defined as:

            1 - (min_similarity_among_actives_in_sampled_set)
DEF = EF * --------------------------------------------------
            1 - (min_similarity_among_all_actives)

where ‘n_sampled_set’ is the number of all ranks in which to search for actives.
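The DEF scaling can be sketched as a small function that takes an already computed EF and the two minimum similarities. The function name and argument names are hypothetical, and returning None for a zero denominator mirrors the behaviour described above rather than the actual implementation.

```python
def diverse_enrichment_factor(ef, min_sim_sampled, min_sim_all):
    """Scale an enrichment factor by the diversity of the recovered actives.

    min_sim_sampled: minimum similarity among actives in the sampled set.
    min_sim_all: minimum similarity among all actives.
    """
    if ef is None:
        return None
    try:
        return ef * (1.0 - min_sim_sampled) / (1.0 - min_sim_all)
    except ZeroDivisionError:
        return None


def_value = diverse_enrichment_factor(10.0, 0.4, 0.2)
```

When the sampled actives are as diverse as the full active set the ratio is 1 and DEF equals EF; a less diverse sample lowers the value.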

calcDEFP(n_sampled_decoy_set, min_actives=None)
Returns:

Diverse Enrichment Factor prime (DEF’) for a given sample size. If fewer than min_actives are found in the set the returned value is None.

Parameters:
  • n_sampled_decoy_set (integer) – The number of ranked decoy results for which to calculate the enrichment factor.
  • min_actives (integer) – The number of actives that must be within the n_sampled_decoy_set, otherwise the returned EF’ value is None.

DEF’ is defined as:

             1 - (min_similarity_among_actives_in_sampled_set)
DEF' = EF' * --------------------------------------------------
                  1 - (min_similarity_among_all_actives)
calcDEFStar(n_sampled_decoy_set, min_actives=None)
Returns:

Diverse Enrichment factor (DEF*) for the given sample size of the screen results, calculated with respect to the total decoys instead of the more traditional total ligands. If fewer than min_actives are found in the set the returned value is None.

Parameters:
  • n_sampled_decoy_set (integer) – The number of ranked decoys for which to calculate the enrichment factor.
  • min_actives (integer) – The number of actives that must be within the n_sampled_decoy_set, otherwise the returned EF value is None.

Here, DEF* is defined as:

                  1 - (min_similarity_among_actives_in_sampled_set)
DEF* = EF_star * --------------------------------------------------
                       1 - (min_similarity_among_all_actives)

where ‘n_sampled_decoy_set’ is the number of decoy ranks in which to search for actives.

calcEF(n_sampled_set, min_actives=None)
Returns:

the Enrichment factor (EF) for the given sample size of the screen results. If fewer than min_actives are found in the set, or the calculation raises a ZeroDivisionError, the returned value is None.

Parameters:
  • n_sampled_set (integer) – The number of ranked results for which to calculate the enrichment factor.
  • min_actives (integer) – The number of actives that must be within the n_sampled_set, otherwise the returned EF value is None.

EF is defined as:

      n_actives_in_sampled_set / n_sampled_set
EF =  ----------------------------------------
           total_actives / total_ligands

where ‘n_sampled_set’ is the number of all ranks in which to search for actives.
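The EF definition translates directly into code. The sketch below is a standalone illustration with hypothetical names; it takes the unadjusted active ranks and the totals as plain arguments instead of reading them from a Calculator instance.

```python
def enrichment_factor(active_ranks, n_sampled_set, total_actives,
                      total_ligands, min_actives=None):
    """EF = (actives found / sample size) / (total actives / total ligands)."""
    found = sum(1 for rank in active_ranks if rank <= n_sampled_set)
    if min_actives is not None and found < min_actives:
        return None
    try:
        return ((found / float(n_sampled_set))
                / (total_actives / float(total_ligands)))
    except ZeroDivisionError:
        return None


# Three actives in the top 10 of a 100-ligand screen with 3 actives in total.
ef_top10 = enrichment_factor([1, 2, 3], 10, 3, 100)
```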

calcEFF(fraction_of_decoys)
Returns:a float for the active recovery Efficiency (EFF) at a particular sample set size. The returned value is None if the calculation raises a ZeroDivisionError.
Parameters:fraction_of_decoys (float) – The size of the set is in terms of the number of decoys in the screen. For example, given 1000 decoys and fraction_of_decoys=0.20, actives that appear within the first 200 ranks are counted.

EFF is defined as:

                   frac. actives in sample
EFF = (2* -----------------------------------------------) - 1
          frac actives in sample + frac. decoys in sample
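A minimal sketch of the EFF expression, assuming the two fractions have already been computed for the sampled set. The function name is hypothetical.

```python
def active_recovery_efficiency(frac_actives, frac_decoys):
    """EFF = 2 * fa / (fa + fd) - 1, or None if the denominator is zero."""
    try:
        return 2.0 * frac_actives / (frac_actives + frac_decoys) - 1.0
    except ZeroDivisionError:
        return None


balanced = active_recovery_efficiency(0.5, 0.5)
enriched = active_recovery_efficiency(0.8, 0.2)
```

The value ranges from -1 (only decoys recovered) through 0 (actives and decoys recovered in equal fractions) to 1 (only actives recovered).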
calcEFP(n_sampled_decoy_set, min_actives=None)
Returns:

the Enrichment Factor prime (EF’) for a given sample size. If fewer than min_actives are found in the set the returned value is None.

Parameters:
  • n_sampled_decoy_set (integer) – The number of ranked decoy results for which to calculate the enrichment factor.
  • min_actives (integer) – The number of actives that must be within the n_sampled_decoy_set, otherwise the returned EF’ value is None.

EF’ is defined as:

                n_actives_sampled_set
EF' = ------------------------------------------
      cumulative_sum(frac. decoys/frac. actives)
calcEFStar(n_sampled_decoy_set, min_actives=None)
Returns:

the Enrichment factor* (EF*) for the given sample size of the screen results, calculated with respect to the total decoys instead of the more traditional total ligands. If fewer than min_actives are found in the set the returned value is None.

Parameters:
  • n_sampled_decoy_set (integer) – The number of ranked decoys for which to calculate the enrichment factor.
  • min_actives (integer) – The number of actives that must be within the n_sampled_decoy_set, otherwise the returned EF value is None.

Here, EF* is defined as:

       n_actives_in_sampled_set / n_sampled_decoy_set
EF* =  ----------------------------------------------
            total_actives / total_decoys

where ‘n_sampled_decoy_set’ is the number of decoy ranks in which to search for actives.

calcFOD(fraction_of_actives)
Returns:the average fraction of decoys outranking the given fraction, provided as a float, of known active ligands. The returned value is None if a) the calculation raises a ZeroDivisionError, or b) fraction_of_actives requires more actives than are ranked, or c) fraction_of_actives is greater than 1.0.
Parameters:fraction_of_actives (float) – Decimal notation of the fraction of sampled actives, used to set the sampled set size.

FOD is defined as:

                     __
           1         \    number_outranking_decoys_in_sampled_set
FOD = -------------  /   ---------------------------------------
       num_actives   --         total_decoys
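The FOD sum can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the function name is hypothetical, the number of sampled actives is taken as the rounded product of fraction_of_actives and total_actives, and the outranking decoys for each active are derived from its adjusted rank.

```python
def fraction_outranking_decoys(active_ranks, total_actives, total_decoys,
                               fraction_of_actives):
    """Average fraction of decoys ranked ahead of the sampled actives."""
    if fraction_of_actives > 1.0:
        return None
    ordered = sorted(active_ranks)
    # Rounding to the nearest whole active is an assumption of this sketch.
    n_sampled = int(round(fraction_of_actives * total_actives))
    if n_sampled > len(ordered):
        return None
    try:
        total = 0.0
        for n_preceding, rank in enumerate(ordered[:n_sampled]):
            outranking_decoys = rank - n_preceding - 1
            total += outranking_decoys / float(total_decoys)
        return total / n_sampled
    except ZeroDivisionError:
        return None


# Three of four actives ranked, ten decoys, sample half of the actives.
fod_50 = fraction_outranking_decoys([2, 5, 9], 4, 10, 0.5)
```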
calcMWUROC(alpha=0.05)
Returns:tuple of ROC AUC, the standard error, and estimated confidence interval (lower and upper bounds).
Parameters:alpha (float) – the significance level. Default is 0.05 (95% confidence interval)

Here, the ROC AUC is based on the Mann-Whitney-Wilcoxon U. The U value is calculated directly:

U = R - ((n_a(n_a+1))/2)
# where n_a is the number of actives and R is the sum of their ranks. 

ROC AUC = ((n_a*n_i) - U)/(n_a*n_i)
# n_a is the number of actives, n_i is the number of decoys.

SE = sqrt((A(1-A) + (n_a-1)(Q - A^2) + (n_i -1)(q - A^2))/(n_a*n_i))

CI = SE * scipy.stats.t.ppf((1+(1-alpha))/2.0, ((n_a+n_i)-1))
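The U, ROC AUC, and SE expressions above can be checked with a short standalone sketch. Q and q are not defined in the text; the sketch assumes the Hanley and McNeil forms Q = A/(2-A) and q = 2A^2/(1+A), and that assumption should be verified against the source code. The function name is hypothetical, and the confidence interval step is omitted to keep the example free of third-party dependencies.

```python
import math


def mwu_roc(active_ranks, n_decoys):
    """Return (ROC AUC, standard error) from unadjusted active ranks."""
    n_a = len(active_ranks)
    n_i = n_decoys
    rank_sum = sum(active_ranks)
    # U counts the (decoy, active) pairs in which the decoy outranks the active.
    u = rank_sum - n_a * (n_a + 1) / 2.0
    auc = (n_a * n_i - u) / float(n_a * n_i)
    # Assumed Hanley-McNeil terms; not defined in the documentation above.
    big_q = auc / (2.0 - auc)
    small_q = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (auc * (1.0 - auc)
                + (n_a - 1) * (big_q - auc ** 2)
                + (n_i - 1) * (small_q - auc ** 2)) / float(n_a * n_i)
    return auc, math.sqrt(max(variance, 0.0))


auc_perfect, se_perfect = mwu_roc([1, 2, 3], 7)
auc_mixed, se_mixed = mwu_roc([1, 3], 2)
```

In the second call the actives sit at ranks 1 and 3 of four ligands, so three of the four active-decoy pairs are ordered correctly.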
calcRIE(alpha=20.0)
Returns:a float for the Robust Initial Enhancement (RIE).
Return type:float
Parameters:alpha (float) – Exponential prefactor for adjusting early enrichment emphasis. Larger values more heavily weight the early ranks. alpha=20 weights the first ~8% of the screen, alpha=10 weights the first ~16% of the screen, alpha=50 weights the first ~3% of the screen results.
calcROC()
Return type:float
Returns:A representation of the Receiver Operator Characteristic area underneath the curve. Typically interpreted as the probability an active will appear before an inactive. A value of 1.0 reflects ideal performance, a value of 0.5 reflects a performance on par with random selection.

Classically ROC area is defined as:

       AUAC     Ra
ROC = ------ - -----
        Ri      2Ri

Where AUAC is the area under the accumulation curve, Ri is the ratio of inactives, Ra is the ratio of actives.

A different method is used here in order to account for unranked actives - see PYTHON-3055 & PYTHON-3106

calculateMetrics()

Sets a suite of enrichment factor terms as instance data members. See cmdline_doc for description of metrics.

The standard suite includes the attributes:

  • self.ave_num_outranking_decoys
  • self.bedroc20 (alpha=20.0)
  • self.bedroc160_9 (alpha=160.9)
  • self.bedroc8_0 (alpha=8.0)
  • self.roc
  • self.rie
  • self.auac
  • self.ef_40 (EF 40% of actives)
  • self.ef_50 (EF 50% of actives)
  • self.ef_60 (EF 60% of actives)
  • self.ef_70 (EF 70% of actives)
  • self.ef_80 (EF 80% of actives)
  • self.ef_90 (EF 90% of actives)
  • self.ef_100 (EF 100% of actives)
  • self.ef_1pct (EF top 1% of total ligands)
  • self.ef_2pct (EF top 2% of total ligands)
  • self.ef_5pct (EF top 5% of total ligands)
  • self.ef_10pct (EF top 10% of total ligands)
  • self.ef_20pct (EF top 20% of total ligands)
  • self.efs_40 (EF* 40% of actives)
  • self.efs_50 (EF* 50% of actives)
  • self.efs_60 (EF* 60% of actives)
  • self.efs_70 (EF* 70% of actives)
  • self.efs_80 (EF* 80% of actives)
  • self.efs_90 (EF* 90% of actives)
  • self.efs_100 (EF* 100% of actives)
  • self.efs_1pct (EF* top 1% of total decoys)
  • self.efs_2pct (EF* top 2% of total decoys)
  • self.efs_5pct (EF* top 5% of total decoys)
  • self.efs_10pct (EF* top 10% of total decoys)
  • self.efs_20pct (EF* top 20% of total decoys)
  • self.efp_40 (EF’ 40% of actives)
  • self.efp_50 (EF’ 50% of actives)
  • self.efp_60 (EF’ 60% of actives)
  • self.efp_70 (EF’ 70% of actives)
  • self.efp_80 (EF’ 80% of actives)
  • self.efp_90 (EF’ 90% of actives)
  • self.efp_100 (EF’ 100% of actives)
  • self.efp_1pct (EF’ top 1% of total decoys)
  • self.efp_2pct (EF’ top 2% of total decoys)
  • self.efp_5pct (EF’ top 5% of total decoys)
  • self.efp_10pct (EF’ top 10% of total decoys)
  • self.efp_20pct (EF’ top 20% of total decoys)
  • self.fod_40 (FOD 40% of actives)
  • self.fod_50 (FOD 50% of actives)
  • self.fod_60 (FOD 60% of actives)
  • self.fod_70 (FOD 70% of actives)
  • self.fod_80 (FOD 80% of actives)
  • self.fod_90 (FOD 90% of actives)
  • self.fod_100 (FOD 100% of actives)
  • self.eff_1pct (Eff top 1% of total decoys)
  • self.eff_2pct (Eff top 2% of total decoys)
  • self.eff_5pct (Eff top 5% of total decoys)
  • self.eff_10pct (Eff top 10% of total decoys)
  • self.eff_20pct (Eff top 20% of total decoys)
  • self.actives_in_top_1_pct (of total ligands)
  • self.actives_in_top_2_pct (of total ligands)
  • self.actives_in_top_5_pct (of total ligands)
  • self.actives_in_top_10_pct (of total ligands)
  • self.actives_in_top_20_pct (of total ligands)
  • self.pct_actives_in_top_1_pct (of total ligands)
  • self.pct_actives_in_top_2_pct (of total ligands)
  • self.pct_actives_in_top_5_pct (of total ligands)
  • self.pct_actives_in_top_10_pct (of total ligands)
  • self.pct_actives_in_top_20_pct (of total ligands)
  • self.actives_in_top_1_pct_star (of total decoys)
  • self.actives_in_top_2_pct_star (of total decoys)
  • self.actives_in_top_5_pct_star (of total decoys)
  • self.actives_in_top_10_pct_star (of total decoys)
  • self.actives_in_top_20_pct_star (of total decoys)
  • self.pct_actives_in_top_1_pct_star (of total decoys)
  • self.pct_actives_in_top_2_pct_star (of total decoys)
  • self.pct_actives_in_top_5_pct_star (of total decoys)
  • self.pct_actives_in_top_10_pct_star (of total decoys)
  • self.pct_actives_in_top_20_pct_star (of total decoys)
  • self.def_1pct (DEF top 1% of actives)
  • self.def_2pct (DEF top 2% of actives)
  • self.def_5pct (DEF top 5% of actives)
  • self.def_10pct (DEF top 10% of actives)
  • self.def_20pct (DEF top 20% of actives)
  • self.defs_1pct (DEF* top 1% of actives)
  • self.defs_2pct (DEF* top 2% of actives)
  • self.defs_5pct (DEF* top 5% of actives)
  • self.defs_10pct (DEF* top 10% of actives)
  • self.defs_20pct (DEF* top 20% of actives)
  • self.defp_1pct (DEF’ top 1% of total decoys)
  • self.defp_2pct (DEF’ top 2% of total decoys)
  • self.defp_5pct (DEF’ top 5% of total decoys)
  • self.defp_10pct (DEF’ top 10% of total decoys)
  • self.defp_20pct (DEF’ top 20% of total decoys)
calculateSensitivity(rank)
Calculates sensitivity at a particular rank, defined as:
Se(rank) = found_actives / total_actives
Parameters:rank (int) – active rank at which to calculate the sensitivity
Returns:sensitivity of the screen at a given rank
Return type:float
calculateSpecificity(rank)
Calculates specificity at a particular rank, defined as:
Sp(rank) = discarded_decoys / total_decoys
Parameters:rank (int) – active rank at which to calculate the specificity
Returns:specificity of the screen at a given rank
Return type:float
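The two definitions can be sketched together. The function below is a hypothetical standalone helper; it assumes every rank up to and including the requested rank is occupied by either an active or a decoy, so the decoys selected so far equal the rank minus the actives found.

```python
def sensitivity_specificity(active_ranks, rank, total_actives, total_decoys):
    """Return (Se, Sp) at the given rank."""
    found_actives = sum(1 for r in active_ranks if r <= rank)
    selected_decoys = rank - found_actives
    discarded_decoys = total_decoys - selected_decoys
    sensitivity = found_actives / float(total_actives)
    specificity = discarded_decoys / float(total_decoys)
    return sensitivity, specificity


# Actives at ranks 1 and 3 in a screen with 2 actives and 8 decoys.
se, sp = sensitivity_specificity([1, 3], 3, 2, 8)
```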
csv_file_ext = ['.csv', '.CSV']
ef_precision = 2
eff_precision = 3
efp_precision = 2
efs_precision = 2
fod_precision = 1
format(value, precision=2)
Returns:

a string representation of the passed value. If the value is None then the returned string is ‘n/a’. Uses %g formatting idiom so large values are returned as exponentials.

Parameters:
  • value (float or None) – Float value to format as string.
  • precision (integer) – Number of digits after the decimal.
getActiveRankCsvRows()
Returns:a list of active Title, Rank, Sensitivity, Specificity, %Actives Found, %Screen tuples.
Return type:list
Note:this list may grow, but the relative order of the columns should remain fixed.
getCsvRows()
Returns:a list of header and enrichment value tuples.
Return type:list
getPercentScreenCurvePoints()
Returns:List of (%Screen, %Actives Found) tuples for the active ranks.
getROCAreaRomberg(lower_limit=0.0, upper_limit=1.0)
Returns:Receiver Operator Characteristic area under the curve as defined by a Romberg integration between arbitrary points along 1-Sp (domain: 0-1).
Return type:float
Raise:ValueError if lower_limit is less than 0, or lower_limit is greater than upper_limit.
getROCCurvePoints()

Calculates set of points in ROC curve along each active rank.

Returns:list of (1 - specificity, sensitivity, rank) tuples
Return type:list of tuples
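A sketch of how such points could be generated, one per active rank. The function name is hypothetical and the tuple layout follows the (1 - specificity, sensitivity, rank) order given above; whether the real method adds end points such as (0, 0) or (1, 1) is not stated, so none are added here.

```python
def roc_curve_points(active_ranks, total_actives, total_decoys):
    """Return (1 - specificity, sensitivity, rank) for each active rank."""
    points = []
    for n_found, rank in enumerate(sorted(active_ranks), start=1):
        # Decoys ranked at or above this active's position.
        decoys_selected = rank - n_found
        points.append((decoys_selected / float(total_decoys),
                       n_found / float(total_actives),
                       rank))
    return points


points = roc_curve_points([1, 3], 2, 8)
```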
max_ef_value
classmethod parseCanvasCsv(input_csv)
Returns:A list of csv subfiles generated by parsing the input csv file as a Canvas Similarity Matrix. The output sub-file names have the form <basename>.<index>.<title>.csv. They are sorted by descending values (1.0->0.0).
Return type:list
Parameters:input_csv (string) – Path to the file to parse. First column contains the titles for the hits, The second and subsequent columns are the probes (active compounds).
parseInput()

Sets instance data members from parsed input actives and results files.

Raise:TypeError if the input can’t be parsed.
report(file_handle=<open file '<stdout>', mode 'w'>, header='', footer='')
Returns:None. Prints text summary of results to the file_handle.
Parameters:file_handle (file) – File handle-like object, default is sys.stdout.
rept_file_ext = ['.rept']
savePlot(png_file='plot.png', title='Screen Results', xlabel='1-Specificity', ylabel='Sensitivity')
Returns:

None. Saves an image of the ROC plot, Sensitivity vs. 1-Specificity, to a png file.

Parameters:
  • png_file (string) – Path to output file, default is ‘plot.png’.
  • title (string) – Plot title, default is ‘Screen Results’.
  • xlabel (string) – x-axis label, default is ‘1-Specificity’.
  • ylabel (string) – y-axis label, default is ‘Sensitivity’.
table_file_ext = ['.tbl', '.txt']
table_sep = None
title_rank_re = <_sre.SRE_Pattern object>
class schrodinger.analysis.enrichment.FingerprintFromFileGenerator(file_name, fp_gen, id_prop='s_m_title', prop_filter=None)

Class to generate (identifier, fingerprint) items, one at a time, from a structure file. Optional property filtering controls which structures are emitted as fingerprints. By default only unique structures, as judged by id_prop, are generated.

API Examples:

# Only unique fingerprint for structures with a title of 'foo',
# 'bar', or 'baz' are generated.
activefps = FingerprintFromFileGenerator(
    structure.StructureReader('mystructures.mae'),
    fp_gen,
    prop_filter={'s_m_title': ['foo', 'bar', 'baz']},
)
for id, fp in activefps:
    process_fingerprint(fp)

# Only unique fingerprint for structures that don't have a title
# of 'foo', 'bar', or 'baz' are generated.
inactivefps = FingerprintFromFileGenerator(
    structure.StructureReader('mystructures.mae'),
    fp_gen,
    prop_filter={'s_m_title': ['foo', 'bar', 'baz']},
)
inactivefps.setInvertFilter()
for id, fp in inactivefps:
    process_fingerprint(fp)
setFilter()
setInvertFilter()
setNonUniqueIds()

Generate fingerprints for all occurrences of structures that pass the prop_filter, including repeated <id_prop> values.

setUniqueIdsOnly()

Only generate unique occurrences of <id_prop>

exception schrodinger.analysis.enrichment.NoActivesRankedException

Bases: exceptions.ValueError

class schrodinger.analysis.enrichment.PercentScreenPlotter(calcs, title='Screen Results', xlabel='Percent Screen', ylabel='Percent Actives Found')

Bases: schrodinger.analysis.enrichment._BasePlotter

A class to plot multiple series of Calculator data as %Actives Found vs %Screen.

API example where enrich_calc1 and enrich_calc2 are instances of Calculator:

enrich_plotter = enrichment.PercentScreenPlotter([enrich_calc1, enrich_calc2])
enrich_plotter.plot() # Launch interactive plot window.
enrich_plotter.savePlot('my_plot.png') # Save plot to file.

There are six line styles defined by default. Plotting more than six results cycles through the styles.

getPointsFromCalc(calc)

Returns points for this metric from the given Calculator instance.

Parameters:calc (Calculator) – Calculator instance.

Returns:List of (x, y) points
Return type:list

class schrodinger.analysis.enrichment.Plotter(calcs, title='Screen Results', xlabel='1-Specificity', ylabel='Sensitivity')

Bases: schrodinger.analysis.enrichment._BasePlotter

A class to plot multiple series of Calculator data.

API example where enrich_calc1 and enrich_calc2 are instances of Calculator:

enrich_plotter = enrichment.Plotter([enrich_calc1, enrich_calc2])
enrich_plotter.plot() # Launch interactive plot window.
enrich_plotter.savePlot('my_plot.png') # Save plot to file.

There are six line styles defined by default. Plotting more than six results cycles through the styles.

getPointsFromCalc(calc)

Returns points for this metric from the given Calculator instance.

Parameters:calc (Calculator) – Calculator instance.

Returns:List of (x, y) points
Return type:list

class schrodinger.analysis.enrichment.StatisticalSummary(identifier, values, histogram_range=(0.0, 1.0), histogram_bins=10)

Container class to store summary metrics describing an array of numerical values.

API example:

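import numpy
from schrodinger.analysis.enrichment import StatisticalSummary

# Use a float divisor so the values are not truncated by integer division.
values = [x / 1000.0 for x in range(1, 1001)]
stat_summary = StatisticalSummary(
    'test',
    numpy.array(values)
)
print(stat_summary.mean)
>> 0.50
print(stat_summary.std)
>> 0.288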
bin_edges

Numpy histogram bin edges of the original value array.

count

Number of items in the original value array.

first_quartile

First quartile of the original value array.

histogram

Numpy histogram of the original value array.

identifier
max

Maximum value of the original value array.

max_index

Index of the maximum value of the original value array.

mean

Mean of the original value array.

median

Second quartile of the original value array.

min

Minimum value of the original value array.

min_index

Index of the minimum value of the original value array.

std

Standard deviation of the original value array.

third_quartile

Third quartile of the original value array.

class schrodinger.analysis.enrichment.TitleEnrichmentCalculator(actives_file_name, results, total_decoys, legend_label=None)

Bases: schrodinger.analysis.enrichment.Calculator

SYNOPSIS

actives = 'one two three'.split()
ndecoys = 10
# we'll pretend that actives 'one' and 'two' docked successfully, as
# well as five of the decoys, in the order below:
results = 'one d1 two d2 d5 d6 d9'.split()

c = TitleEnrichmentCalculator(actives, results, ndecoys)
c.calculateMetrics()
c.report()

DESCRIPTION

This is a subclass of schrodinger.analysis.enrichment.Calculator that takes simple lists of titles (strings) instead of filenames or structure objects. This is meant as an optimization to reduce overhead when thousands of enrichment calculations need to be done on the fly.

The actives list is just a list of active ligand titles.

The results list is a list of ligand titles, sorted by rank.

When a title appears more than once, only the first occurrence is counted.
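How a ranked title list maps onto active ranks can be sketched in plain Python. The helper rank_actives is hypothetical, and it assumes that a repeated title is skipped without consuming a rank; the actual class may number ranks differently.

```python
def rank_actives(active_titles, result_titles):
    """Return (active_ranks, missing_active_titles) from ranked titles."""
    actives = set(active_titles)
    seen = set()
    active_ranks = []
    rank = 0
    for title in result_titles:
        if title in seen:
            continue  # Only the first occurrence of a title is counted.
        seen.add(title)
        rank += 1
        if title in actives:
            active_ranks.append(rank)
    missing = [title for title in active_titles if title not in seen]
    return active_ranks, missing


active_ranks, missing = rank_actives(
    'one two three'.split(),
    'one d1 two d2 d5 d6 d9'.split(),
)
```

For the lists used in the synopsis, actives 'one' and 'two' land at ranks 1 and 3, and 'three' is reported as missing.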

parseInput()
class schrodinger.analysis.enrichment.TwoGroupFingerprintAnalyzer(fp_group1, fp_group2, fp_sim)

Summarize the similarity differences within and between two groups of fingerprints. All i!=j pair-wise similarity comparisons are made within a group, and all pair-wise similarity comparisons are made between the groups. However, the list of similarity values for each item is compressed into a statistical digest (min, max, median, etc.)
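The within-group and between-group comparisons can be illustrated without Canvas. The sketch below is a stand-in under loud assumptions: fingerprints are modelled as Python sets of on-bit indices, similarity is Tanimoto, and the digest is a plain dict instead of a StatisticalSummary. All function names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / float(union) if union else 0.0


def digest(values):
    """Compress a non-empty list of similarities into min, max, and median."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2.0
    return {"min": ordered[0], "max": ordered[-1], "median": median}


def within_group_summaries(fps):
    """One digest per item, excluding the i == j self-comparison."""
    return [digest([tanimoto(fp_i, fp_j)
                    for j, fp_j in enumerate(fps) if i != j])
            for i, fp_i in enumerate(fps)]


def between_group_summaries(fps1, fps2):
    """One digest per group1 item against every group2 item."""
    return [digest([tanimoto(a, b) for b in fps2]) for a in fps1]


group1 = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
group2 = [{1, 2}, {8, 9}]
within1 = within_group_summaries(group1)
between = between_group_summaries(group1, group2)
within1_global_max = max(s["max"] for s in within1)
between_global_max = max(s["max"] for s in between)
```

The global maxima are taken over the per-item digests, which matches the way the *_global_max attributes are described for this class. A group with a single member has no i!=j pairs, so this sketch would need a guard for that case.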

API example:

# fp_gen is assumed to be an existing Canvas fingerprint generator instance.
fp_gen.setType('Linear')
fp_gen.setAtomBondTyping(fp_gen.getDefaultAtomTypingScheme())
group1 = []
group2 = []
# Title is the compound identifier in this example.
for st in structure.StructureReader('actives.mae'):
    item = (st.title, fp_gen.generate(st))
    group1.append(item)
for st in structure.StructureReader('decoys.mae'):
    item = (st.title, fp_gen.generate(st))
    group2.append(item)
fp_sim = canvas_sim.CanvasFingerprintSimilarity(logger=logger)
twogrpfp = enrichment.TwoGroupFingerprintAnalyzer(
    group1,
    group2,
    fp_sim,
)
twogrpfp.analyzeGroups()
print(twogrpfp.within_group1_global_max)
print(twogrpfp.within_group2_global_max)
print(twogrpfp.between_groups_global_max)
Variables:
  • within_group1_summary – List of within-group1 StatisticalSummary instances, in the same order and of the same length as the items emitted from fp_group1. i=j pairs are excluded from the analysis.
  • within_group2_summary – List of within-group2 StatisticalSummary instances, in the same order and of the same length as the items emitted from fp_group2. i=j pairs are excluded from the analysis.
  • between_groups_summary – List of between-group1-and-group2 StatisticalSummary instances, in the same order and of the same length as the items emitted from fp_group1. i=j pairs are included in the analysis.
  • group1_ids – Ordered list of group1 identifiers.
  • group2_ids – Ordered list of group2 identifiers.
analyzeBetweenGroups()
analyzeGroups()

Analyze within each group, and between groups.

analyzeWithinGroups()
between_groups_global_max

Maximum value over all between group1-2 summary members.

between_groups_global_min

Minimum value over all between group1-2 summary members.

within_group1_global_max

Maximum value over all group1 summary members.

within_group1_global_min

Minimum value over all group1 summary members.

within_group2_global_max

Maximum value over all group2 summary members.

within_group2_global_min

Minimum value over all group2 summary members.