schrodinger.analysis.enrichment.enrichment_input module

Input file parser for the enrichment module.

For most virtual screen result input formats, ligands are identified by title. The input is expected to be ordered by the virtual screen scoring metric. If it is not ordered, set the optional sort parameter of the parser function (sort_header for CSV input, sort_property for structure input) to the score header or property to sort on; the parameter descriptions below currently mark both as not implemented. If the file contains duplicate titles, only the first occurrence of each title is ranked.
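The ordering and duplicate-title rules above can be illustrated with a small standalone sketch. It does not call the Schrodinger API: the function name `rank_first_occurrences` and the (title, score) record layout are hypothetical, and two behaviors are assumptions rather than documented facts: that lower scores rank first, and that a skipped duplicate does not consume a rank position.

```python
def rank_first_occurrences(records, sort_by_score=False):
    """Return {title: rank} with 1-based ranks, keeping only the first
    occurrence of each title.

    records: iterable of (title, score) pairs (hypothetical layout).
    sort_by_score: if True, order by ascending score first
                   (assumption: lower score is better).
    """
    rows = list(records)
    if sort_by_score:
        # sorted() is stable, so ties keep their input order.
        rows = sorted(rows, key=lambda row: row[1])
    ranks = {}
    for title, _score in rows:
        if title not in ranks:
            # Assumption: duplicates are skipped without using up a rank.
            ranks[title] = len(ranks) + 1
    return ranks


unsorted = [("lig_b", -7.1), ("lig_a", -9.3), ("lig_b", -8.0), ("lig_c", -6.5)]
as_given = rank_first_occurrences(unsorted)
by_score = rank_first_occurrences(unsorted, sort_by_score=True)
```

With the input taken as given, lig_b is ranked first; after sorting on the score, lig_a moves to the top and the second lig_b entry is ignored.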

Input file formats:

<actives_file>
    Text file.
        Raw text, one title per line.
    Structure file.
        A file containing structures with a meaningful title.
    CSV file.
        A comma-separated values file.
    List(str).
        A list of active titles as strings.
<results_file>
    Structure file, e.g. 'foo_pv.mae'
        A file containing ordered structures.
    CSV file.
        A comma-separated values file containing ranked titles ordered by
        virtual screen scoring metric.
    List(str) or List(structure).
        A list of titles or structures, ordered by virtual screen scoring metric.

API examples:

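# Ex. 1) Calculate BEDROC
from schrodinger.analysis.enrichment import metrics
from schrodinger.analysis.enrichment.enrichment_input import (
    extract_active_titles_from_txt, extract_ranks_from_mae)

actives_file = "my_actives.txt"
active_titles = extract_active_titles_from_txt(actives_file)
(total_actives, total_ligands, active_ranks, adjusted_active_ranks,
    total_ranked, title_ranks) = extract_ranks_from_mae(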
        mae_file_name="screen_results.maegz",
        active_titles=active_titles,
        num_decoy=1000)
bedroc, bedroc_ra = metrics.calcBEDROC(total_actives, total_ligands,
                                       active_ranks, 20.0)

# Ex. 2) Using the reporter class to calculate the default set of metrics.
#        Note that this is not a good practice.
from schrodinger.analysis.enrichment import reporter

r = reporter.EnrichmentReporter(
    actives_file="my_actives.txt",
    results_file="screen_results.maegz",
    num_decoy=1000)
r.report()

Copyright Schrodinger, LLC. All rights reserved.

class schrodinger.analysis.enrichment.enrichment_input.FingerprintComponent(fp_gen, fp_sim, active_fingerprint, min_Tc_total_actives)[source]

Bases: object

Data class that contains critical objects that all fingerprint-related metrics functions (calc_DEF, calc_DEFStar and calc_DEFP) need.

Variables
  • fp_gen (CanvasFingerprintGenerator) – Object needed to generate fingerprint for each active title.

  • fp_sim (CanvasFingerprintSimilarity) – Object needed to compare fingerprint similarity for each active pair.

  • active_fingerprint (dict) – Dictionary mapping each active title to its fingerprint. Not available for screen results that don’t include title and structure information.

  • min_Tc_total_actives (float) – The lowest Tanimoto coefficient (Tc) among all pairwise similarities between actives.
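As a rough illustration of what min_Tc_total_actives represents, the sketch below computes the lowest pairwise Tanimoto coefficient over a title-to-fingerprint dictionary. It is a standalone approximation, not the Canvas implementation: fingerprints are modeled as plain Python sets of on-bit indices, and the helper names `tanimoto` and `min_pairwise_tc` are hypothetical.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


def min_pairwise_tc(fingerprints):
    """Lowest Tc over all distinct pairs in a {title: set_of_bits} mapping.

    Raises ValueError if fewer than two fingerprints are given.
    """
    return min(
        tanimoto(fingerprints[a], fingerprints[b])
        for a, b in combinations(fingerprints, 2)
    )


# Toy stand-in for the active_fingerprint dictionary (title -> fingerprint).
active_fingerprint = {
    "active_1": {1, 2, 3, 4},
    "active_2": {3, 4, 5},
    "active_3": {1, 2, 3},
}
min_tc = min_pairwise_tc(active_fingerprint)
```

Here the least similar pair is active_2 and active_3, which share one on bit out of five in their union.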

__init__(fp_gen, fp_sim, active_fingerprint, min_Tc_total_actives)[source]

Initialize self. See help(type(self)) for accurate signature.

schrodinger.analysis.enrichment.enrichment_input.extract_active_titles_from_csv(actives_file)[source]

Parse actives_file as a CSV file and return the distinct active titles. Repeated active titles are ignored.

Parameters

actives_file (str) – A CSV file containing all active titles.

Returns

Distinct active titles from the actives file.

Return type

set(str)

schrodinger.analysis.enrichment.enrichment_input.extract_active_titles_from_mae(actives_file)[source]

Parse actives_file as a Maestro file and return the distinct active titles. Repeated active titles are ignored.

Parameters

actives_file (str) – A Maestro file containing all active titles.

Returns

Distinct active titles from the actives file.

Return type

set(str)

schrodinger.analysis.enrichment.enrichment_input.extract_active_titles_from_txt(actives_file)[source]

Parse actives_file as a raw text file with one title per line and return the distinct active titles. Repeated active titles are ignored.

Parameters

actives_file (str) – Raw text file containing one title per line.

Returns

Distinct active titles from the actives file.

Return type

set(str)
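The behavior described for text input (one title per line, duplicates collapsed into a set) can be approximated without the Schrodinger package as follows. This is only a sketch: `read_titles` is a hypothetical helper, and stripping whitespace and skipping blank lines are assumptions about how such a file would reasonably be handled.

```python
import os
import tempfile


def read_titles(path):
    """Return the set of non-empty, whitespace-stripped lines in a text file."""
    with open(path, encoding="utf-8") as handle:
        # Assumption: blank lines are ignored and surrounding whitespace removed.
        return {line.strip() for line in handle if line.strip()}


# Write a small actives file with one repeated title and a blank line.
with tempfile.TemporaryDirectory() as tmp_dir:
    actives_path = os.path.join(tmp_dir, "my_actives.txt")
    with open(actives_path, "w", encoding="utf-8") as handle:
        handle.write("active_1\nactive_2\nactive_1\n\n")
    titles = read_titles(actives_path)
```

The repeated active_1 line contributes only once to the resulting set.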

schrodinger.analysis.enrichment.enrichment_input.extract_active_titles_from_list(actives)[source]

Parse actives from a list of strings and return the distinct active titles. Repeated active titles are ignored.

Parameters

actives (list(str)) – A list of strings containing all active titles.

Returns

Distinct active titles from the list.

Return type

set(str)

schrodinger.analysis.enrichment.enrichment_input.extract_ranks_from_list(titles_iter, active_titles, num_decoy=0)[source]

Compute and return rank and count related terms from a list of ligand titles pre-sorted by virtual screen scoring metric.

Parameters
  • titles_iter (list(str)) – A list of title strings, pre-sorted by virtual screen scoring metric.

  • active_titles (set(str)) – Distinct active titles from the actives file.

  • num_decoy (int) – The total number of decoys. If specified, the total number of ligands is taken to be the number of distinct active titles in the actives file plus num_decoy. This enables calculation of the correction term in calc_AUAC when the total number of ligands does not equal the total number of ranked titles in results_file.

Returns

A tuple containing the total number of active titles, the total number of ligand titles, the active ranks, the adjusted active ranks, the total number of ranked titles, and a dictionary mapping active titles to their ranks.

Return type

int, int, list(int), list(int), int, dict(str, int)
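To make the returned terms more concrete, here is a minimal standalone sketch of how they relate to a pre-sorted title list. It is not the library's implementation. The name `summarize_ranks` is hypothetical; adjusted active ranks are omitted because this section does not define them; and the following are assumptions: duplicates are skipped without consuming a rank, unranked actives still count toward the total number of actives, and the total number of ligands falls back to the number of ranked titles when num_decoy is 0.

```python
def summarize_ranks(ranked_titles, active_titles, num_decoy=0):
    """Return (total_actives, total_ligands, active_ranks, total_ranked, title_ranks)."""
    seen = set()
    title_ranks = {}
    for title in ranked_titles:
        if title in seen:
            continue  # only the first occurrence of a title is ranked
        seen.add(title)
        if title in active_titles:
            title_ranks[title] = len(seen)  # 1-based rank among unique titles
    total_ranked = len(seen)
    total_actives = len(active_titles)
    if num_decoy:
        # Documented rule: distinct actives plus the stated number of decoys.
        total_ligands = total_actives + num_decoy
    else:
        # Assumption: without num_decoy, use the number of ranked titles.
        total_ligands = total_ranked
    active_ranks = sorted(title_ranks.values())
    return total_actives, total_ligands, active_ranks, total_ranked, title_ranks


ranked = ["a1", "d1", "a2", "d2", "a1", "d3"]
actives = {"a1", "a2", "a3"}
with_decoys = summarize_ranks(ranked, actives, num_decoy=10)
without_decoys = summarize_ranks(ranked, actives)
```

In the first call the total number of ligands (13) differs from the number of ranked titles (5), which is the situation the num_decoy description refers to.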

schrodinger.analysis.enrichment.enrichment_input.extract_ranks_from_csv(csv_file_name, active_titles, num_decoy=0, id_header='Title', sort_header=None)[source]

Compute and return rank and count related terms from a csv file.

Parameters
  • csv_file_name (str) – File name of the csv file that contains the virtual screening result.

  • active_titles (set(str)) – Distinct active titles from the actives file.

  • num_decoy (int) – The total number of decoys. If specified, the total number of ligands is taken to be the number of distinct active titles in the actives file plus num_decoy. This enables calculation of the correction term in calc_AUAC when the total number of ligands does not equal the total number of ranked titles in results_file.

  • id_header (str) – Name of compound-identifying header.

  • sort_header (str) – Name of the virtual screen scoring metric header to sort on. (not implemented)

Returns

A tuple containing the total number of active titles, the total number of ligand titles, the active ranks, the adjusted active ranks, the total number of ranked titles, and a dictionary mapping active titles to their ranks.

Return type

int, int, list(int), list(int), int, dict(str, int)
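The role of id_header can be sketched with the standard csv module: the column named by id_header supplies the ligand titles, read in file order. This is an illustrative stand-in rather than the library code; `titles_from_csv` and the docking_score column are hypothetical.

```python
import csv
import io


def titles_from_csv(csv_text, id_header="Title"):
    """Return the values of the id_header column, in file order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[id_header] for row in reader]


# Toy results file, already ordered by a hypothetical score column.
csv_text = "Title,docking_score\nlig_a,-9.3\nlig_b,-8.0\nlig_c,-6.5\n"
ordered_titles = titles_from_csv(csv_text)

# The same data with a different identifying column name.
alt_text = "Name,docking_score\nlig_a,-9.3\nlig_b,-8.0\n"
alt_titles = titles_from_csv(alt_text, id_header="Name")
```

A title list obtained this way has the shape that extract_ranks_from_list expects for titles_iter.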

schrodinger.analysis.enrichment.enrichment_input.extract_ranks_from_structures(structure_iter, active_titles, num_decoy=0, id_property='s_m_title', sort_property=None)[source]

Compute and return rank and count related terms from a list of structures.

Parameters
  • structure_iter (list(structure.Structure)) – A list of structure.Structure.

  • active_titles (set(str)) – Distinct active titles from the actives file.

  • num_decoy (int) – The total number of decoys. If specified, the total number of ligands is taken to be the number of distinct active titles in the actives file plus num_decoy. This enables calculation of the correction term in calc_AUAC when the total number of ligands does not equal the total number of ranked titles in results_file.

  • id_property (str) – Name of compound-identifying property.

  • sort_property (str) – Name of the virtual screen scoring metric property to sort on. (not implemented)

Returns

A tuple containing the total number of active titles, the total number of ligand titles, the active ranks, the adjusted active ranks, the total number of ranked titles, and a dictionary mapping active titles to their ranks.

Return type

int, int, list(int), list(int), int, dict(str, int)
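For structure input, id_property names the structure property that identifies each ligand. The sketch below mimics that lookup with stand-in objects so it runs without the Schrodinger package: each stand-in exposes a `property` dictionary, as structure.Structure does, while `titles_from_structures` and the custom property name s_user_compound_id are hypothetical.

```python
from types import SimpleNamespace


def titles_from_structures(structures, id_property="s_m_title"):
    """Return the identifying property of each structure, in input order."""
    return [st.property[id_property] for st in structures]


# Stand-ins for structure.Structure objects; only .property is modeled.
poses = [
    SimpleNamespace(property={"s_m_title": "lig_a", "s_user_compound_id": "CPD-1"}),
    SimpleNamespace(property={"s_m_title": "lig_b", "s_user_compound_id": "CPD-2"}),
]
by_title = titles_from_structures(poses)
by_custom_id = titles_from_structures(poses, id_property="s_user_compound_id")
```

Whichever property is chosen, its values need to match the entries in active_titles for the actives to be recognized.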

schrodinger.analysis.enrichment.enrichment_input.extract_ranks_from_mae(mae_file_name, active_titles, num_decoy=0, id_property='s_m_title', sort_property=None)[source]

Compute and return rank and count related terms from a structure file.

Parameters
  • mae_file_name (str) – A structure file that contains the virtual screening result.

  • active_titles (set(str)) – Distinct active titles from the actives file.

  • num_decoy (int) – The total number of decoys. If specified, the total number of ligands is taken to be the number of distinct active titles in the actives file plus num_decoy. This enables calculation of the correction term in calc_AUAC when the total number of ligands does not equal the total number of ranked titles in results_file.

  • id_property (str) – Name of compound-identifying property.

  • sort_property (str) – Name of the virtual screen scoring metric property to sort on. (not implemented)

Returns

A tuple containing the total number of active titles, the total number of ligand titles, the active ranks, the adjusted active ranks, the total number of ranked titles, and a dictionary mapping active titles to their ranks.

Return type

int, int, list(int), list(int), int, dict(str, int)

schrodinger.analysis.enrichment.enrichment_input.get_fingerprint_components(structure_file, active_titles, id_property='s_m_title')[source]

Initialize and return a data class object needed for fingerprint-related calculations.

Parameters
  • structure_file (str or list(str)) – Structure file or a list of structures.

  • active_titles (set(str)) – Distinct active titles from the actives file.

  • id_property (str) – Name of compound-identifying property.

Returns

The initialized enrichment_input.FingerprintComponent object.

Return type

enrichment_input.FingerprintComponent