
Protein target prediction using random forests and reliability-density neighbourhood analysis

Prediction IncluDinG INactivity (PIDGIN) Version 3

|License| |docstatus| |betarelease|

Author : Lewis Mervin, [email protected]

Supervisor : Dr. A. Bender

Protein target prediction using `Random Forests`_ (RFs) trained on bioactivity data from PubChem_ (extracted 07/06/18) and ChEMBL_ (version 24), built with RDKit_ and Scikit-learn_. The models employ a modification of the reliability-density neighbourhood Applicability Domain (AD) analysis by Aniceto [1]_. This project is the successor to PIDGIN `version 1`_ [2]_ and PIDGIN `version 2`_ [3]_. Target prediction with extended NCBI pathway and DisGeNET disease enrichment calculation is available, as implemented in [4]_.

  • Molecular Descriptors : `2048bit RDKit Extended Connectivity FingerPrints`_ (ECFP) [5]_
  • Algorithm: `Random Forests`_ with a dynamic number of trees (see docs for details), class weight = 'balanced', sample weight = ratio Inactive:Active
  • Models generated at four different cut-offs: 100 μM, 10 μM, 1 μM and 0.1 μM
  • Models generated both with and without mapping to orthologues, as implemented in [3]_
  • Pathway information from `NCBI BioSystems`_
  • Disease information from DisGeNET_
  • Target/pathway/disease enrichment calculated using Fisher's exact test and the Chi-squared test
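
The enrichment step in the last bullet can be illustrated with a one-sided Fisher's exact test on a 2x2 contingency table. The following is a minimal standard-library sketch under stated assumptions, not PIDGIN's own implementation: the function name ``fisher_exact_greater``, the table layout and the example counts are all hypothetical. The Chi-squared test is not shown.

.. code-block:: python

    from math import comb


    def fisher_exact_greater(a, b, c, d):
        """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

        Assumed (hypothetical) layout:
          a: query compounds predicted active for the target
          b: query compounds not predicted active
          c: background compounds predicted active
          d: background compounds not predicted active
        Returns P(X >= a) under the hypergeometric null with fixed margins.
        """
        row1 = a + b            # size of the query set
        col1 = a + c            # total number of predicted actives
        total = a + b + c + d   # all compounds in the table
        # Sum the probabilities of every table at least as extreme as observed.
        numerator = 0
        for k in range(a, min(row1, col1) + 1):
            numerator += comb(col1, k) * comb(total - col1, row1 - k)
        return numerator / comb(total, row1)


    # Hypothetical counts: 3 of 4 query compounds hit, versus 1 of 4 background.
    p_value = fisher_exact_greater(3, 1, 1, 3)
    print(f"{p_value:.4f}")  # prints 0.2429

A small p-value here would suggest that the target is predicted more often for the query compounds than for the background set.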

Model and data set sizes across all activity cut-offs:

+------------------------------------------------+-------------------------+---------------------------+
|                                                | Without orthologues     | With orthologues          |
+================================================+=========================+===========================+
| Distinct Models                                | 10,446                  | 14,678                    |
+------------------------------------------------+-------------------------+---------------------------+
| Distinct Targets [exhaustive total]            | 7,075 [7,075]           | 16,623 [60,437]           |
+------------------------------------------------+-------------------------+---------------------------+
| Total Bioactivities Over all models            | 39,424,168              | 398,340,769               |
+------------------------------------------------+-------------------------+---------------------------+
| Actives                                        | 3,204,038               | 35,009,629                |
+------------------------------------------------+-------------------------+---------------------------+
| Inactives [Of which are Sphere Exclusion (SE)] | 36,220,130 [27,435,133] | 363,331,140 [248,782,698] |
+------------------------------------------------+-------------------------+---------------------------+

Full details on all models are provided in the ``uniprot_information.txt`` files in the ``orthologue`` and ``no_orthologue`` directories.

INSTRUCTIONS

Development occurs on GitHub_.

Install with Conda

Documentation, installation and instructions are on ReadtheDocs_.

IMPORTANT

  • Use the ReadtheDocs! You MUST download the models before running!
  • The program accepts input either as line-separated SMILES (.smi/.smiles) or as an SD file (.sdf)
  • If the SMILES input contains data additional to the SMILES string, the first entries after the SMILES are automatically interpreted as identifiers (see the `OpenSMILES specification <http://opensmiles.org/opensmiles.html>`_ §4.5), although there are options to change this behaviour
  • Molecules are automatically standardized when running models (can be turned off)
  • Do not modify the 'pkls', 'ad_data' etc. names or directories
  • Files in the examples directory are included for testing, as used in the ReadtheDocs_ tutorials.
  • For installation and usage instructions, see the `documentation <http://pidginv3.readthedocs.io>`_.
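
The SMILES identifier convention in the list above can be sketched as follows. This is a minimal illustration rather than PIDGIN's actual parser: the function name ``parse_smiles_line`` and the fallback name ``compound_N`` are hypothetical, and the sketch assumes whitespace-delimited fields with only the first field after the SMILES taken as the identifier.

.. code-block:: python

    def parse_smiles_line(line, line_number):
        """Split one input line into a (smiles, identifier) tuple.

        Assumes whitespace-delimited fields: the first field is the SMILES
        string and the next field, if present, is used as the identifier.
        Lines without an identifier get a positional name (hypothetical).
        Returns None for blank lines.
        """
        fields = line.split()
        if not fields:
            return None
        smiles = fields[0]
        identifier = fields[1] if len(fields) > 1 else f"compound_{line_number}"
        return smiles, identifier


    # Hypothetical input: one line with extra data, one bare SMILES, one blank.
    lines = ["CC(=O)Oc1ccccc1C(=O)O aspirin 180.16", "c1ccccc1", ""]
    parsed = (parse_smiles_line(text, n) for n, text in enumerate(lines, start=1))
    records = [record for record in parsed if record is not None]

Any further columns after the identifier are ignored in this sketch; the program's own options for changing the identifier behaviour are described in the documentation.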

License

PIDGINv3 is available under the `GNU General Public License v3.0 <https://www.gnu.org/licenses/gpl.html>`_ (GPLv3).

References

.. [1] |aniceto|
.. [2] |mervin2015|
.. [3] |mervin2018|
.. [4] |mervin2016|
.. [5] |rogers|

.. _Random Forests: http://scikit-learn.org/0.19/modules/generated/sklearn.ensemble.RandomForestClassifier.html
.. _PubChem: https://pubchem.ncbi.nlm.nih.gov/
.. _ChEMBL: https://www.ebi.ac.uk/chembl/
.. _RDKit: http://www.rdkit.org
.. _Scikit-learn: http://scikit-learn.org/
.. _version 1: https://github.com/lhm30/PIDGIN
.. _version 2: https://github.com/lhm30/PIDGINv2
.. _no_ortho.zip: https://tinyurl.com/no-ortho
.. _`https://tinyurl.com/no-ortho`: https://tinyurl.com/no-ortho
.. _2048bit Rdkit Extended Connectivity FingerPrints: http://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints
.. _NCBI BioSystems: https://www.ncbi.nlm.nih.gov/Structure/biosystems/docs/biosystems_about.html
.. _DisGeNET: http://www.disgenet.org/web/DisGeNET/menu/dbinfo

.. |aniceto| replace:: Aniceto, N, et al. A novel applicability domain technique for mapping predictive reliability across the chemical space of a QSAR: Reliability-density neighbourhood. J. Cheminform. 8: 69 (2016). |aniceto_doi|
.. |aniceto_doi| image:: https://img.shields.io/badge/doi-10.1186%2Fs13321--016--0182--y-blue.svg
    :target: https://doi.org/10.1186/s13321-016-0182-y
.. |mervin2015| replace:: Mervin, L H., et al. Target prediction utilising negative bioactivity data covering large chemical space. J. Cheminform. 7: 51 (2015). |mervin2015_doi|
.. |mervin2015_doi| image:: https://img.shields.io/badge/doi-10.1186%2Fs13321--015--0098--y-blue.svg
    :target: https://doi.org/10.1186/s13321-015-0098-y
.. |mervin2016| replace:: Mervin, L H., et al. Understanding Cytotoxicity and Cytostaticity in a High-Throughput Screening Collection. ACS Chem. Biol. 11: 11 (2016) |mervin2016_doi|
.. |mervin2016_doi| image:: https://img.shields.io/badge/doi-10.1021%2Facschembio.6b00538-blue.svg
    :target: https://doi.org/10.1021/acschembio.6b00538
.. |mervin2018| replace:: Mervin, L H., et al. Orthologue chemical space and its influence on target prediction. Bioinformatics. 34: 72–79 (2018). |mervin2018_doi|
.. |mervin2018_doi| image:: https://img.shields.io/badge/doi-10.1093%2Fbioinformatics%2Fbtx525-blue.svg
    :target: https://doi.org/10.1093/bioinformatics/btx525
.. |rogers| replace:: Rogers D & Hahn M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50: 742-54 (2010). |rogers_doi|
.. |rogers_doi| image:: https://img.shields.io/badge/doi-10.1021/ci100050t-blue.svg
    :target: http://dx.doi.org/10.1021/ci100050t

.. _GitHub: https://github.com/lhm30/PIDGINv3
.. _Readthedocs: https://pidginv3.readthedocs.io/en/latest/
.. _flatkinson standardiser: https://github.com/flatkinson/standardiser
.. _models.zip:

.. |license| image:: https://img.shields.io/badge/license-GPLv3-blue.svg
    :target: https://github.com/lhm30/PIDGINv3/blob/master/LICENSE.txt
.. |docstatus| image:: https://readthedocs.org/projects/pidginv3/badge/?version=latest
    :target: https://pidginv3.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status
.. |betarelease| image:: https://zenodo.org/badge/142870938.svg
    :target: https://zenodo.org/badge/latestdoi/142870938