ChemBench
MoleculeNet benchmark dataset & MolMapNet dataset
In case you would like to cite this:
1. MolMapNet Dataset
- The following datasets are reported in the paper "Out-of-the-Box Deep Learning Prediction of Pharmaceutical Properties by Broadly Learned Knowledge-Based Molecular Representations"; see that paper for details of each dataset.
| Data Class | Dataset | No. of Molecules | No. of Tasks | Task Metric | Task Type |
|---|---|---|---|---|---|
| Physico-chemical | ESOL Water solubility | 1128 | 1 | RMSE | Regression |
| | FreeSolv Solvation free energy | 642 | 1 | RMSE | Regression |
| | Lipop Lipophilicity | 4200 | 1 | RMSE | Regression |
| Molecular binding | PDBbind-F, PDBbind-C, PDBbind-R Ligand-protein binding: full, core, refined (3 datasets) | 9880, 168, 3040 | 1 for each | RMSE | Regression |
| Bio-activity | PCBA PubChem HTS bioAssay | 437929 | 128 | PRC_AUC | Classification |
| | MUV PubChem bioAssay | 93087 | 17 | PRC_AUC | Classification |
| | ChEMBL bioassay activity dataset | 456331 | 1310 | ROC_AUC | Classification |
| | Cancer cell-line IC50 A2780, CCRF-CEM, DU-145, HCT-15, KB, LoVo, PC-3, SK-OV-3 (8 datasets) | 2255, 3047, 2512, 994, 2731, 1120, 4294, 1589 | 1 for each | R2 | Regression |
| | Malaria Anti-malarial EC50 | 9998 | 1 | RMSE | Regression |
| | BACE-1 benchmark set, ChEMBL novel set, ChEMBL common set, Clinical drugs | 1513, 395, 5324, 26 | 1 | ROC_AUC | Classification |
| | HIV replication inhibition | 41127 | 1 | ROC_AUC | Classification |
| Toxicity | Tox21 Toxicology in the 21st century | 7831 | 12 | ROC_AUC | Classification |
| | SIDER Adverse drug reactions of marketed drugs | 1427 | 27 | ROC_AUC | Classification |
| | ClinTox Clinical trial toxicity | 1478 | 2 | ROC_AUC | Classification |
| Pharmacokinetic | CYP PubChem BioAssay CYP 1A2, 2C9, 2C19, 2D6, 3A4 inhibition | 16896 | 5 | ROC_AUC | Classification |
| | LMC-H, LMC-R, LMC-M (Liver Microsomal Clearance in human, rat, mouse) | 8755 | 3 | R2 | Regression |
| | BBBP Blood-brain barrier penetration | 2039 | 1 | ROC_AUC | Classification |
2. Benchmark DataSet in MolNet and Chemprop
These benchmark datasets and their split indices were generated in this repo. The following table summarizes them.
| task_id | task_name | task_type | n_samples | n_task | split_method | n_cross_split | task_metrics |
|---|---|---|---|---|---|---|---|
| 01 | ESOL | regression | 1128 | 1 | random | 3 | RMSE |
| 02 | FreeSolv | regression | 642 | 1 | random | 3 | RMSE |
| 03 | Lipop | regression | 4200 | 1 | random | 3 | RMSE |
| 04 | PDBbind-full | regression | 9880 | 1 | time | 1 | RMSE |
| 05 | PDBbind-core | regression | 168 | 1 | time | 1 | RMSE |
| 06 | PDBbind-refined | regression | 3040 | 1 | time | 1 | RMSE |
| 07 | PCBA | classification | 437929 | 128 | random | 3 | PRC_AUC |
| 08 | MUV | classification | 93087 | 17 | random | 3 | PRC_AUC |
| 09 | HIV | classification | 41127 | 1 | scaffold | 3 | ROC_AUC |
| 10 | BACE | classification | 1513 | 1 | scaffold | 3 | ROC_AUC |
| 11 | BBBP | classification | 2039 | 1 | scaffold | 3 | ROC_AUC |
| 12 | Tox21 | classification | 7831 | 12 | random | 3 | ROC_AUC |
| 13 | ToxCast | classification | 8576 | 617 | random | 3 | ROC_AUC |
| 14 | SIDER | classification | 1427 | 27 | random | 3 | ROC_AUC |
| 15 | ClinTox | classification | 1478 | 2 | random | 3 | ROC_AUC |
| 16 | ChEMBL | classification | 456331 | 1310 | scaffold | 3 | ROC_AUC |
Installation
Direct installation:
pip install git+https://github.com/shenwanxiang/ChemBench.git
Developer installation:
git clone https://github.com/shenwanxiang/ChemBench.git
cd ChemBench
pip install -e .
Usage-1: Load the Dataset and MoleculeNet's Split Indices
from chembench import load_data
df, induces = load_data('ESOL')
# get the split indices of each of the 3 repeated random splits
train_idx, valid_idx, test_idx = induces[0]
train_idx, valid_idx, test_idx = induces[1]
train_idx, valid_idx, test_idx = induces[2]
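Before training, it can be worth confirming that the three index sets of a split do not overlap. The sketch below is not part of ChemBench: `check_split` is a hypothetical helper, and it assumes that each entry of `induces` is a triple of integer row positions into `df`.

```python
def check_split(train_idx, valid_idx, test_idx, n_samples):
    """Return True if the index sets are pairwise disjoint and in range.

    Hypothetical helper, not part of ChemBench. Assumes the indices are
    integer row positions in [0, n_samples).
    """
    train, valid, test = set(train_idx), set(valid_idx), set(test_idx)
    disjoint = not (train & valid or train & test or valid & test)
    in_range = all(0 <= i < n_samples for i in train | valid | test)
    return disjoint and in_range


# Stand-in split over 10 rows. With ChemBench one would pass an entry of
# `induces` and len(df) instead (an assumption about the returned structure).
example_split = ([0, 1, 2, 3, 4, 5], [6, 7], [8, 9])
print(check_split(*example_split, n_samples=10))
```

If the assumption about the structure holds, the same call with `*induces[0]` and `n_samples=len(df)` would flag any leakage between train, validation and test rows.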
Usage-2: Load Dataset As Data Object
from chembench import dataset
data = dataset.load_ESOL()
data.x
data.y
data.description
# regression
dataset.load_Lipop()
dataset.load_ESOL()
dataset.load_FreeSolv()
dataset.load_Malaria()
dataset.load_LMC()
dataset.load_PDBF()
dataset.load_PDBC()
dataset.load_PDBR()
# classification
dataset.load_BBBP()
dataset.load_BACE()
dataset.load_HIV()
dataset.load_MUV()
dataset.load_Tox21()
dataset.load_SIDER()
dataset.load_CYP450()
dataset.load_ToxCast()
dataset.load_ClinTox()
dataset.load_ChEMBL()
dataset.load_PCBA()
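When several datasets are evaluated in one run, the loaders can be looked up by name instead of being called one by one. This is a minimal sketch that relies only on the `load_<Name>` naming pattern in the list above. `get_loader` is a hypothetical helper, and the stand-in namespace takes the place of `chembench.dataset` so the snippet runs without the package installed.

```python
from types import SimpleNamespace


def get_loader(module, name):
    """Return the `load_<name>` callable from `module`.

    Hypothetical helper, not part of ChemBench; it relies only on the
    `load_<Name>` naming pattern of the loaders listed above.
    """
    loader = getattr(module, f"load_{name}", None)
    if loader is None:
        raise KeyError(f"no loader named load_{name}")
    return loader


# Stand-in for `chembench.dataset`, used here for illustration only.
fake_dataset = SimpleNamespace(load_ESOL=lambda: "ESOL data object")

print(get_loader(fake_dataset, "ESOL")())
```

With the real package, passing the imported `dataset` module in place of `fake_dataset` should behave the same way for any loader name in the list.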
Usage-3: Load Cluster Splits
Cluster split results are also provided. For example, load the cluster splits and random splits for the ESOL dataset:
from chembench import get_cluster_induces
induces1 = get_cluster_induces("ESOL", induces="random_5fcv_5rpts")
induces2 = get_cluster_induces("ESOL", induces="scaffold_5fcv_1rpts")
print(len(induces1))
print(len(induces2))
For example, the chemical space of the ESOL dataset under the 5-fold cluster split:

The Kolmogorov-Smirnov statistic on the distributions of the pairwise groups (clusters):
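For reference, the two-sample Kolmogorov-Smirnov statistic is the largest absolute gap between the empirical cumulative distribution functions of two samples. The sketch below is a plain-Python illustration of that definition, not the code used to produce the figure. Which values are compared between clusters is not stated here, so the inputs are made-up numbers.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to evaluate the gap at every observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Made-up values standing in for one property measured in two clusters.
cluster_1 = [1, 2, 3, 4]
cluster_2 = [3, 4, 5, 6]
print(ks_statistic(cluster_1, cluster_2))  # 0.5
```

A value near 0 means the two groups are distributed similarly, and a value near 1 means they barely overlap.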

Making a Release
After installing the package in development mode and installing tox with pip install tox, the commands for making a new release are contained within the finish environment in tox.ini. Run the following from the shell:
$ tox -e finish
This script does the following:
- Uses BumpVersion to switch the version number in setup.cfg and src/chembench/version.py to not have the -dev suffix
- Packages the code in both a tar archive and a wheel
- Uploads to PyPI using twine. Be sure to have a .pypirc file configured to avoid the need for manual input at this step
- Pushes to GitHub. You'll need to make a release going with the commit where the version was bumped.
- Bumps the version to the next patch. If you made big changes and want to bump the version by minor, you can use tox -e bumpversion minor after.
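The two version changes in this workflow can be pictured with a small sketch. It only illustrates the sequence described above (drop the `-dev` suffix, then move to the next patch) and is not how BumpVersion is implemented. The function names and the `MAJOR.MINOR.PATCH-dev` format are assumptions, as is the next development version carrying the `-dev` suffix again.

```python
def release_version(dev_version):
    """Drop the '-dev' suffix, e.g. '0.1.2-dev' -> '0.1.2'."""
    if not dev_version.endswith("-dev"):
        raise ValueError(f"expected a '-dev' version, got {dev_version!r}")
    return dev_version[: -len("-dev")]


def next_dev_version(released, part="patch"):
    """Bump the patch (or minor) number and re-attach '-dev' (assumed format)."""
    major, minor, patch = (int(p) for p in released.split("."))
    if part == "patch":
        patch += 1
    elif part == "minor":
        minor, patch = minor + 1, 0
    else:
        raise ValueError(f"unsupported part: {part!r}")
    return f"{major}.{minor}.{patch}-dev"


released = release_version("0.1.2-dev")
print(released)                             # 0.1.2
print(next_dev_version(released))           # 0.1.3-dev
print(next_dev_version(released, "minor"))  # 0.2.0-dev
```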