shenwanxiang/ChemBench: MoleculeNet benchmark dataset & MolMapNet dataset

In case you would like to cite this:

1. MolMapNet Dataset

The following datasets are reported in the paper "Out-of-the-Box Deep Learning Prediction of Pharmaceutical Properties by Broadly Learned Knowledge-Based Molecular Representations"; see that paper for details on each dataset.

Data Class	Dataset	No. of Molecules	No. of Tasks	Task Metric	Task Type
Physico-chemical	ESOL Water solubility	1128	1	RMSE	Regression
	FreeSolv Solvation free energy	642	1	RMSE	Regression
	Lipop Lipophilicity	4200	1	RMSE	Regression
Molecular binding	PDBbind-F, PDBbind-C, PDBbind-R Ligand-protein binding: full, core, refined (3 datasets)	9880, 168, 3040	1 for each	RMSE	Regression
Bio-activity	PCBA PubChem HTS bioAssay	437929	128	PRC_AUC	Classification
	MUV PubChem bioAssay	93087	17	PRC_AUC	Classification
	ChEMBL bioassay activity dataset	456331	1310	ROC_AUC	Classification
	Cancer cell-line IC50 A2780, CCRF-CEM, DU-145, HCT-15, KB, LoVo, PC-3, SK-OV-3 (8 datasets)	2255, 3047, 2512, 994, 2731, 1120, 4294, 1589	1 for each	R2	Regression
	Malaria Anti-malarial EC50	9998	1	RMSE	Regression
	BACE-1 benchmark set, ChEMBL novel set, ChEMBL common set, Clinical drugs	1513, 395, 5324, 26	1	ROC_AUC	Classification
	HIV replication inhibition	41127	1	ROC_AUC	Classification
Toxicity	Tox21 Toxicology in the 21st century	7831	12	ROC_AUC	Classification
	SIDER Adverse drug reactions of marketed drugs	1427	27	ROC_AUC	Classification
	ClinTox Clinical trial toxicity	1478	2	ROC_AUC	Classification
Pharmacokinetic	CYP PubChem BioAssay CYP 1A2, 2C9, 2C19, 2D6, 3A4 inhibition	16896	5	ROC_AUC	Classification
	LMC-H, LMC-R, LMC-M (Liver Microsomal Clearance in human, rat, mouse)	8755	3	R2	Regression
	BBBP Blood-brain barrier penetration	2039	1	ROC_AUC	Classification
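
The Task Metric column names the score used for each dataset. As a rough illustration of two of them, the sketch below computes RMSE and ROC_AUC in plain Python. The function names `rmse` and `roc_auc` are hypothetical helpers written for this example, not part of chembench, and the pairwise ROC_AUC is a simple O(n²) formulation rather than whatever implementation the paper used.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error for a single regression task."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return correct / (len(pos) * len(neg))


# Toy labels and predictions, made up for illustration only
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

For multi-task datasets, one common convention is to compute the metric per task and average; whether that matches the paper's protocol is not stated here.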

2. Benchmark Datasets in MolNet and Chemprop

These benchmark datasets and their split indices have been generated in this repo. The following table summarizes them.

task_id	task_name	task_type	n_samples	n_task	split_method	n_cross_split	task_metrics
01	ESOL	regression	1128	1	random	3	RMSE
02	FreeSolv	regression	642	1	random	3	RMSE
03	Lipop	regression	4200	1	random	3	RMSE
04	PDBbind-full	regression	9880	1	time	1	RMSE
05	PDBbind-core	regression	168	1	time	1	RMSE
06	PDBbind-refined	regression	3040	1	time	1	RMSE
07	PCBA	classification	437929	128	random	3	PRC_AUC
08	MUV	classification	93087	17	random	3	PRC_AUC
09	HIV	classification	41127	1	scaffold	3	ROC_AUC
10	BACE	classification	1513	1	scaffold	3	ROC_AUC
11	BBBP	classification	2039	1	scaffold	3	ROC_AUC
12	Tox21	classification	7831	12	random	3	ROC_AUC
13	ToxCast	classification	8576	617	random	3	ROC_AUC
14	SIDER	classification	1427	27	random	3	ROC_AUC
15	ClinTox	classification	1478	2	random	3	ROC_AUC
16	ChEMBL	classification	456331	1310	scaffold	3	ROC_AUC

Installation

Direct installation:

pip install git+https://github.com/shenwanxiang/ChemBench.git

Developer installation:

git clone https://github.com/shenwanxiang/ChemBench.git
cd ChemBench
pip install -e .
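
After either installation route, a quick way to confirm that the package is importable is to ask Python's import machinery for it. This is a small sketch using only the standard library; `is_installed` is a hypothetical helper name, and the import name `chembench` is taken from the usage examples in this README.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


print("chembench installed:", is_installed("chembench"))
```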

Usage-1: Load the Dataset and MoleculeNet's Split Indices

from chembench import load_data
df, induces = load_data('ESOL')

# induces holds the 3 repeated random splits; each entry unpacks into (train, valid, test) indices
train_idx, valid_idx, test_idx = induces[0]
train_idx, valid_idx, test_idx = induces[1]
train_idx, valid_idx, test_idx = induces[2]
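
A typical use of the three splits is to score a model on each one and report the mean and standard deviation. The sketch below shows that loop with a trivial baseline (predicting the training-set mean). The list `y` and the small `induces` structure are hypothetical stand-ins for what `load_data` returns, so the example runs without chembench; the sketch also assumes the indices are positional row indices.

```python
import math
from statistics import mean, stdev

# Hypothetical stand-ins: a target column and three (train, valid, test) index triples.
# With chembench installed, these would come from load_data instead.
y = [0.5, 1.2, -0.3, 2.1, 0.9, 1.7, -1.0, 0.2, 1.1, 0.4]
induces = [
    ([0, 1, 2, 3, 4, 5], [6, 7], [8, 9]),
    ([2, 3, 4, 5, 6, 7], [8, 9], [0, 1]),
    ([4, 5, 6, 7, 8, 9], [0, 1], [2, 3]),
]


def baseline_rmse(y, train_idx, test_idx):
    """Predict the training-set mean for every test row and return the test RMSE."""
    pred = mean(y[i] for i in train_idx)
    return math.sqrt(mean((y[i] - pred) ** 2 for i in test_idx))


# One score per split; the validation indices are unused by this baseline
scores = [
    baseline_rmse(y, train_idx, test_idx)
    for train_idx, _valid_idx, test_idx in induces
]
print(f"test RMSE over {len(scores)} splits: {mean(scores):.3f} +/- {stdev(scores):.3f}")
```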

Usage-2: Load Dataset As Data Object

from chembench import dataset
data = dataset.load_ESOL()
data.x
data.y
data.description


# regression
dataset.load_Lipop()
dataset.load_ESOL()
dataset.load_FreeSolv()
dataset.load_Malaria()
dataset.load_LMC()
dataset.load_PDBF()
dataset.load_PDBC()
dataset.load_PDBR()


# classification
dataset.load_BBBP()
dataset.load_BACE()
dataset.load_HIV()
dataset.load_MUV()
dataset.load_Tox21()
dataset.load_SIDER()
dataset.load_CYP450()
dataset.load_ToxCast()
dataset.load_ClinTox()
dataset.load_ChEMBL()
dataset.load_PCBA()

Usage-3: Load Cluster Splits

The cluster split results can be loaded with get_cluster_induces. For example, to load the cluster splits and random splits for the ESOL dataset:

from chembench import get_cluster_induces
induces1 = get_cluster_induces("ESOL", induces = "random_5fcv_5rpts")
induces2 = get_cluster_induces("ESOL", induces = "scaffold_5fcv_1rpts")
print(len(induces1))
print(len(induces2))
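
For readers unfamiliar with cluster-based cross-validation, the sketch below shows one common way to build k folds so that all members of a cluster land in the same fold. It is an illustration only: the greedy assignment, the helper name `cluster_kfold`, and the toy cluster labels are assumptions, and this is not necessarily how the splits returned by `get_cluster_induces` were produced.

```python
from collections import defaultdict


def cluster_kfold(cluster_ids, n_folds=5):
    """Assign whole clusters to folds, then return (train_idx, test_idx) per fold."""
    members = defaultdict(list)
    for row, cluster in enumerate(cluster_ids):
        members[cluster].append(row)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest clusters first, each into the smallest fold
    for cluster in sorted(members, key=lambda c: len(members[c]), reverse=True):
        min(folds, key=len).extend(members[cluster])

    splits = []
    for k in range(n_folds):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        splits.append((train_idx, test_idx))
    return splits


# Toy cluster labels for 11 molecules, made up for illustration
cluster_ids = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5]
for train_idx, test_idx in cluster_kfold(cluster_ids, n_folds=5):
    print(len(train_idx), len(test_idx))
```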

For example, the chemical space of the ESOL dataset under the 5-fold cluster split:

[Figure: ESOL split chemical space]

The Kolmogorov-Smirnov statistic on the distributions of the pairwise groups (clusters):

[Figure: ESOL split distribution test]
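
As a reminder of what the two-sample Kolmogorov-Smirnov statistic measures, the sketch below computes it in plain Python as the largest gap between the empirical CDFs of two groups. The helper name `ks_statistic` and the toy values are assumptions for illustration; in practice `scipy.stats.ks_2samp` returns the same statistic along with a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum absolute difference between empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    # The CDFs only change at observed values, so checking those points is enough
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in points
    )


# Toy target values for two groups (clusters), made up for illustration
group_1 = [-3.1, -2.4, -1.8, -0.9, 0.2]
group_2 = [-2.9, -2.0, -1.1, -0.5, 0.6]
print(ks_statistic(group_1, group_2))
```

A value near 0 suggests the two groups have similar target distributions, while a value near 1 suggests they barely overlap.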

Making a Release

After installing the package in development mode and installing tox with pip install tox, the commands for making a new release are contained within the finish environment in tox.ini. Run the following from the shell:

$ tox -e finish

This script does the following:

1. Uses BumpVersion to switch the version number in setup.cfg and src/chembench/version.py to not have the -dev suffix.
2. Packages the code in both a tar archive and a wheel.
3. Uploads to PyPI using twine. Be sure to have a .pypirc file configured to avoid the need for manual input at this step.
4. Pushes to GitHub. You'll need to make a release going with the commit where the version was bumped.
5. Bumps the version to the next patch. If you made big changes and want to bump the version by minor, you can use tox -e bumpversion minor after.
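
The version changes in the first and last steps can be pictured with the small sketch below. It is only an illustration of the string transformations, not BumpVersion itself: the helper names and the example version `0.1.0-dev` are hypothetical, and the sketch assumes the post-release version gets the -dev suffix back.

```python
def release_version(dev_version, suffix="-dev"):
    """Drop the development suffix, e.g. for the commit that gets released."""
    if dev_version.endswith(suffix):
        return dev_version[: -len(suffix)]
    return dev_version


def next_patch_dev(version, suffix="-dev"):
    """Increment the patch number and re-attach the development suffix (assumed)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}{suffix}"


current = "0.1.0-dev"  # hypothetical starting version
released = release_version(current)
print(released)
print(next_patch_dev(released))
```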
