Neuraldecipher
Implementation of the paper "Neuraldecipher - Reverse-engineering extended-connectivity fingerprints (ECFPs) to their molecular structures" by Tuan Le, Robin Winter, Frank Noé and Djork-Arné Clevert [1].

Installation
Prerequisites:
- python==3.6.10
- rdkit==2020.03.2
- numpy==1.18.1
- tqdm==4.46.1
- h5py==2.10.0
- jupyter==1.0.0
Conda
Create a new environment:
git clone URL
cd neuraldecipher
conda env create -f environment.yml
conda activate neuraldecipher
Install pytorch==1.4.0 (GPU with CUDA 10, or CPU):
conda install pytorch==1.4.0 torchvision==0.5.0 -c pytorch # GPU
# conda install pytorch==1.4.0 torchvision cpuonly -c pytorch # CPU
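A quick sanity check (not part of the repository) confirms that the installed PyTorch build is the expected one and can see the GPU:

# Sanity check for the PyTorch installation; run inside the neuraldecipher environment.
import torch
print(torch.__version__)          # expected: 1.4.0
print(torch.cuda.is_available())  # True on a working CUDA 10 setup, False for CPU-only installs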
Dependency for encoding and decoding SMILES representations
- cddd
To complete the reverse-engineering workflow, the decoder network from Winter et al. (see Workflow) is needed in the final evaluation. Note that it suffices to clone the cddd repository and start from the installation of tensorflow-gpu==1.10.0 without creating the environment. It is important to have the cddd module installed within the neuraldecipher environment for later inference. To use tensorboard with PyTorch, remove the tensorboard==1.10.0 that comes with the cddd dependencies:
pip uninstall tensorboard
pip install tensorboard==1.14.0
We included this workaround to still be able to use the CDDD inference server and tensorboard to log the training of the Neuraldecipher.
The CDDD server is also needed to compute the CDDD vector representation from the SMILES to train the Neuraldecipher.
We provide a Jupyter Notebook in source/get_cddd.ipynb to compute the CDDD representations from the ChEMBL25 dataset.
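As a rough orientation, the computation boils down to the following sketch. It assumes the cddd package exposes InferenceModel with a seq_to_emb method, as in the public cddd repository; the notebook remains the authoritative workflow.

# Minimal sketch; assumes cddd.inference.InferenceModel as in the public
# cddd repository. See source/get_cddd.ipynb for the actual workflow.
from cddd.inference import InferenceModel

model = InferenceModel()                 # loads the pretrained CDDD model
smiles = ["CCO", "c1ccccc1"]             # toy input
cddd_vectors = model.seq_to_emb(smiles)  # one 512-dimensional vector per molecule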
Repository structure
The repository consists of several subdirectories:
- data consists of the training and test data.
- logs consists of the tensorboard log files for each training run.
- params consists of the json parameter files for each run. See example.
- models consists of the saved models. In case the Neuraldecipher was trained on bit-ECFPs, the results are saved in models/bits_results. Otherwise the models are saved in models.
- source consists of all necessary python scripts for execution.
The provided data consists of:
- data/smiles.npy: List of SMILES from the filtered ChEMBL25 database saved as numpy array.
- data/smiles_temporal.npy: List of temporal SMILES from the filtered ChEMBL26 database saved as numpy array.
- data/cluster.npy: List of cluster assignments for the smiles.npy array. This array is needed to create train and validation datasets.
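The arrays can be loaded directly with numpy, for example:

# Load the provided data arrays (object arrays of strings require allow_pickle).
import numpy as np

smiles = np.load("data/smiles.npy", allow_pickle=True)     # SMILES strings (ChEMBL25)
clusters = np.load("data/cluster.npy", allow_pickle=True)  # one cluster assignment per SMILES
assert len(smiles) == len(clusters)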
Getting started
Computing several extended-connectivity fingerprints (ECFPs) depending on length k and bond diameter d
The python script in source/get_ecfp.py computes the extended-connectivity fingerprints.
The options for the script are the following:
--all: Boolean flag whether or not all ECFP configurations as described in the paper [1] should be computed. Defaults to False. In this case only the ECFP with bond diameter d=6 and fingerprint size k=1024 is computed for the binary and count representations.
--nworkers: Integer number of parallel cpu-workers used to compute the ECFP representations. Defaults to 1. To speed up the computation, it is recommended to use more workers.
Execution:
python source/get_ecfp.py -h # in order to see the information for the arguments
python source/get_ecfp.py --all False --nworkers 10 # only compute one ECFP setting and use 10 cpus for multiprocessing
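For a single molecule, the underlying computation is essentially the standard RDKit Morgan fingerprint; the following is a sketch, and the script's exact options may differ. Note that a bond diameter d corresponds to a Morgan radius of d/2.

# Sketch of one ECFP6 computation with RDKit (bond diameter d=6 -> radius 3,
# folded to k=1024 bits); source/get_ecfp.py may use different options.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")
bit_fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=1024)  # binary ECFP
count_fp = AllChem.GetHashedMorganFingerprint(mol, radius=3, nBits=1024)   # count ECFP
bit_array = np.array(bit_fp)  # numpy array of shape (1024,) with values in {0, 1}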
Computing CDDD representations
The Jupyter Notebook in source/get_cddd.ipynb shows how to generate CDDD representations from the data/smiles.npy array.
Training the Neuraldecipher model
The python script in source/main.py executes the training of the Neuraldecipher.
The options for the script are the following:
--config: String to the params.json file that contains the information for the Neuraldecipher network architecture and training settings. Defaults to params/1024_config_bit_gpu.json.
--split: String to select whether the cluster or random split should be used (see reference [1] for details). Defaults to cluster.
--workers: Integer number of parallel cpu-workers for the dataloader. Defaults to 5.
--cosineloss: Boolean flag whether or not the cosine loss should be used during training. Defaults to False. This flag can be set to True to additionally add the cosine similarity loss next to the difference loss (e.g. L2 or log-cosh); see the sketch after the execution commands below.
Execution:
python source/main.py -h # in order to see the information for the arguments
python source/main.py --config params/1024_config_bit_gpu.json --split cluster --workers 5 --cosineloss False
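The --cosineloss option corresponds to an objective of roughly the following form; this is a sketch, and the exact weighting in source/main.py may differ.

# Sketch of a difference loss (here log-cosh) combined with a cosine similarity loss.
import torch
import torch.nn.functional as F

def combined_loss(pred, target):
    # Difference loss, here log-cosh as one of the options mentioned above.
    logcosh = torch.log(torch.cosh(pred - target)).mean()
    # Cosine similarity loss: 1 - cos(pred, target), averaged over the batch.
    cosine = (1.0 - F.cosine_similarity(pred, target, dim=1)).mean()
    return logcosh + cosine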
Monitoring the training
Since tensorflow-gpu==1.10.0 is installed within the neuraldecipher environment, we cannot run tensorboard==1.14.0 from within the neuraldecipher environment. We merely included tensorboard==1.14.0 in the neuraldecipher environment to log the training of our Neuraldecipher.
To monitor the training, please create a new environment tb and install tensorflow==1.14.0 (CPU version), which includes tensorboard==1.14.0 in its installation.
conda create -n tb python=3.6.10 tensorflow==1.14.0
conda activate tb
Run the tensorboard command in a new shell (here serving on localhost:8888):
tensorboard --logdir logs/ --port 8888 --host localhost
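The files under logs/ are ordinary tensorboard event files; conceptually they are written as in the sketch below, assuming torch.utils.tensorboard.SummaryWriter (shipped with pytorch 1.4.0). The actual logging lives in the training code.

# Illustration of how runs end up under logs/; the run name is hypothetical.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="logs/example_run")
for step in range(100):
    writer.add_scalar("loss/train", 1.0 / (step + 1), step)  # dummy value
writer.close()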
Evaluating the trained model
We provide the model weights for the model trained on ECFP6 representations of length 1024 with the cluster split, and show the performance on the cluster validation dataset and the temporal dataset in the Notebook source/evaluation.ipynb.
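Conceptually, the full reverse-engineering step in the notebook looks like the following sketch. The checkpoint path and shapes are illustrative assumptions (1024-bit ECFP in, 512-dim CDDD out), and emb_to_seq is assumed from the public cddd repository.

# Sketch: ECFP -> Neuraldecipher -> CDDD -> CDDD decoder -> SMILES.
import torch
from cddd.inference import InferenceModel

neuraldecipher = torch.load("models/example_model.pt")  # hypothetical checkpoint path
neuraldecipher.eval()
decoder = InferenceModel()                              # CDDD decoder network

with torch.no_grad():
    ecfp = torch.rand(1, 1024).round()                  # dummy binary ECFP input
    cddd_pred = neuraldecipher(ecfp).numpy()            # predicted 512-dim CDDD vector
smiles_out = decoder.emb_to_seq(cddd_pred)              # decoded SMILES string(s)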
References
[1] T. Le, R. Winter, F. Noé and D.-A. Clevert, Chem. Sci., 2020, DOI: 10.1039/D0SC03115A