PCMol
PCMol copied to clipboard
Multi-target de novo molecular generator conditioned on AlphaFold's latent protein embeddings.
PCMol
A multi-target model for de novo molecule generation. By using the internal protein representations of the AlphaFold[1] model, a single SMILES-based transformer can generate relevant molecules for thousands of protein targets (embeddings are available for 4,331 proteins).
The model was trained using bioactivity data from the Papyrus[2] dataset (661,613 unique protein-ligand pairs in total, 6,249,253 after augmentation).

Paper & Authors
The preprint is available on ChemRxiv:
https://chemrxiv.org/engage/chemrxiv/article-details/65d47632e9ebbb4db9c63988

Installation
1. Setup script (recommended)
The setup script will install the required dependencies and download the pretrained model.
git clone https://github.com/CDDLeiden/pcmol.git && cd pcmol
chmod +x setup.sh
bash setup.sh
2. Conda (alternative)
The conda route requires the user to download the pretrained model manually (link below).
# Setting up a fresh conda environment
git clone https://github.com/CDDLeiden/pcmol.git && cd pcmol
conda env create -f environment.yml && conda activate pcmol
python -m pip install -e .
Pretrained model
*When not using the setup script, the pretrained model can be downloaded from here (mirror). It should then be placed in the .../pcmol/data/models folder.
Generating molecules for a particular target
1. Using a script (conda route)
# Run the model on a single target using Accession ID (generates 10 SMILES strings)
conda activate pcmol
python pcmol/generate.py --target P29275
# If GPU is not available
python pcmol/generate.py --target P29275 --device cpu
If available, the appropriate AlphaFold2 embeddings to be used as input to the model will be downloaded automatically. The generated molecules are saved in the data/results folder.
2. Calling the generator directly
To generate molecules for a particular target, the Runner class can be used directly. The generate_smiles method returns a list of SMILES strings for a target protein specified by its Accession ID.
from pcmol import Runner
model = Runner(model="XL")
SMILES = model.targetted_generation(target="P29275", num_mols=100)
List of supported protein targets
The model currently depends on the availability of AlphaFold2 embeddings for the target protein. The list of supported targets can be found in the data/targets.txt file.
References
[1]: Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., ... & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589.
[2]: Béquignon, O. J., Bongers, B. J., Jespers, W., IJzerman, A. P., van der Water, B., & van Westen, G. J. (2023). Papyrus: a large-scale curated dataset aimed at bioactivity predictions. Journal of cheminformatics, 15(1), 3.