
Can I find code for LIT-PCBA dataset's 3D coordinates generation?

Open Sangyeup opened this issue 1 year ago • 1 comments

Hi, did you test the GEM-2 model on LIT-PCBA by generating 3D coordinates from SMILES strings?

If so, where can I find the code for it?

Thank you.

Sangyeup, Oct 24 '22 12:10

Hi Sangyeup, we are still organizing the training code for LIT-PCBA and will publish it later. In the meantime, you can:

  1. Implement a LitPCBADataset class, using this as a reference: https://github.com/PaddlePaddle/PaddleHelix/blob/02cbefee527acfc979913be178d083518590da90/apps/pretrained_compound/ChemRL/GEM-2/src/dataset.py#L33
  2. Replace the PCQM4Mv2 dataset with the newly implemented LitPCBADataset in the function load_data: https://github.com/PaddlePaddle/PaddleHelix/blob/02cbefee527acfc979913be178d083518590da90/apps/pretrained_compound/ChemRL/GEM-2/train_gem2.py#L112
  3. Add a LIT-PCBA dataset config to the folder configs/dataset_configs. It needs to specify where the raw LIT-PCBA dataset is located, as pcqmv2.json does.
  4. Run train_gem2.py to generate the 3D data and train GEM-2 on LIT-PCBA. Note that the processed data is stored in the data_cache_dir that you pass to the script.

Hope this helps.
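As a rough starting point for step 1, here is a minimal sketch of what a LitPCBADataset could look like. It is not the repository's implementation: the class layout, the helper names read_smi and smiles_to_3d, and the assumption that each LIT-PCBA target folder holds actives.smi and inactives.smi with one "SMILES ID" entry per line are all illustrative. The 3D step uses RDKit's ETKDG embedding followed by MMFF optimization as one common choice; whether that matches what dataset.py does for PCQM4Mv2 is an assumption, so check the referenced file before relying on it.

```python
import os


def read_smi(path, label):
    """Read a .smi file with one 'SMILES [ID]' entry per line (assumed layout)."""
    records = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            mol_id = parts[1] if len(parts) > 1 else None
            records.append({"smiles": parts[0], "id": mol_id, "label": label})
    return records


def smiles_to_3d(smiles, seed=0):
    """Embed a single conformer with RDKit and return (atom symbols, coordinates).

    Returns None if the SMILES cannot be parsed or embedding fails.
    RDKit is imported lazily so the loader itself works without it.
    """
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = Chem.AddHs(mol)  # explicit hydrogens give more sensible geometries
    if AllChem.EmbedMolecule(mol, randomSeed=seed) < 0:
        return None  # embedding failed for this molecule
    AllChem.MMFFOptimizeMolecule(mol)  # quick force-field relaxation
    symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
    return symbols, mol.GetConformer().GetPositions()


class LitPCBADataset:
    """Hypothetical dataset: actives get label 1, inactives get label 0."""

    def __init__(self, target_dir):
        actives = read_smi(os.path.join(target_dir, "actives.smi"), 1)
        inactives = read_smi(os.path.join(target_dir, "inactives.smi"), 0)
        self.records = actives + inactives

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        return self.records[idx]
```

With RDKit installed, calling smiles_to_3d(ds[i]["smiles"]) on each record produces the coordinates. To plug this into train_gem2.py, the class would still need to expose whatever interface load_data expects from the PCQM4Mv2 dataset.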

Noisyntrain, Oct 26 '22 07:10