
`MoleculeGPT`: Dataset+Model+Unit tests+Example

puririshi98 opened this issue 1 year ago

🚀 The feature, motivation and pitch

Paper: https://ai4d3.github.io/papers/34.pdf

Part of the community sprint https://github.com/pyg-team/pytorch_geometric/issues/9694

The goal of this project is to reproduce the work done in MoleculeGPT while tying it as closely as possible to the existing GNN+LLM frameworks in PyG. We recommend reusing as many existing PyG features as possible. Additional features that you feel will be reusable for other workflows should be added to PyG itself; one-off functions specific to this workflow can stay inside the example. Most of the effort will likely go into building a PyG dataset that matches the one described in the paper. At a high level, the dataset is a collection of Q+A pairs for the molecular domain, with matching molecules as context. These Q+A pairs focus on molecular property prediction.
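To make the dataset target more concrete, here is a rough sketch of what such a dataset class could look like (assuming a recent PyG version; the class name `MoleculeGPTDataset`, the raw CSV layout, and the column names are placeholders for illustration, not the final design):

```python
import pandas as pd
from torch_geometric.data import InMemoryDataset
from torch_geometric.utils import from_smiles


class MoleculeGPTDataset(InMemoryDataset):
    """Hypothetical <SMILES, instruction, response> dataset sketch."""
    def __init__(self, root, transform=None, pre_transform=None):
        super().__init__(root, transform, pre_transform)
        self.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        # Assumed raw file holding SMILES/instruction/response columns:
        return ['qa_pairs.csv']

    @property
    def processed_file_names(self):
        return ['data.pt']

    def process(self):
        df = pd.read_csv(self.raw_paths[0])
        data_list = []
        for _, row in df.iterrows():
            # Convert the SMILES string into a 2D molecular graph (requires rdkit).
            data = from_smiles(row['smiles'])
            # Attach the text pair so the example script can build the LLM prompt.
            data.instruction = row['instruction']
            data.response = row['response']
            data_list.append(data)
        self.save(data_list, self.processed_paths[0])
```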

Alternatives

No response

Additional context

No response

puririshi98 avatar Oct 08 '24 17:10 puririshi98

I'd like to contribute to this one. I've listed what needs to be done below; the details need some discussion ^^

Dataset

  • Format: <SMILES, Instruction, Response>
  • Is there any existing dataset, or do we need to extract and clean the data from PubChem from scratch?

Model

  • 2D Graph Branch
    • GraphMVP uses GIN for 2D and SchNet for 3D, so I think we can use GINConv directly
    • QFormer: Implement torch_geometric.nn.attention.qformer
  • 1D Graph Branch
    • ChemBERTa-2: Use torch_geometric.nn.nlp.llm
    • QFormer: Implement torch_geometric.nn.attention.qformer
  • LLM
    • vicuna-7B-v1.5: Use torch_geometric.nn.nlp.llm
    • Not clear how to feed the fused 1D+2D embeddings and the instructions into the LLM (see the sketch below)
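Here is a very rough sketch of how the 2D branch could be wired together. The `QFormer` module below is a minimal stand-in built with plain `torch.nn` cross-attention (not a proposal for the final torch_geometric.nn.attention.qformer API), and prepending the projected graph tokens to the instruction embeddings is only one assumption about how the fusion could work:

```python
import torch
from torch import nn
from torch_geometric.nn import GINConv


class QFormer(nn.Module):
    """Minimal Q-Former stand-in: learned queries cross-attend to encoder tokens."""
    def __init__(self, num_queries, hidden_dim, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, tokens):  # tokens: [batch, seq_len, hidden_dim]
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)
        return out  # [batch, num_queries, hidden_dim]


class MoleculeGPTSketch(nn.Module):
    def __init__(self, node_dim, hidden_dim, llm_dim, num_queries=32):
        super().__init__()
        # 2D graph branch: GIN encoder, as in GraphMVP's 2D model.
        self.gnn = GINConv(nn.Sequential(nn.Linear(node_dim, hidden_dim),
                                         nn.ReLU(),
                                         nn.Linear(hidden_dim, hidden_dim)))
        self.graph_qformer = QFormer(num_queries, hidden_dim)
        # Projection from Q-Former output into the LLM embedding space.
        self.proj = nn.Linear(hidden_dim, llm_dim)

    def forward(self, x, edge_index, instruction_embeds):
        h = self.gnn(x, edge_index)                      # node embeddings
        # Treat the nodes of one molecule as a token sequence for the Q-Former
        # (for simplicity this sketch assumes a single molecule per forward pass).
        tokens = h.unsqueeze(0)                          # [1, num_nodes, hidden_dim]
        mol_tokens = self.proj(self.graph_qformer(tokens))
        # Prepend the molecule tokens to the instruction token embeddings and
        # hand the result to the LLM (e.g. via `inputs_embeds` in HF models).
        return torch.cat([mol_tokens, instruction_embeds], dim=1)
```

The 1D SMILES branch would follow the same pattern, with ChemBERTa token embeddings feeding its own Q-Former in place of the GIN node embeddings.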

xnuohz avatar Oct 12 '24 10:10 xnuohz

Is there any existing dataset, or do we need to extract and clean the data from PubChem from scratch?

Hey @xnuohz, sorry for the delay! I just had a quick look at the paper, and it looks like they haven't published the code or the dataset that they curated for it. As a general goal, we should aim to reproduce the results from the paper by re-implementing the dataset, preprocessing, and model, together with an example script.

We can also discuss this in PyG Slack :)

(cc'ing @puririshi98 for when he's back)

akihironitta avatar Oct 15 '24 09:10 akihironitta

@xnuohz They seem to follow the data preprocessing steps from https://github.com/chao1224/MoleculeSTM/tree/main/data, as described in Section 3.2. Also, the 1D Graph Branch should be a 1D SMILES Branch, which uses an encoder designed for SMILES strings: https://github.com/seyonechithrananda/bert-loves-chemistry
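For reference, encoding a SMILES string with a ChemBERTa checkpoint from that project could look roughly like this (the checkpoint name below is one of the models published under it and is an assumption here; a ChemBERTa-2 checkpoint would follow the same pattern):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint name is an assumption; any ChemBERTa(-2) checkpoint should work the same way.
checkpoint = 'seyonec/ChemBERTa-zinc-base-v1'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

smiles = 'CCO'  # ethanol
inputs = tokenizer(smiles, return_tensors='pt')
with torch.no_grad():
    out = model(**inputs)

# Per-token embeddings of the SMILES string, to be fed into the 1D branch's Q-Former.
smiles_tokens = out.last_hidden_state  # [1, seq_len, hidden_dim]
```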

zechengz avatar Oct 15 '24 21:10 zechengz

Is there any existing dataset?

We have a text-based molecule generation preprint coming out soon. Our version of the PubChem molecule-text pairs dataset is already public. Would that be useful?

One finding in our preprint is that a lot of the text descriptions in PubChem are quite generic and therefore a poor choice for evaluating text-conditioned molecule generation. Maybe they are still okay for the MoleculeGPT instruction-response format. I'm still wary.

agitter avatar Oct 25 '24 03:10 agitter

Hi @agitter, sorry for the late reply. It should be useful, and it would be worth creating another PyG dataset from it. For this MoleculeGPT paper, though, we should try to follow the original data generation method as closely as possible.

xnuohz avatar Oct 27 '24 14:10 xnuohz

Closing this as the PR has been merged.

puririshi98 avatar Nov 25 '24 05:11 puririshi98