Is it possible to release the GEOMDrugsDataset processed files?
Hello,
I'm trying to use MiDi to generate molecules with the model trained on GEOM with explicit H. The trained model requires dataset_infos as input, which in turn needs the datamodule to get the statistics. However, my machine does not have enough RAM to load the GEOM training set from the pickle file you provide. Having the processed files for GEOMDrugsDataset would let me skip the process() function (which runs when the processed files don't exist), and I assumed these files would be lighter than the full pickle file containing the molecules. Could you provide them? Or, if you see another workaround (e.g. storing the statistics/configuration required for dataset_infos in separate files that do not always require the datamodule), please let me know.
Thank you very much,
Best, Benoit
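To make the second workaround in the question concrete, here is a minimal sketch of caching the statistics in a small JSON file so that they can be reloaded without the datamodule. It is not MiDi's actual code: the function names, the JSON layout and the "list of atom symbols" molecule format are all assumptions chosen for illustration, and the real dataset_infos needs more fields than these two.

```python
import json
from collections import Counter
from pathlib import Path


def compute_statistics(molecules):
    """Collect simple counts from an iterable of molecules.

    Each molecule is assumed to be a list of atom symbols
    (a hypothetical format, not the one MiDi uses internally).
    """
    atom_types = Counter()
    num_nodes = Counter()
    for atoms in molecules:
        atom_types.update(atoms)       # how often each element occurs
        num_nodes[len(atoms)] += 1     # histogram of molecule sizes
    return {
        "atom_types": dict(atom_types),
        # JSON object keys must be strings, so sizes are stored as str
        "num_nodes": {str(n): c for n, c in num_nodes.items()},
    }


def save_statistics(stats, path):
    """Write the statistics once, on a machine with enough RAM."""
    Path(path).write_text(json.dumps(stats, indent=2, sort_keys=True))


def load_statistics(path):
    """Reload the statistics without touching the dataset itself."""
    stats = json.loads(Path(path).read_text())
    # restore integer molecule sizes that JSON stored as strings
    stats["num_nodes"] = {int(n): c for n, c in stats["num_nodes"].items()}
    return stats
```

The idea is that compute_statistics and save_statistics run once where the full training set fits in memory, and only the resulting small file has to be shipped to the machine that samples from the model.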
@cvignac I am in the same position
Hello, unfortunately the processed file is much heavier than the raw one, which is why I included the raw file.
Another problem that you might encounter is not being able to load the whole processed dataset at once. The dataset currently uses the InMemoryDataset class of PyTorch Geometric, and this would probably need to be changed.
I will think about ways to fix this, but I won't have much time in the coming days.
Clement
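For readers wondering what moving away from InMemoryDataset could look like, the sketch below shows the general pattern of storing one sample per file and loading it on demand. It uses only the standard library; the class name, helper and file naming are hypothetical. PyTorch Geometric's non-in-memory Dataset base class follows the same idea through its len() and get(idx) methods, but this is not a drop-in replacement for the MiDi dataset.

```python
import pickle
from pathlib import Path


class LazyPickleDataset:
    """Loads one sample per file on demand instead of holding all in RAM."""

    def __init__(self, root):
        # only the file paths are kept in memory, not the samples
        self.files = sorted(Path(root).glob("*.pkl"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        with open(self.files[idx], "rb") as f:
            return pickle.load(f)


def write_samples(samples, root):
    """Store each sample in its own pickle file (hypothetical layout)."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    for i, sample in enumerate(samples):
        # zero-padded names keep sorted() order equal to insertion order
        with open(root / f"sample_{i:08d}.pkl", "wb") as f:
            pickle.dump(sample, f)
```

The trade-off is more disk reads per batch in exchange for memory use that no longer grows with the size of the dataset.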
Hi @cvignac and @nichrun, I found a workaround. I was able to load the pickle file on my university's HPC and extract the lists of SMILES for the train, validation and test set molecules. I then modified the code to read these molecules from the raw GEOM rdkit_folder.
GEOM tutorial: https://github.com/learningmatter-mit/geom
GEOM download: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/JNGTDF
Place the following 3 files in data/raw, corresponding to the InMemoryDataset paths:
train_smiles.txt
val_smiles.txt
test_smiles.txt
Rename geom_dataset.txt to geom_dataset.py and place it in src/data/. Adjust the paths in the files to indicate where your GEOM data are stored.
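As a rough illustration of the lookup this workaround relies on, the sketch below reads a split file and fetches the matching records from the GEOM rdkit_folder. It is not the attached geom_dataset.py. It assumes one SMILES per line in the split files, and it assumes the layout described in the GEOM tutorial, where summary_drugs.json maps each SMILES to an entry with a "pickle_path" relative to rdkit_folder. All function names are hypothetical.

```python
import json
import pickle
from pathlib import Path


def read_smiles(path):
    """Return the non-empty lines of a split file, one SMILES per line."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def load_summary(path):
    """Load the SMILES -> metadata mapping (assumed summary_drugs.json)."""
    with open(path) as f:
        return json.load(f)


def iter_split(smiles_list, summary, rdkit_folder):
    """Yield (smiles, record) for each SMILES that has a pickle on disk."""
    root = Path(rdkit_folder)
    for smiles in smiles_list:
        entry = summary.get(smiles)
        if entry is None or "pickle_path" not in entry:
            continue  # molecule missing from the summary: skip it
        pickle_file = root / entry["pickle_path"]
        if not pickle_file.exists():
            continue  # summary entry without a file on disk: skip it
        with open(pickle_file, "rb") as f:
            yield smiles, pickle.load(f)
```

Because iter_split is a generator, only one molecule record is held in memory at a time. Unpickling real GEOM records requires RDKit to be installed, since they contain RDKit molecule objects.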