
[Feature Request] Faster loading of csv files into MoleculeDataset objects

Open · manangoel99 opened this issue 2 years ago · 1 comment

In the current implementation of the load_csv and load_smiles methods of MoleculeDataset, the SMILES strings are converted to graphs and molecule objects sequentially. This could be made significantly faster with a library like joblib, which supports parallel processing.

load_csv or load_smiles could take an additional argument, n_jobs, which would default to 1 and let the user specify how many parallel jobs to run.
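A minimal sketch of what such an n_jobs argument could look like. It uses the standard-library concurrent.futures module rather than joblib so that it runs without extra dependencies. The names parse_smiles and convert_all, and the executor_cls and chunksize parameters, are hypothetical and not part of torchdrug; parse_smiles only stands in for the real SMILES-to-molecule conversion.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def parse_smiles(smiles):
    """Hypothetical stand-in for the per-molecule conversion step."""
    # Real code would build a molecule/graph object here; this placeholder
    # only records the string and its length so the sketch stays runnable.
    return {"smiles": smiles, "length": len(smiles)}


def convert_all(smiles_list, n_jobs=1, executor_cls=ProcessPoolExecutor, chunksize=64):
    """Convert every SMILES string, optionally across n_jobs workers."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be >= 1")
    if n_jobs == 1:
        # Default: plain sequential loop, no pool is created.
        return [parse_smiles(s) for s in smiles_list]
    with executor_cls(max_workers=n_jobs) as pool:
        # Executor.map yields results in input order, so any per-molecule
        # targets stay aligned with the converted items.
        return list(pool.map(parse_smiles, smiles_list, chunksize=chunksize))
```

With the default process pool, the worker function has to be picklable (defined at module level), and on platforms that spawn worker processes the call should sit under an `if __name__ == "__main__":` guard. Passing executor_cls=ThreadPoolExecutor swaps in threads, which is mainly useful for quick checks, since CPU-bound Python code gains little from threads.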

I would like to make this contribution if the authors approve!

manangoel99 · Jan 24 '22 10:01

Hi! We don't have a parallel implementation, but we do have a workaround for loading large datasets. Many datasets in our library have a lazy mode, where only the SMILES strings are loaded (which takes almost no time) and the molecules are constructed on the fly in PyTorch dataloaders. You can then use the multiprocessing support of PyTorch dataloaders to parallelize the construction.
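The lazy pattern described here can be sketched roughly as follows. This is not torchdrug's actual lazy mode: LazySmilesDataset and toy_build are hypothetical names, and toy_build only stands in for real molecule construction. The class is a plain Python object with __len__ and __getitem__, which is the shape PyTorch expects of a map-style dataset.

```python
class LazySmilesDataset:
    """Map-style dataset that stores SMILES strings and builds items on access."""

    def __init__(self, smiles_list, build_fn):
        self.smiles_list = list(smiles_list)  # cheap: only strings are kept
        self.build_fn = build_fn              # hypothetical per-molecule constructor

    def __len__(self):
        return len(self.smiles_list)

    def __getitem__(self, index):
        # Construction is deferred to access time, so a data loader with
        # several worker processes would spread this cost across them.
        return self.build_fn(self.smiles_list[index])


def toy_build(smiles):
    """Placeholder for real molecule construction."""
    return {"smiles": smiles, "length": len(smiles)}


dataset = LazySmilesDataset(["CCO", "c1ccccc1"], toy_build)
```

An object like this could be handed to torch.utils.data.DataLoader with num_workers set above 0, so that each worker process calls __getitem__ for its share of the indices. A collate function suited to the item type would likely be needed as well.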

I would be happy to see a more straightforward parallel implementation in load_csv, since it would be more intuitive than the workaround above.

KiddoZhu · Jan 27 '22 02:01