
[Feature Request] The program runs extremely slowly when using `ProteinDataset.load_pdbs()`


When I use the EnzymeCommission class to construct a protein dataset (about 18K proteins), the load_pdbs() function executes very slowly (around two hours). I'd like to know whether the code could be refactored to support multiprocessing.
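For reference, here is a minimal sketch of the kind of multiprocessing I have in mind. It assumes torchdrug's data.Protein.from_pdb() is safe to call in worker processes and that Protein objects are picklable; parse_pdb and load_pdbs_parallel are hypothetical names, not existing API:

```python
from multiprocessing import Pool

from torchdrug import data


def parse_pdb(pdb_file):
    # Parse a single PDB file into a Protein object (hypothetical wrapper)
    return data.Protein.from_pdb(pdb_file)


def load_pdbs_parallel(pdb_files, num_workers=8):
    # imap streams results back while preserving input order;
    # chunksize amortizes inter-process communication overhead
    with Pool(num_workers) as pool:
        return list(pool.imap(parse_pdb, pdb_files, chunksize=32))
```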

Besides, the save_pickle function in this class also runs slowly: it takes about 40 minutes to save the pkl.gz file. When I instead use pickle.dump() directly to save a tuple containing the pdb_files, data, and sequences, it finishes in less than one minute, BUT the saved file is about 8 times larger than with the current method. I'd like to know whether there is a balanced method that runs fast and still saves disk space.
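One balanced option might be to keep the direct pickle.dump but write through gzip at a lower compression level, trading a bit of disk space for speed. This is only a sketch (save_compressed is a hypothetical helper, and the best compresslevel would need benchmarking):

```python
import gzip
import pickle


def save_compressed(obj, path, compresslevel=4):
    # compresslevel ranges 1-9: 1 is fastest but largest,
    # 9 is slowest but smallest; 4 is a middle ground
    with gzip.open(path, "wb", compresslevel=compresslevel) as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
```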

mrzzmrzz avatar Oct 19 '22 15:10 mrzzmrzz

Hi! Thanks for raising this issue!

The dataset pre-loading step is very time-consuming, so we only do it the first time the dataset is processed and then save the result to a .pkl.gz file. Multiprocessing execution is a good idea! We didn't include it in the first version in order to keep the code simple, but you're welcome to implement this part and open a pull request.

For save_pickle, we think the current implementation is a good trade-off: since you only need to save the file once, it's better to sacrifice time to save disk space.

Oxer11 avatar Oct 20 '22 00:10 Oxer11

OK, I may make a pull request soon.

Besides, I found that the dataset saved by save_pickle takes about 44 GB in memory once loaded, while the version saved with the direct pickle.dump method takes about 50 GB. Overall, the two methods perform similarly in terms of memory usage.
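For what it's worth, here is a rough sketch of how such numbers can be measured, using psutil (an extra dependency; the file name is a placeholder):

```python
import gzip
import os
import pickle

import psutil


def rss_gb():
    # Resident set size of the current process, in GB
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3


before = rss_gb()
with gzip.open("enzyme_commission.pkl.gz", "rb") as f:  # placeholder path
    dataset = pickle.load(f)
print(f"loaded dataset adds about {rss_gb() - before:.1f} GB of RSS")
```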

mrzzmrzz avatar Oct 21 '22 11:10 mrzzmrzz