torchdrug
[Feature Request] The Program runs extremely slowly when using `ProteinDataset.load_pdbs()`
When I use the EnzymeCommission class to construct a protein dataset (about 18K proteins), the load_pdbs()
function executes very slowly (around two hours). I'd like to know whether the code could be refactored to support multiprocessing execution.
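A minimal sketch of what such a refactoring could look like, assuming the per-file parsing is independent: distribute the PDB files over a `multiprocessing.Pool`. Here `parse_pdb` is a hypothetical stand-in for the real per-file work (e.g. `data.Protein.from_pdb` in torchdrug), not the library's actual implementation.

```python
from multiprocessing import Pool

def parse_pdb(path):
    # Placeholder for the real parsing step, e.g.:
    #   return data.Protein.from_pdb(path)
    # Here we just read the file so the sketch is self-contained.
    with open(path) as f:
        return path, f.read()

def load_pdbs_parallel(pdb_files, num_workers=4):
    # Pool.map preserves input order, so results line up with pdb_files.
    with Pool(num_workers) as pool:
        return pool.map(parse_pdb, pdb_files)
```

Note that worker functions must be picklable (defined at module top level), and on platforms using the `spawn` start method the call should be guarded by `if __name__ == "__main__":`.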
Besides, the function save_pickle
in this class also runs slowly: it takes about 40 minutes to save the .pkl.gz
file. However, when I directly use pickle.dump()
to save a tuple containing the pdb_files, data, and sequences, it finishes in under a minute, BUT the saved file is larger than with the current method (about 8 times larger). I'd like to know whether a balanced method could run fast and still save disk space.
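One possible middle ground, sketched below under the assumption that the slow part is the compression itself: pickle the whole tuple in one call, but write it through gzip at a low compression level. `compresslevel=1` is typically several times faster than the default level 9 while still shrinking the file substantially. The function names here are illustrative, not torchdrug API.

```python
import gzip
import pickle

def save_compressed(obj, path, compresslevel=1):
    # Low compresslevel trades some file size for much faster writes;
    # HIGHEST_PROTOCOL gives the most compact and fastest pickle encoding.
    with gzip.open(path, "wb", compresslevel=compresslevel) as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_compressed(path):
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```

Raising `compresslevel` toward 9 recovers more disk space at the cost of save time, so the tradeoff can be tuned per dataset.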
Hi! Thanks for raising this issue!
The dataset pre-loading part is very time-consuming, so we only do it the first time the dataset is processed and save the result in a .pkl.gz
file. Multiprocessing execution is a good idea! We didn't include it in the first version in order to keep the code simple. Nevertheless, you are welcome to write this part and make a pull request.
For save_pickle
, we think the current implementation is a good solution. Since you only need to save the file once, it's better to sacrifice time to save space.
OK, maybe I will make a pull request soon.
Besides, I found that the data loaded from the save_pickle
output occupies about 44GB in memory, while the direct pickle.dump
method uses about 50GB. Overall, it seems the two methods perform similarly in that regard.