Preprocessing is very slow when generating the h5 file
I noticed that generating the h5 file with the preprocessing script is extremely slow. I tried to create a training set from mini_chembl, which contains around 160,000 molecules. Running on an i7 desktop CPU, the job has been going for more than 24 hours, and from the restart_index I can see it has only processed 50,000 molecules. Worse, it runs slower and slower as the h5 file grows: a small set of molecules, for example the 1,000 molecules in the validation set, takes only a few minutes to preprocess, but now that train.h5.chunk has grown, 1,000 molecules take 2 hours. Is this expected, or is something wrong with my settings or my hardware?
The params are posted below:
- atom type: C N O F S Cl Br
- chirality: None R S
- formal charge: -1, 0, 1
- ignore H: True
- imp_H: 0 1 2 3
- max_n_node: 70
- use_aromatic_bond: False
- use chirality: False
- use explicit H: False
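For reference, these settings correspond roughly to the preprocessing entries in GraphINVENT's submission-script `params` dictionary. The key names below are assumptions based on the repository's submit script and may differ slightly in your version:

```python
# Hedged reconstruction of the preprocessing settings above as a GraphINVENT
# params dict; key names are assumed and may not match your version exactly.
params = {
    "atom_types": ["C", "N", "O", "F", "S", "Cl", "Br"],
    "chirality": ["None", "R", "S"],
    "formal_charge": [-1, 0, 1],
    "ignore_H": True,
    "imp_H": [0, 1, 2, 3],
    "max_n_nodes": 70,
    "use_aromatic_bonds": False,
    "use_chirality": False,
    "use_explicit_H": False,
}
```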
Indeed, preprocessing large datasets can be very slow in GraphINVENT. There is not yet a good parallel implementation for preprocessing the training data, but in the meantime we have created a script that splits the dataset and runs separate preprocessing jobs (effectively parallelizing the preprocessing). That script is available at tools/submit-split-preprocessing-supercloud.py; you will have to modify it for your own dataset.
I hope to update the source code soon with a better, parallel implementation of the preprocessing that handles all this automatically.