
about preprocess_raw_data.py

lijiashan2020 opened this issue 3 years ago

When I run the command as follows:

python preprocess_raw_data.py -n_jobs 60 -data dips -graph_nodes residues -graph_cutoff 30 -graph_max_neighbor 10 -graph_residue_loc_is_alphaC -pocket_cutoff 8 -data_fraction 1.0

it generates six files in the directory /extendplus/jiashan/equidock_public/src/cache/dips_residues_maxneighbor_10_cutoff_30.0_pocketCut_8.0/cv_0:

label_test.pkl  ligand_graph_test.bin  receptor_graph_test.bin
label_val.pkl   ligand_graph_val.bin   receptor_graph_val.bin

However, the three remaining files (for the training split) are not generated; the run fails with the following error:

Processing  ./cache/dips_residues_maxneighbor_10_cutoff_30.0_pocketCut_8.0/cv_0/label_frac_1.0_train.pkl
Num of pairs in  train  =  39901
Killed

Could you help me solve this problem? Thanks!

lijiashan2020 avatar Apr 12 '22 02:04 lijiashan2020

Generating the full DIPS training data takes a lot of time, and you have to check that you have enough resources for it. Can you try generating just a fraction of it first, e.g., -data_fraction 0.1?
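For example, the command from the report above with only the data fraction reduced (all other flags unchanged; just a sketch of the suggestion, adjust to your setup):

python preprocess_raw_data.py -n_jobs 60 -data dips -graph_nodes residues -graph_cutoff 30 -graph_max_neighbor 10 -graph_residue_loc_is_alphaC -pocket_cutoff 8 -data_fraction 0.1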

octavian-ganea avatar Apr 14 '22 17:04 octavian-ganea

Thank you for your reply! I can successfully run the command by modifying parameters! Thank you very much for help!

lijiashan2020 avatar Apr 19 '22 14:04 lijiashan2020

I ran it with 160 GB of RAM for five hours and it still failed with the same error. It really needs a huge amount of resources. Marking this in the hope it is useful for others.

lizhenping avatar Nov 30 '22 12:11 lizhenping

Marking it: I used 25 CPUs and 400 GB of RAM, and the preprocessing ran for 15 hours.

lizhenping avatar Dec 01 '22 02:12 lizhenping

I had the same problem. The main cause is insufficient memory. Preprocessing the training data of the DIPS dataset requires a large amount of memory; I could not complete it in one pass on a server with 256 GB of memory.

One workaround is to process the data in batches. /DIPS/data/DIPS/interim/pairs-pruned/pairs-postprocessed-train.txt lists all the PDB pairs waiting to be preprocessed, so you can split that txt file into several parts, preprocess each part separately, and then merge the generated files together (see the sketch after this comment). I divided the training data into two parts and finished the preprocessing successfully on a 256 GB server.
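A minimal merge sketch for the second step, assuming each part was preprocessed into its own cache directory (the part directory names below are hypothetical), the label .pkl files hold plain Python lists, and the graph .bin files were written with DGL's save_graphs; check the exact train file names against what preprocess_raw_data.py actually produced:

# merge_train_parts.py -- combine the train outputs of two partial preprocessing runs
import pickle
from dgl.data.utils import load_graphs, save_graphs

part_dirs = ['./cache_part1/cv_0', './cache_part2/cv_0']  # hypothetical part locations
out_dir = './cache/dips_residues_maxneighbor_10_cutoff_30.0_pocketCut_8.0/cv_0'

# 1) Merge the label pickles (assumed to be Python lists, one entry per pair).
labels = []
for d in part_dirs:
    with open(f'{d}/label_frac_1.0_train.pkl', 'rb') as f:
        labels.extend(pickle.load(f))
with open(f'{out_dir}/label_frac_1.0_train.pkl', 'wb') as f:
    pickle.dump(labels, f)

# 2) Merge the DGL graph binaries; load_graphs returns (graph_list, label_dict).
#    File names are assumed from the val/test naming pattern.
for name in ['ligand_graph_frac_1.0_train.bin', 'receptor_graph_frac_1.0_train.bin']:
    graphs = []
    for d in part_dirs:
        part_graphs, _ = load_graphs(f'{d}/{name}')
        graphs.extend(part_graphs)
    save_graphs(f'{out_dir}/{name}', graphs)

The key point is to keep the ligand graphs, receptor graphs, and labels in the same pair order across the merged files, so merge all three in the same part order.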

Octopus125 avatar Jan 19 '23 03:01 Octopus125