Running inference using MultiGPU
Hi everyone,
I am trying to run inference using multiple GPUs. I am currently able to run it on a single GPU, which is selected through this argument:

```python
parser.add_argument(
    "--model_device", type=str, default="cpu",
    help="""Name of the device on which to run the model. Any valid torch
            device name is accepted (e.g. "cpu", "cuda:0")"""
)
```

so I can use cuda:0 for GPU 0, cuda:1 for GPU 1, etc.
But I did not find an argument for distributed multi-GPU inference, like the GPU argument in train_openfold.py.
Thanks in advance
Hi @chinmayraneQ, this is my guess (and others who know better may correct it): there is no such argument. As things are implemented, the model must fit in the GPU RAM. During training, multiple GPUs are used more or less independently on independent training examples, and the incremental changes are then combined; it is like a series of jobs, each done on one GPU. Alternatively, to fit a really big job, one could borrow RAM from additional GPUs if they are available and have a fast RAM interconnect. (I am not sure how well this is supported by unified memory; it can certainly borrow RAM from the CPU. But training avoids this need by working on parts of proteins.) I guess you have the same two options for inference, too:
- If you have several jobs and multiple GPUs, just run multiple processes, each using one of the GPUs.
- If you have a really big job, you might be able to use one GPU with additional RAM borrowed from the others (again, I am not sure this is really well supported by drivers; you can certainly borrow CPU RAM). For inference with long sequences this may be necessary.
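The first option can be sketched in Python. This is a hypothetical wrapper, not part of OpenFold; `build_commands` and `launch_all` are names I made up, and the script name and the `--model_device` flag are the ones discussed in this thread:

```python
# Sketch of option 1: one independent inference process per GPU.
# Each process gets its own fasta directory and its own cuda:N device.
import subprocess

def build_commands(fasta_dirs, base_cmd):
    """Build one command per GPU: each gets its own fasta dir and device."""
    return [
        base_cmd + [fasta_dir, "--model_device", f"cuda:{gpu_id}"]
        for gpu_id, fasta_dir in enumerate(fasta_dirs)
    ]

def launch_all(commands):
    """Start every process, then wait for all of them to finish."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]
```

For example, `build_commands(["/data/fa0", "/data/fa1"], ["python3", "run_pretrained_openfold.py"])` produces one command pinned to cuda:0 and one pinned to cuda:1.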
Which of the two cases is yours?
Thanks @vaclavhanzl for your reply. So you mean I have to run inference on multiple GPUs by running the inference command individually in each terminal, setting "cuda:0", "cuda:1", etc.?
Currently I am just in an initial phase of testing the environment, and I was using one file - https://rest.uniprot.org/uniprotkb/P06214.fasta - which ran successfully at 100% utilization on one V100 GPU out of 4. Then I tried one more fasta file in the hope that it would use 2 GPUs, one per file, but I got an error as follows:
INFO:/workspace/openfold/openfold/utils/script_utils.py:Loaded OpenFold parameters at /workspace/models/openfold_params/finetuning_ptm_2.pt...
INFO:run_pretrained_openfold.py:Generating alignments for sp|P06214|HEM2_RAT...
Traceback (most recent call last):
File "run_pretrained_openfold.py", line 401, in
So I believe we have to separate the sequence files into different folders and run each individually on its own GPU?
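The folder split described above can be sketched as a small helper. This is a hypothetical utility (not part of OpenFold): it distributes the .fasta files of one directory round-robin into one sub-directory per GPU, so each GPU can get its own run:

```python
# Sketch: split a directory of .fasta files into one sub-directory per
# GPU, round-robin, so each inference process reads a separate folder.
import os
import shutil

def split_fasta_dir(src_dir, n_gpus):
    """Copy src_dir/*.fasta into src_dir_gpu0, src_dir_gpu1, ... round-robin."""
    fastas = sorted(f for f in os.listdir(src_dir) if f.endswith(".fasta"))
    out_dirs = []
    for i in range(n_gpus):
        d = f"{src_dir}_gpu{i}"
        os.makedirs(d, exist_ok=True)
        out_dirs.append(d)
    for idx, name in enumerate(fastas):
        shutil.copy(os.path.join(src_dir, name),
                    os.path.join(out_dirs[idx % n_gpus], name))
    return out_dirs
```

Each resulting directory can then be passed as the fasta dir of its own run_pretrained_openfold.py process.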
Thanks again for your quick response
I guess it failed while computing the MSA, before doing anything with the GPU(s). Maybe you could share your full command line?
And yes, totally separate runs on "cuda:0", "cuda:1", etc. is what I was trying to suggest.
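A common alternative to passing different cuda:N values (standard CUDA behavior, not anything OpenFold-specific) is to pin each process to one physical GPU via the CUDA_VISIBLE_DEVICES environment variable, so every process can use "cuda:0" internally. A minimal sketch, with a made-up helper name:

```python
# Sketch: pin a subprocess to one physical GPU with CUDA_VISIBLE_DEVICES.
# Inside the child process, that GPU then appears as "cuda:0". The command
# itself is up to you (e.g. a run_pretrained_openfold.py invocation).
import os
import subprocess

def run_on_gpu(cmd, gpu_id):
    """Launch `cmd` so that only physical GPU `gpu_id` is visible to it."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    return subprocess.Popen(cmd, env=env)
```

This keeps the per-terminal workflow, but the device flag can stay "cuda:0" in every run.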
Sure, I am using the same run_pretrained_openfold.py command:
```shell
python3 run_pretrained_openfold.py /workspace/fasta_dir /workspace/dataset/pdb_mmcif/mmcif_files/ \
    --uniref90_database_path /workspace/dataset/uniref90/uniref90_fasta.fasta \
    --mgnify_database_path /workspace/dataset/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path /workspace/dataset/pdb70/pdb70 \
    --uniclust30_database_path /workspace/dataset/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --output_dir ./ \
    --bfd_database_path /workspace/dataset/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --model_device "cuda:3" \
    --jackhmmer_binary_path lib/conda/envs/openfold_venv/bin/jackhmmer \
    --hhblits_binary_path lib/conda/envs/openfold_venv/bin/hhblits \
    --hhsearch_binary_path lib/conda/envs/openfold_venv/bin/hhsearch \
    --kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign \
    --config_preset "model_1_ptm" \
    --openfold_checkpoint_path /workspace/models/openfold_params/finetuning_ptm_2.pt
```
I haven't tried the --long_sequence_inference argument yet.
I'd double-check the format of the files in /workspace/fasta_dir. Also, can you please find out the version of the installed jackhmmer?
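One way to check that version: `jackhmmer -h` prints a banner whose second line names the HMMER release (e.g. "# HMMER 3.3.2 (Nov 2020); http://hmmer.org/"). A small sketch that parses it; `parse_hmmer_version` and `jackhmmer_version` are hypothetical helper names, and the regex assumes the standard HMMER banner format:

```python
# Sketch: extract the HMMER version from the `jackhmmer -h` banner.
import re
import subprocess

def parse_hmmer_version(banner):
    """Return the version string from a HMMER banner, or None if absent."""
    m = re.search(r"HMMER (\d+\.\d+(?:\.\d+)?)", banner)
    return m.group(1) if m else None

def jackhmmer_version(binary="jackhmmer"):
    """Run `<binary> -h` and parse the version out of its banner."""
    banner = subprocess.run([binary, "-h"], capture_output=True,
                            text=True).stdout
    return parse_hmmer_version(banner)
```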
I have used wget directly on https://rest.uniprot.org/uniprotkb/P06214.fasta. As mentioned, it worked for one file, but the error appeared with 2 files.
I did not get the previous error when I used --long_sequence_inference. It processed the sequences sequentially on one GPU and also showed lower GPU utilization. I will try 3 sequences of the same length next.
Also, I created an issue here about a problem I am facing with training, where I cannot compute alignments. Any suggestions?
https://github.com/aqlaboratory/openfold/issues/313