Training frequently fails.
I noticed that every time 3 epochs of training completed, the training process failed.
Here you can see the process is still running and memory is still allocated, but the GPUs are not actually doing any work. I'm sure the memory usage stays at around 60 GB the whole time, yet it still shows me:
RuntimeError: CUDA out of memory. Tried to allocate 3.12 GiB (GPU 0; 79.20 GiB total capacity; 10.85 GiB already allocated; 885.31 MiB free; 14.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
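For reference, one way to confirm this pattern from the command line is to poll GPU utilization alongside memory use; utilization near 0% while memory.used stays high matches the "allocated but idle" symptom described above (the 5-second interval is arbitrary):
# poll utilization and memory every 5 seconds; compute should be >0% if training is actually progressing
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 5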
Here are my arguments:
--template_release_dates_cache_path=~~~/openfold/mmcif_cache.json \
--precision=bf16 \
--gpus=2 --replace_sampler_ddp=True \
--seed=42 \
--deepspeed_config_path=~~~/openfold/deepspeed_config.json \
--checkpoint_every_epoch \
--obsolete_pdbs_file_path=~~~/pdb_mmcif/obsolete.dat \
--max_epochs=100 \
--train_epoch_len=200 \
--config_preset="model_5_multimer_v3" \
--num_nodes=1 \
--train_mmcif_data_cache_path=~~~/openfold/mmcif_cache.json \
I see in the README they recommend using mixed precision: https://github.com/aqlaboratory/openfold/blob/3c1fd31ac47c8da54d088badd9eba61fe0b3fd26/docs/source/Training_OpenFold.md?plain=1#L129
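For reference, a sketch of what the precision flag would look like with that mixed-precision setting; bf16-mixed is the PyTorch Lightning 2.x spelling, while older 1.x releases accept plain bf16, so adjust to whatever your installed Lightning version expects (the placeholder stands in for your existing data/output arguments):
# mixed-precision launch sketch; <your existing data and output arguments> is a placeholder
python3 train_openfold.py <your existing data and output arguments> \
    --precision bf16-mixed \
    --gpus 2 \
    --config_preset "model_5_multimer_v3"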
It looks like a GPU setup problem. Did you try the suggestion from the error?
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
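Concretely, that means setting the allocator config in the environment before launching the training script, for example (the 512 MiB split size is just a starting point to tune, not a recommended value, and the argument placeholder is hypothetical):
# cap the size of splittable allocator blocks to reduce fragmentation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
python3 train_openfold.py <your existing arguments>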
Did you ever solve this?
I'm using an RTX 4090 (24 GB) and getting an OOM error. Playing around with values like PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:256 has gotten the allocation request close to the available memory reported in the error. I've tried a handful of other values, but it doesn't seem to change much.
Similarly, setting the use_deepspeed_evo_attention variable to true on line 407 of openfold/config.py changes the memory requirement in the error from GiB to MiB, but I don't know whether this is actually getting me any closer to a result.
My current error is:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 78.00 MiB. GPU 0 has a total capacty of 23.63 GiB of which 64.50 MiB is free. Including non-PyTorch memory, this process has 23.45 GiB memory in use. Of the allocated memory 22.36 GiB is allocated by PyTorch, and 403.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Parameters:
~~/20240528/mmcifs/
~~/20240528/precomputed_merged/
~~/openfold_cuda121/data/pdb_mmcif/mmcif_files/
~~/20240528/training_dir/
2024-06-06
--template_release_dates_cache_path ~~/openfold_cuda121/data/pdb_mmcif/pdb_mmcif_cache.json
--train_mmcif_data_cache_path ~~/20240528/mmcif_cache.json
--precision bf16  # I've tried this and bf16-mixed
--gpus 1
--seed 1234
--deepspeed_config_path ~~/openfold_cuda121/deepspeed_config/rb_workstations_deepspeed_config.json
--checkpoint_every_epoch
--obsolete_pdbs_file_path ~~/openfold_cuda121/data/pdb_mmcif/obsolete.dat
--config_preset model_5_multimer_v3
--resume_from_jax_params ~~/openfold_cuda121/openfold/resources/params/params_model_5_multimer_v3.npz
--max_epochs 20
--num_nodes 1
--train_epoch_len 50
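If you are on a recent PyTorch 2.x build (which the cuda121 environment suggests, though that is an assumption), the allocator's expandable_segments option is another fragmentation knob worth trying alongside or instead of max_split_size_mb; the command below is a sketch with a placeholder for your existing arguments:
# expandable_segments lets the caching allocator grow existing segments instead of fragmenting fixed-size blocks
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 train_openfold.py <your existing arguments>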