Running into memory issues in a Hydra sweep
@JanuszL Hope all is well on your end.
I have a pipeline that uses PyTorch Lightning + DALI to do pose estimation. I've trained many models successfully by now. We are using Hydra to orchestrate our configs and perform sequential sweeps.
When I run 5-6 sequential jobs using Hydra's multirun -- all goes well.
The problem arises when I run a larger sweep over 20 models, as follows:
python scripts/train_hydra.py --multirun training.train_frames=75,100,150,200 \
'model.losses_to_use=[],[pca_multiview],[temporal],[unimodal],[pca_singleview]' \
training.max_epochs=2
where training.max_epochs=2 is there to first verify that each of the 20 models above can successfully train for 2 epochs (later I'll drop this argument).
I run into memory issues:
Could not allocate physical storage of size 123535360 on device 0
cuda_vm_resource stat dump:
total VM size: 12884901888
currently allocated: 4545120864
peak allocated: 4545120864
allocated_blocks: 69
block size: 67108864
non-freed allocations: 92
total allocations: 132
total deallocations: 40
total unmapping: 0
free pool size: 18446744071829793952
Pool map:
================================
VA region 0000040000000000 : 0000040100000000
0000040000000000 In use Mapped 00007F58F4018B30
0000040004000000 In use Mapped 00007F58F4018E90
0000040008000000 In use Mapped 00007F58F0005230
000004000C000000 In use Mapped 00007F58F0005590
0000040010000000 In use Mapped 00007F58F001B040
0000040014000000 In use Mapped 00007F58F03F7240
0000040018000000 In use Mapped 00007F58F0404850
Note that I used the exact same batch_size, sequence_length, and file_name as for those 5-6 models that trained successfully.
I'm not sure whether my problem is Hydra-related or DALI-related; something about running multiple models or DALI loaders seems to be affecting memory, I think.
Two things to note:
1. I can report that my lab mate ran a similar sweep overnight without Hydra multirun (on a smaller video), and didn't run into memory issues.
2. I saw this DALI issue: https://issueexplorer.com/issue/NVIDIA/DALI/3387
LMK what you think, Dan
Hi @danbider,
It looks like you ran out of memory. Maybe one of your models consumes more memory; it is hard to tell. What you can do is ask PyTorch to release the GPU memory after/before each training, e.g. with gc.collect(). It could be that in the other case (compared to your lab mate's), due to randomness, DALI consumed just a little less memory.
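Something along these lines (just a sketch; train_one_model and cfg are placeholder names for whatever your multirun entry point looks like):

import gc
import torch

def train_one_model(cfg):
    # build the DALI pipeline + Lightning trainer and fit the model here
    ...

def main(cfg):
    train_one_model(cfg)
    # once the trainer and DALI pipeline have gone out of scope, ask Python
    # and PyTorch to hand cached GPU memory back before the next job starts
    gc.collect()
    torch.cuda.empty_cache()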
Thanks @JanuszL, I agree with your diagnosis. This error is not consistent; there seems to be some randomness about when it occurs. I'll keep you posted.
@danbider any luck on this?
For me it appears to be the case that the GPU's memory usage accumulates across runs, i.e. the first run is fine, but the second run requires twice the memory, the third run three times as much, etc.
torch.cuda.empty_cache() at the end of the main function does not help, and neither does gc.collect().
Hi @alvitawa,
For me it appears to be the case that the GPU's memory usage accumulates across runs, i.e. the first run is fine, but the second run requires twice the memory, the third run three times as much, etc.
The overall memory consumption may slightly increase after the first iteration, as the second batch may just be bigger. However, this kind of growth across runs is not expected. Maybe the instance of the DALI pipeline is still alive from the previous run and the memory consumption just adds up. Do you have a minimal and standalone repro we can run on our own?
I have also encountered the same issue. In fact, I believe the problem lies in Hydra's multirun mode, which invokes joblib. By default, joblib sets n_jobs to -1, and that's where the problem arises: joblib does not release GPU memory until all jobs have finished running.
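If that is the cause, one workaround could be to cap joblib's parallelism for the sweep; a rough example, assuming the hydra-joblib-launcher plugin is the one scheduling the jobs:

python scripts/train_hydra.py --multirun hydra/launcher=joblib hydra.launcher.n_jobs=1 \
    training.train_frames=75,100,150,200 \
    'model.losses_to_use=[],[pca_multiview],[temporal],[unimodal],[pca_singleview]'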
@sizer74 - interesting. In that case, at the end of the script, I would call the DALI API to release the allocated memory, as long as the pipeline is destroyed before that call.
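Roughly like this at the end of each job (a sketch; pipe is a placeholder for whatever variable holds your DALI pipeline, and it assumes your DALI version exposes nvidia.dali.backend.ReleaseUnusedMemory):

from nvidia.dali.backend import ReleaseUnusedMemory

# training for this job has finished; make sure no pipeline instance is alive
del pipe
# then return the unused memory held in DALI's pools back to the driver
ReleaseUnusedMemory()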