Memory leak using TraceEnum_ELBO
I noticed a major memory leak when training an SVI model using TraceEnum_ELBO.
I initially noticed this in a custom model we are developing, but it then turned out to be a more general bug.
For example, it affects even the Pyro GMM tutorial here, where memory usage quickly grows from a couple of hundred MB to many GB!
I ran this on a MacBook Pro 2019 running macOS 10.15. To replicate the issue, it is enough to run the linked notebook.
I tried commenting out the lines below and adding a garbage-collector call; that reduces the memory accumulation by roughly an order of magnitude but does not solve the problem completely, and it becomes particularly severe for large datasets.
```python
# Register hooks to monitor gradient norms.
# gradient_norms = defaultdict(list)
# for name, value in pyro.get_param_store().named_parameters():
#     value.register_hook(lambda g, name=name: gradient_norms[name].append(g.norm().item()))

import gc

losses = []
for i in range(200000):
    loss = svi.step(data)
    # losses.append(loss)
    gc.collect()
```
(from this forum post)
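To quantify the growth, something along these lines can be used; this is just a rough sketch (psutil is a third-party package, and `svi` and `data` are the objects from the notebook):

```python
import psutil  # third-party dependency, assumed installed

proc = psutil.Process()  # current process
for i in range(1000):
    svi.step(data)
    if i % 100 == 0:
        # Resident set size in MB; grows steadily across steps.
        rss_mb = proc.memory_info().rss / 1e6
        print(f"step {i}: RSS = {rss_mb:.0f} MB")
```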
@gioelelm could you check to see if #3069 fixes your issue?
Thank you for the quick attempt, but no, it does not fix the problem: neither the leak without gc.collect() nor the residual leak when garbage collecting.
Thanks for checking @gioelelm. I might have time in the next few weeks to dive deeper. If you have time, I can recommend some strategies (what I'd try):
- get an idea of which tensors are leaking using this trick (see the sketch after this list)
- try to determine which objects might be holding references to the leaking tensors using something like
```python
elbo = TraceEnum_ELBO()
optim = ClippedAdam(...)
svi = SVI(model, guide, optim, elbo)
for step in range(steps):
    svi.step()
    print("svi", len(pickle.dumps(svi)))
    print("elbo", len(pickle.dumps(elbo)))
    print("optim", len(pickle.dumps(optim)))
    print("param_store", len(pickle.dumps(pyro.get_param_store())))
```
- See if this is a recent PyTorch bug by trying inference with different torch versions, say 1.11, 1.10, 1.9, 1.8. I'm pretty sure the GMM tutorial should still work with older PyTorch versions.
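For the first bullet, something like this (untested sketch, names are just illustrative) could give a census of live tensors; run it between steps and diff the counts to see which tensor shapes are accumulating (`svi` and `data` as in the notebook):

```python
import gc
from collections import Counter

import torch


def live_tensor_census():
    """Count live torch.Tensors by (dtype, shape), walking the GC heap."""
    counts = Counter()
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                counts[(obj.dtype, tuple(obj.shape))] += 1
        except Exception:
            # Some objects raise on inspection; skip them.
            pass
    return counts


before = live_tensor_census()
svi.step(data)
after = live_tensor_census()
for key, n in (after - before).most_common(10):
    print(key, f"+{n}")
```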
Ok thanks!
I will try the first two.
Regarding the last point: since the code runs successfully anyway (provided the machine has enough memory), don't you think the bug could have gone unnoticed? Or do you have some reason to exclude that? I am thinking of the fact that one would have had to profile the program's memory usage to figure out that there was a problem.
> Don't you think that the bug could have gone unnoticed?
It could have, but TraceEnum_ELBO is pretty heavily used, and we've done a lot of memory profiling in the past. After working with Pyro and PyTorch for a few years, my posterior is 40% on a recent PyTorch regression, 40% on an edge case memory leak in Pyro that has never been noticed, and 20% on a recently introduced weird interaction between Pyro and PyTorch, so 60% chance this could be narrowed down by searching through PyTorch versions.
I have noticed a major GPU memory leak as well when switching from PyTorch 1.10 to 1.11. I wasn't able to debug it and decided to stick to PyTorch 1.10.0 (and Pyro 1.8.0) for now.
Edit: CUDA 11.6, Arch Linux
Hmm, maybe we should relax the PyTorch requirements and cut a release so that Pyro 1.8.2 works with PyTorch 1.10. We'd need to do the same with Funsor. I think I was a little too eager in dropping PyTorch 1.10 support, especially given that Colab still uses 1.10.
I have noticed a GPU memory leak too with Pyro 1.8.1+06911dc and PyTorch 1.11.0. Downgrading to Pyro 1.6.0 and PyTorch 1.8.0 works normally.
Downgrading, as @qinqian suggests, also resolves #3014.