Memory leak using TraceEnum_ELBO

Open gioelelm opened this issue 3 years ago • 9 comments

I noticed a major memory leak when training with SVI using TraceEnum_ELBO. I first saw it in a custom model we are developing, but it appears to be a more general bug.

For example, it even affects the GMM example from the Pyro tutorials here, where memory usage quickly grows from a couple of hundred MB to many GB.

I ran this on a 2019 MacBook Pro running macOS 10.15. To reproduce the issue, it is enough to run the linked notebook.

I tried commenting out the following lines and adding a garbage collector call. That reduces the memory accumulation by about an order of magnitude but does not solve the problem completely, and the leak becomes particularly severe for large datasets.

# Register hooks to monitor gradient norms.
# gradient_norms = defaultdict(list)
# for name, value in pyro.get_param_store().named_parameters():
#     value.register_hook(lambda g, name=name: gradient_norms[name].append(g.norm().item()))

import gc
losses = []
for i in range(200000):
    loss = svi.step(data)
    #losses.append(loss)
    gc.collect()

(from this forum post)
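
To put a number on how fast memory grows, one option is to log the process's peak resident set size every few steps. The following is a minimal sketch and not part of the notebook: `peak_rss_mb` and `track_memory` are hypothetical helper names, and `step_fn` stands in for something like `lambda: svi.step(data)`. It uses the standard-library `resource` module, so it assumes a Unix-like system (macOS or Linux).

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in MB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return rss / divisor


def track_memory(step_fn, num_steps, report_every=100):
    """Call step_fn num_steps times and record (step, peak RSS in MB)."""
    readings = []
    for i in range(num_steps):
        step_fn()
        if i % report_every == 0:
            readings.append((i, peak_rss_mb()))
    return readings
```

Because this reads the peak value, the numbers can only go up. A curve that keeps climbing across many steps is consistent with a leak, but the measurement cannot show memory being released.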

gioelelm avatar Apr 14 '22 14:04 gioelelm

@gioelelm could you check to see if #3069 fixes your issue?

fritzo avatar Apr 14 '22 14:04 fritzo

Thank you for the quick attempt, but no, it does not fix the problem: neither the leak without gc.collect() nor the residual leak that remains when garbage collecting.

gioelelm avatar Apr 14 '22 14:04 gioelelm

Thanks for checking @gioelelm. I might have time in the next few weeks to dive deeper. If you have time I can recommend some strategies (what I'd try):

  • get an idea of which tensors are leaking using this trick
  • try to determine which objects might be holding references to the leaking tensors using something like
    import pickle

    elbo = TraceEnum_ELBO()
    optim = ClippedAdam(...)
    svi = SVI(model, guide, optim, elbo)
    for step in range(steps):
        svi.step()
        # A pickled size that keeps growing points at the object holding the leaked state.
        print("svi", len(pickle.dumps(svi)))
        print("elbo", len(pickle.dumps(elbo)))
        print("optim", len(pickle.dumps(optim)))
        print("param_store", len(pickle.dumps(pyro.get_param_store())))
    
  • See if this is a recent PyTorch bug by trying inference with different torch versions, say 1.11, 1.10, 1.9, 1.8. I'm pretty sure the GMM tutorial should still work with older PyTorch versions.
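
The linked trick is not reproduced in this thread. One common way to get at the first suggestion is to walk the objects tracked by Python's garbage collector and count those matching a predicate, before and after a few training steps. The sketch below is an illustration under that assumption; `count_live_objects` and `diff_counts` are hypothetical helper names, not Pyro or PyTorch APIs.

```python
import gc
from collections import Counter


def count_live_objects(predicate, describe=type):
    """Count gc-tracked objects matching predicate, grouped by describe(obj)."""
    counts = Counter()
    for obj in gc.get_objects():
        try:
            if predicate(obj):
                counts[describe(obj)] += 1
        except Exception:
            # Some objects raise on inspection; skip them.
            continue
    return counts


def diff_counts(before, after):
    """Return only the groups whose count increased, with the increase."""
    return {
        key: after[key] - before[key]
        for key in after
        if after[key] > before[key]
    }
```

With PyTorch installed, one could pass `torch.is_tensor` as the predicate and `lambda t: (type(t), tuple(t.size()))` as `describe`, take a snapshot, run a few `svi.step(data)` calls, take a second snapshot, and inspect `diff_counts(before, after)` to see which tensor shapes keep accumulating.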

fritzo avatar Apr 14 '22 16:04 fritzo

Ok thanks!

I will try the first two.

Regarding the last point: since the code runs successfully anyway (provided the machine has enough memory), don't you think the bug could have gone unnoticed? Or do you have some reason to rule that out? I am thinking of the fact that one would have had to profile the program's memory usage to notice there was a problem.

gioelelm avatar Apr 14 '22 17:04 gioelelm

Don't you think that the bug could have gone unnoticed?

It could have, but TraceEnum_ELBO is pretty heavily used, and we've done a lot of memory profiling in the past. After working with Pyro and PyTorch for a few years, my posterior is 40% on a recent PyTorch regression, 40% on an edge case memory leak in Pyro that has never been noticed, and 20% on a recently introduced weird interaction between Pyro and PyTorch, so 60% chance this could be narrowed down by searching through PyTorch versions.

fritzo avatar Apr 14 '22 17:04 fritzo

I have also noticed a major GPU memory leak after switching from PyTorch 1.10 to 1.11. I wasn't able to debug it and decided to stick with PyTorch 1.10.0 (and Pyro 1.8.0) for now.

Edit: CUDA 11.6, Arch Linux

ordabayevy avatar Apr 14 '22 17:04 ordabayevy

Hmm, maybe we should relax the PyTorch requirements and cut a release so that Pyro 1.8.2 works with PyTorch 1.10. We'd need to do the same with Funsor. I think I was a little too eager in dropping PyTorch 1.10 support, especially given that Colab still uses 1.10.

fritzo avatar Apr 14 '22 18:04 fritzo

I have noticed a GPU memory leak too, with Pyro 1.8.1+06911dc and PyTorch 1.11.0. After downgrading to Pyro 1.6.0 and PyTorch 1.8.0, everything works normally.

qinqian avatar Apr 27 '22 02:04 qinqian

Downgrading, as @qinqian suggests, also resolves #3014.

OlaRonning avatar Apr 28 '22 09:04 OlaRonning