CUDA Out Of Memory issue

Open emil-peters opened this issue 1 year ago • 3 comments

During some testing I kept getting CUDA OOM errors while running code under pyinstrument in which multiple models were run one after another. Even though I made sure no references to the tensors were kept in the Python code, the CUDA OOM errors persisted whenever pyinstrument was enabled. Once I disabled it, the errors disappeared and my VRAM was released as expected after each reference was deleted.

Is there an option to ensure pyinstrument clears its references to ONNX and torch tensors, especially after calling del tensor? I'd like to keep using pyinstrument, but as it stands it isn't feasible.
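
For context, the failing pattern looks roughly like this (the checkpoint paths and input shapes below are placeholders, not my actual code):

    import gc
    import torch
    from pyinstrument import Profiler

    profiler = Profiler()
    profiler.start()

    for checkpoint in ["model_a.pt", "model_b.pt"]:  # placeholder paths
        model = torch.load(checkpoint).cuda()
        with torch.no_grad():
            output = model(torch.randn(1, 3, 224, 224, device="cuda"))
        # drop every Python reference before loading the next model
        del output, model
        gc.collect()
        torch.cuda.empty_cache()  # frees VRAM without pyinstrument, but not with it enabled

    profiler.stop()
    print(profiler.output_text())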

  • Emil

emil-peters · Oct 28 '24 15:10

I have a similar problem where a relatively heavy object is not garbage collected when I leave the context, even with del (Python 3.12, interval=0.1). The growth shows up starkly in tracemalloc, with the number of objects growing by exactly the number of instantiations (or a multiple of it). This results in an OOM of the whole process after a few minutes. The behavior only occurs when using pyinstrument; RAM usage stays stable with any other profiler. I have been using pyinstrument for years and I don't recall such a problem before (perhaps it appeared with the change from 3.7 to 3.12?). Might be related to #296.
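
Roughly how I observe it with tracemalloc (a sketch; the list of bytearrays stands in for my real object, which is application-specific):

    import tracemalloc
    from pyinstrument import Profiler

    tracemalloc.start()
    baseline = tracemalloc.take_snapshot()

    for _ in range(100):
        with Profiler(interval=0.1):
            obj = [bytearray(4096) for _ in range(10_000)]  # stand-in for the heavy object
        del obj  # reference dropped here, yet the allocation count keeps climbing

    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.compare_to(baseline, "lineno")[:10]:
        print(stat)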

Aedial · Nov 06 '24 13:11

I'm encountering a similar problem. I tracked it down to calls to output_html.

        profiler.stop()
        profiler.output_html()
        profiler.reset()

Using 4.6.2, memory usage (max RSS) climbs ~2MB over 100 profiling sessions. Using 5.0.0, memory usage climbs ~40MB for the same number of sessions.

If I comment out the call to output_html, memory usage stays steady.
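
The measurement loop looks roughly like this (a sketch; the sleep stands in for the real workload, and ru_maxrss is reported in kB on Linux):

    import resource
    import time
    from pyinstrument import Profiler

    profiler = Profiler()
    for i in range(100):
        profiler.start()
        time.sleep(0.05)  # placeholder for the real workload
        profiler.stop()
        profiler.output_html()  # commenting out this line keeps memory flat
        profiler.reset()
        if i % 10 == 0:
            max_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            print(f"session {i}: max RSS {max_rss} kB")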

davidemassarenti-optio3 · Nov 07 '24 03:11

I am facing the same issue. My code, which uses torch on the GPU, runs fine when launched with python, but hits torch.OutOfMemoryError: CUDA out of memory when launched with pyinstrument.

xiaobanni · Nov 18 '24 13:11