potential bug: reference leak when torch is loaded and a function errors

Open ncullen93 opened this issue 1 year ago • 9 comments

When torch is loaded and a function raises an error, nanobind reports a reference leak when the IPython console exits. This does not happen in a plain Python console, and it only happens when torch is loaded. I've seen similar warnings a few other times but could never reproduce them, so I'm unsure whether it's the same issue or a more general reference leak.

import ants
import torch
img = ants.image_read(ants.get_data('r16'))
img.crop_indices([0,0],[10,1000])

Then exit() and you get this:

nanobind: leaked 1 instances!
 - leaked instance 0x13362a208 of type "AntsImageF2"
nanobind: leaked 1 types!
 - leaked type "ants.lib.AntsImageF2"
nanobind: leaked 1 functions!
 - leaked function "cropImage"
nanobind: this is likely caused by a reference counting issue in the binding code.

ncullen93 avatar May 26 '24 19:05 ncullen93

Seems to be something with pytorch and third-party modules generally - e.g. https://github.com/python/cpython/issues/98253. Probably not a big issue but still worrying.

See also nanobind FAQ on the issue https://nanobind.readthedocs.io/en/latest/faq.html#why-am-i-getting-errors-about-leaked-functions-and-types

ncullen93 avatar May 26 '24 20:05 ncullen93

Got the same issue and memory keeps building up until shutting down the program. Found a previous issue https://github.com/ANTsX/ANTsPy/issues/117, and it seems that their example code also has memory leaks in 0.5.4.

antsRegistration -d 2 -r [0x13fc74ce8,0x13fc74d08,1] -m mattes[0x13fc74ce8,0x13fc74d08,1,32,regular,0.2] -t Affine[0.25] -c 2100x1200x1200x0 -s 3x2x1x0 -f 4x2x2x1 -x [NA,NA] -m mattes[0x13fc74ce8,0x13fc74d08,1,32] -t SyN[0.200000,3.000000,0.000000] -c [40x20x0,1e-7,8] -s 2x1x0 -f 4x2x1 -u 1 -z 1 -o [/var/folders/bs/n0q8wqv931g89ppshhgp2m2m0000gn/T//RtmpbpYhz2/filea3bb284caaea,0x13fc74ca8,0x13fc74cc8] -x [NA,NA] --float 1 --write-composite-transform 0 -v 1
nanobind: leaked 9 instances!
 - leaked instance 0x13fc74a08 of type "ants.lib.AntsImageF2"
 - leaked instance 0x13fc749c8 of type "ants.lib.AntsTransformF22"
 - leaked instance 0x13fc74be8 of type "ants.lib.AntsImageF2"
 - leaked instance 0x13fc747a8 of type "ants.lib.AntsImageF2"
 - leaked instance 0x13fc74828 of type "ants.lib.AntsImageF2"
 - leaked instance 0x13fc74ae8 of type "ants.lib.AntsImageF2"
 - leaked instance 0x13fc74a68 of type "ants.lib.AntsTransformF22"
 - leaked instance 0x13fc747c8 of type "ants.lib.AntsImageF2"
 - leaked instance 0x13dd27828 of type "ants.lib.AntsTransformF33"
nanobind: leaked 3 types!
 - leaked type "ants.lib.AntsImageF2"
 - leaked type "ants.lib.AntsTransformF33"
 - leaked type "ants.lib.AntsTransformF22"
nanobind: this is likely caused by a reference counting issue in the binding code.
>>> ants.__version__
'0.5.4'

dipterix avatar Nov 08 '24 17:11 dipterix

> Got the same issue and memory keeps building up until shutting down the program. Found a previous issue https://github.com/ANTsX/ANTsPy/issues/117, and it seems that their example code also has memory leaks in 0.5.4.

Can you please post a reproducible example?

cookpa avatar Nov 08 '24 18:11 cookpa

There was also a report of something similar in #678. The user appears to have deleted their comments, but it was something about running registrations in a loop. I tried reporting memory usage in a loop: it increased a lot on the first iteration and then only very slightly on later ones. It's hard to tell whether that was a small leak or just background optimization processes, or whether Python was not reporting memory correctly.

Will definitely investigate further if there's an example, ideally using built-in antspy data or, failing that, some other public data.
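
For anyone who wants to repeat the loop measurement, here is a minimal sketch of one way to do it with only the standard library. It is not the script used above; the helper names `peak_rss_mib` and `track_peak_rss` are made up for illustration, and it assumes a POSIX system (Linux or macOS). Note that `ru_maxrss` is a high-water mark, so a plateau after the first iteration suggests no leak, while steady growth suggests one.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process in MiB (POSIX only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def track_peak_rss(workload, iterations=500, report_every=100):
    """Call workload() repeatedly and sample peak RSS every few iterations.

    Returns a list of (iteration, peak_rss_in_mib) tuples.
    """
    samples = []
    for i in range(iterations):
        workload()
        if i % report_every == 0:
            samples.append((i, peak_rss_mib()))
    return samples


# Hypothetical usage with built-in ANTsPy data (not run here):
#   import ants
#   fi = ants.image_read(ants.get_data('r16'))
#   mi = ants.image_read(ants.get_data('r64'))
#   for i, mib in track_peak_rss(lambda: ants.registration(fi, mi, 'SyN'), 50, 10):
#       print(i, round(mib, 1))
```

Because this measures the whole process rather than Python objects, it also counts memory held by C++ code behind the bindings.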

cookpa avatar Nov 08 '24 18:11 cookpa

I can't get the torch warnings on my Intel Mac

Python 3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 15:55:29) [Clang 17.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ants
>>> import torch
>>> img = ants.image_read(ants.get_data('r16'))
>>> img.crop_indices([0,0],[10,1000])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniforge/base/envs/antspy_torch/lib/python3.12/site-packages/ants/decorators.py", line 7, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Caskroom/miniforge/base/envs/antspy_torch/lib/python3.12/site-packages/ants/ops/crop_image.py", line 110, in crop_indices
    itkimage = libfn(image.pointer, image.pointer, 1, 2, lowerind, upperind)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: /Users/runner/work/ANTsPy/ANTsPy/itksource/Modules/Core/Common/src/itkDataObject.cxx:367:
Requested region is (at least partially) outside the largest possible region.
>>> exit()

torch == 2.2.2
antspyx == 0.5.4

cookpa avatar Nov 13 '24 14:11 cookpa

Leaving this thread for the torch issue. For general memory leak problems, please see #733.

cookpa avatar Nov 13 '24 17:11 cookpa

For some reason, when I run in bare Python in a terminal, it doesn't complain about the memory leaks. However, the memory leak still persists. For example, the following code keeps using more memory: Activity Monitor shows 7 GB of RAM usage after a few hundred iterations, yet the diff of the memory table shows no sign of a leak.

My suspicion about the nanobind warning is that it appears because we use a Python IDE, and the IDE keeps a weak ref to the Python objects. If an object is not cleared on exit, nanobind raises the warning. I'll try to capture a reproducible example when I get back.

import ants
import numpy as np

img = np.random.random((400, 400))
img_ants = ants.image_clone(ants.from_numpy(img), pixeltype='float')

from pympler.tracker import SummaryTracker
tracker = SummaryTracker()


for i in range(5000):
  ants.image_similarity(img_ants, img_ants, metric_type='Correlation')
  if i % 100 == 0:
    # expect no change in memory table, but increased memory usage
    # print_diff() prints the table itself and returns None, so don't wrap it in print()
    tracker.print_diff()

dipterix avatar Nov 13 '24 17:11 dipterix

> Leaving this thread for the torch issue. For general memory leak problems, please see #733.

Oh, I just saw your message. Thanks for the reference. I'll dig more into this (we do need a reproducible example, but this is tricky).

dipterix avatar Nov 13 '24 17:11 dipterix

Thanks @dipterix, this tracks with what I found previously: Python doesn't think it's using the memory, so I think there are leftover C++ objects.
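
To illustrate why a Python-level tracker can come up empty, here is a small sketch that has nothing to do with ANTsPy internals. It deliberately leaks memory through libc's `malloc` via `ctypes` and checks how much of it `tracemalloc` notices. The function name `native_leak_vs_tracemalloc` is hypothetical, and `ctypes.CDLL(None)` assumes a POSIX system. The assumption being illustrated is that allocations made in native code bypass Python's allocator, so tools built on Python's object or allocation tracking do not see them.

```python
import ctypes
import tracemalloc


def native_leak_vs_tracemalloc(n_blocks=32, block_size=1 << 20):
    """Leak memory with libc malloc and report what tracemalloc observed.

    Returns (bytes_leaked_natively, bytes_growth_seen_by_tracemalloc).
    """
    libc = ctypes.CDLL(None)  # symbols of the current process (POSIX only)
    libc.malloc.restype = ctypes.c_void_p
    libc.malloc.argtypes = [ctypes.c_size_t]

    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    for _ in range(n_blocks):
        # Never freed: a deliberate native leak that Python knows nothing about.
        libc.malloc(block_size)
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return n_blocks * block_size, after - before


leaked, seen = native_leak_vs_tracemalloc()
print(f"leaked natively: {leaked} bytes, seen by tracemalloc: {seen} bytes")
```

The second number should be tiny compared with the first, which matches the pattern in this thread: process memory grows while the Python-side memory table stays flat.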

cookpa avatar Nov 13 '24 20:11 cookpa