cpython icon indicating copy to clipboard operation
cpython copied to clipboard

typing.py: builtin LRU caches leak references

Open wjakob opened this issue 2 years ago • 5 comments

Bug report

I would like to report a refleak issue involving typing.py. The issue is that it internally uses LRU caches to cache certain type-related lookups, and these caches are not cleaned up when the Python interpreter shuts down. This causes leaks that impede software development.

This specific part of typing.py has already once been identified as a source of refleaks by @gvanrossum (context: https://bugs.python.org/issue28649).

The following provides a small reproducer via a trivial package (https://github.com/wjakob/typing_repro) that exposes a class named A using nanobind. Why nanobind? It is extremely paranoid about any leaks involving bound types, functions, and instances, and prints warning messages to tell the user about this after the interpreter has shut down (it performs checks following finalization using Py_AtExit()).

preparation:

$ pip install git+https://github.com/wjakob/typing_repro

Reproducer:

from typing_repro import A

import pandas
import typing

def test(t: typing.Optional[A] = None):
    print(t)

Running this yields

nanobind: leaked 1 types!                                                                                         
 - leaked type "A"                                                                                                
nanobind: leaked 2 functions!
 - leaked function "add"
 - leaked function "__init__"
nanobind: this is likely caused by a reference counting issue in the binding code.

Note the import of pandas, which serves the role of a bigger package that populates the LRU caches. torch (PyTorch) also works, others likely as well.

Removing the test() function or removing the type annotation fixes the issue. The problem is that declaration causes cache entries to be created that are never cleaned up, even when the interpreter finalizes.

There is another way to avoid the issue: at the bottom of the script, insert

for cleanup in typing._cleanups:
    cleanup()

which clears the LRU caches in typing.py. Poof, errors gone. This leads me to suggest the following simple fix, to be added at the end of typing.py:

import atexit

def cleanup():
    import typing
    for cleanup_func in typing._cleanups:
        cleanup_func()

atexit.register(cleanup)

This will clear the caches and ensure that interpreter finalization can avoid those type annotation-related leaks.

Your environment

  • CPython versions tested on: 3.8.10 and 3.10.;7
  • Operating system and architecture: Linux and macOS

wjakob avatar Oct 13 '22 22:10 wjakob

I'm not overly familiar with memory management around the shutdown sequence, but shouldn't these caches get collected automatically when the module object is deallocated and its dict is cleared? Or should every function that uses @lru_cache register an atexit handler to clear the cache?

JelleZijlstra avatar Oct 14 '22 01:10 JelleZijlstra

I honestly have no idea, but if this really was a problem I'd also expect to have heard of this before. Possibly nanobind's check runs before the typing module is cleaned up? Note that the OP appears to be the author of nanobind.

gvanrossum avatar Oct 14 '22 03:10 gvanrossum

What is potentially confusing is that there are two different kinds of atexit() at play in Python.

  1. nanobind's exit handler is installed via Py_AtExit(). This is a list of up to 32 function pointers that are run after the interpreter is already fully shut down. Most Python API calls including garbage collection are no longer possible at this point.

    Because nanobind maintains some internal data structures for every bound type outside of Python, it knows if its types have all been fully garbage collected. This has been invaluable in tracking down bugs in bindings, and the library is therefore intentionally very noisy upon noticing leaks.

  2. Then, there is the Python atexit module that can be registered from within Python. Those run while the interpreter is still alive. The hack/workaround I suggested above uses that principle to clear the LRU caches in typing.py.

In principle, I agree with @JelleZijlstra's sentiment: the references should be known to Python's cyclic GC, which can then clean things up (including LRU caches related to type annotations). But somehow this is not happening.

wjakob avatar Oct 14 '22 07:10 wjakob

There really does seem to be something broken specifically with the LRU cache and garbage collection.

Python's codebase contains several places where code must manually clean the caches to avoid leaks when running the tests in refleak-hunting mode.

https://github.com/python/cpython/blob/main/Lib/test/libregrtest/utils.py#L211 https://github.com/python/cpython/blob/main/Lib/test/test_typing.py#L74 https://github.com/python/cpython/blob/main/Lib/test/test_typing.py#L5584

Some more context on these changes is given in the 2016 bugtracker issue: https://bugs.python.org/issue28649.

wjakob avatar Oct 14 '22 07:10 wjakob

@wjakob Do your observations persist when using the pure Python version of the lru_cache instead of the C version?

The C version fully participates in GC and should behave like any other container (i.e. we don't normally clear every list, dict, and set prior to shutdown). If the C version is suspected to be buggy, here are some leads that we can follow:

https://mail.python.org/archives/list/[email protected]/thread/C4ILXGPKBJQYUN5YDMTJOEOX7RHOD4S3/ https://bugs.python.org/issue42536 https://bugs.python.org/issue35780

rhettinger avatar Oct 15 '22 22:10 rhettinger

@rhettinger: I am not very familiar with the implementation of the LRU cache but tried the following -- please let me know if that's wrong. Specifically, I went into functools.py and commented out the lines

try:
    from _functools import _lru_cache_wrapper
except ImportError:
    pass

under the assumption that this is what it takes for the cache to switch over to the pure Python version. I observe that it then uses the Python version of the _lru_cache_wrapper function defined in the same file. However, this did not fix the leak:

nanobind: leaked 1 types!
 - leaked type "typing_repro.typing_repro_ext.A"
nanobind: leaked 2 functions!
 - leaked function "add"
 - leaked function "__init__"
nanobind: this is likely caused by a reference counting issue in the binding code.

The issue should be very easy to reproduce with the 5-LOC script in the first post. Does it happen on your end?

wjakob avatar Oct 19 '22 22:10 wjakob

I can reproduce the nanobind message with your sample script, but the leak apparently no longer happens when I comment out import pandas. pandas is a big pile of code, so perhaps whatever leak nanobind detects is actually in pandas?

JelleZijlstra avatar Oct 19 '22 22:10 JelleZijlstra

It happens with other big packages as well (for example, try PyTorch or Tensorflow). What I can see is that the LRU caches are filled with lots of data when pandas is imported, and that seems to play a role. I could see how pandas maybe has some reference counting issues internally, but I don't see how that would cause my own types to no longer be deallocated by typing.py.

The caches are definitely implicated in some form -- adding the following code to the reproducer fixes the issue for example:

import atexit

def cleanup():
    import typing
    for cleanup_func in typing._cleanups:
        cleanup_func()

atexit.register(cleanup)

wjakob avatar Oct 19 '22 22:10 wjakob

I tried to narrow it down by removing parts of pandas's __init__.py, and found that the reported leak goes away when the pandas._libs.json module is no longer imported. That's a Cython module with no apparent relevance to types; I think its code is here.

JelleZijlstra avatar Oct 19 '22 23:10 JelleZijlstra

Ok, I think I found a smoking gun.

Here is another, much smaller extension, that also produces a type leak:

from typing_repro import A

import markupsafe
import typing

def test(t: typing.Optional[A] = None):
    print(t)

produces the dreaded

nanobind: leaked 1 types!
 - leaked type "typing_repro.typing_repro_ext.A"
nanobind: leaked 2 functions!
 - leaked function "__init__"
 - leaked function "add"
nanobind: this is likely caused by a reference counting issue in the binding code.

The markupsafe extension contains a pure Python portion defining a markupsafe.Markup class, which includes type annotations that presumably trigger the LRU cache. But there is also a native C extension markupsafe._speedups, which is tiny, self-contained, and doesn't even use Cython (phew!).

The problem can be tied down to a function that is called as part of the module initialization:

static PyObject* markup;

static int
init_constants(void)
{
	PyObject *module;

	/* import markup type so that we can mark the return value */
	module = PyImport_ImportModule("markupsafe");
	if (!module)
		return 0;
	markup = PyObject_GetAttrString(module, "Markup");
	Py_DECREF(module);

	return 1;
}

/* removed lots of stuff ... */

PyMODINIT_FUNC
PyInit__speedups(void)
{
	if (!init_constants())
		return NULL;

	return PyModule_Create(&module_definition);
}

What's the problem? This extension module internally stashes a reference to the Markup class so that it can be used for internal purposes later on. It also implicitly increases the reference count via the PyObject_GetAttrString method. Adding Py_DECREF(markup); causes the warning to go away. So this is at the end of the day a simple reference leak within the markupsafe package.

HOWEVER: I don't think it is reasonable that this also causes other heap types to leak all across the Python interpreter. And it is specifically typing and the builtin LRU cache that are to blame for this.

wjakob avatar Oct 20 '22 11:10 wjakob

I see are potential workarounds. First, the LRU cache used by typing could potentially be modified so that it holds weak references that don't cause leaks across the program. However, I am not sure how practical that is.

The second one is the band-aid I suggested. The following could be added somewhere in typing.py and would take care of the leak at shutdown time.

def _cleanup_handler():
    for f in _cleanups:
        f()

import atexit as _atexit
_atexit.register(_cleanup_handler)

wjakob avatar Oct 20 '22 14:10 wjakob

Good catch! So the leak chain is markupsafe.Markup -> one of its method has an Optional[str] annotation -> Optional[str] resolves to an instance of typing._UnionGenericAlias -> typing._UnionGenericAlias.__getitem__ holds on to its LRU cache -> that cache contains a reference to typing_repro.A.

I suppose we could put in the cleanup you recommend, but wouldn't it be better to fix the third-party packages that have this bug?

JelleZijlstra avatar Oct 21 '22 00:10 JelleZijlstra

I suppose we could put in the cleanup you recommend, but wouldn't it be better to fix the third-party packages that have this bug?

It would not fix the issue in general – as my tests above showed (try, e.g. replacing markupsafe by pandas, tensorflow, pytorch, etc.) this pattern of extension modules holding a reference to a type is very common. It's a benign kind of leak that does not bother anyone else.

What the LRU cache does is to spread this benign leak into a web of leaks involving anything else that also uses typing. That, I think is the gist of the issue.

wjakob avatar Oct 21 '22 06:10 wjakob

Here is a proposed change to typing.py that fixes this behavior, using yet another approach. The idea is to break the reference chain at the typing._UnionGenericAlias.__getitem__ -> lru_cache segment, by using the function as a key into a global dict storing all of the used LRU caches.

--- typing.py   2022-10-21 08:36:47.000000000 +0200
+++ typing.py   2022-10-21 08:36:54.000000000 +0200
@@ -293,6 +293,7 @@
 
 
 _cleanups = []
+_caches = { }
 
 
 def _tp_cache(func=None, /, *, typed=False):
@@ -300,13 +301,15 @@
     original function for non-hashable arguments.
     """
     def decorator(func):
-        cached = functools.lru_cache(typed=typed)(func)
-        _cleanups.append(cached.cache_clear)
+        cache = functools.lru_cache(typed=typed)(func)
+        _caches[func] = cache 
+        _cleanups.append(cache.cache_clear)
+        del cache
 
         @functools.wraps(func)
         def inner(*args, **kwds):
             try:
-                return cached(*args, **kwds)
+                return _caches[func](*args, **kwds)
             except TypeError:
                 pass  # All real errors (not unhashable args) are raised below.
             return func(*args, **kwds)

I tested this change, and it fixes the refleak issue on my end.

wjakob avatar Oct 21 '22 06:10 wjakob

Just to reiterate so that I don't get lost in the weeds here:

  1. typing.py's LRU cache itself doesn't leak references. It just holds onto other things which may leak references or is held by said other things.
  2. Those other leaky things may hold a circular reference to typing.py's LRU cache through indirect means that are naturally found in typing.py.
  3. So it's proposed to break the reference cycle on typing.py's end, instead of getting the leaky extension to do it.

Is this correct?

Fidget-Spinner avatar Oct 25 '22 12:10 Fidget-Spinner

Yes, that summarizes things well. I would add that typing is an elementary part of the Python ecosystem, which includes many compiled extension with individually benign type leaks that now cause issues through the interaction with typing. Expecting them to be fixed instead would seem to me like an impossible game of whack-a-mole.

wjakob avatar Oct 25 '22 13:10 wjakob

Note that the repro no longer works at the moment because nanobind 0.0.8 contains a workaround for this leak. I proposed wjakob/typing_repro#3 to "fix" it. In the meantime, to use the repro, first install nanobind 0.0.7 and the package's other build requirements, then install typing_repro as in the OP above but with --no-build-isolation.

JelleZijlstra avatar Nov 05 '22 16:11 JelleZijlstra

Thanks for the PR and for being patient, @wjakob

AlexWaygood avatar Nov 30 '22 08:11 AlexWaygood

Thanks for accepting my patch :-)

wjakob avatar Dec 02 '22 13:12 wjakob