cpython
cpython copied to clipboard
typing.py: builtin LRU caches leak references
Bug report
I would like to report a refleak issue involving typing.py
. The issue is that it internally uses LRU caches to cache certain type-related lookups, and these caches are not cleaned up when the Python interpreter shuts down. This causes leaks that impede software development.
This specific part of typing.py
has already once been identified as a source of refleaks by @gvanrossum (context: https://bugs.python.org/issue28649).
The following provides a small reproducer via a trivial package (https://github.com/wjakob/typing_repro) that exposes a class named A
using nanobind
. Why nanobind
? It is extremely paranoid about any leaks involving bound types, functions, and instances, and prints warning messages to tell the user about this after the interpreter has shut down (it performs checks following finalization using Py_AtExit()
).
preparation:
$ pip install git+https://github.com/wjakob/typing_repro
Reproducer:
from typing_repro import A
import pandas
import typing
def test(t: typing.Optional[A] = None):
print(t)
Running this yields
nanobind: leaked 1 types!
- leaked type "A"
nanobind: leaked 2 functions!
- leaked function "add"
- leaked function "__init__"
nanobind: this is likely caused by a reference counting issue in the binding code.
Note the import of pandas
, which serves the role of a bigger package that populates the LRU caches. torch
(PyTorch) also works, others likely as well.
Removing the test()
function or removing the type annotation fixes the issue. The problem is that declaration causes cache entries to be created that are never cleaned up, even when the interpreter finalizes.
There is another way to avoid the issue: at the bottom of the script, insert
for cleanup in typing._cleanups:
cleanup()
which clears the LRU caches in typing.py
. Poof, errors gone. This leads me to suggest the following simple fix, to be added at the end of typing.py
:
import atexit
def cleanup():
import typing
for cleanup_func in typing._cleanups:
cleanup_func()
atexit.register(cleanup)
This will clear the caches and ensure that interpreter finalization can avoid those type annotation-related leaks.
Your environment
- CPython versions tested on: 3.8.10 and 3.10.;7
- Operating system and architecture: Linux and macOS
I'm not overly familiar with memory management around the shutdown sequence, but shouldn't these caches get collected automatically when the module object is deallocated and its dict is cleared? Or should every function that uses @lru_cache
register an atexit handler to clear the cache?
I honestly have no idea, but if this really was a problem I'd also expect to have heard of this before. Possibly nanobind's check runs before the typing module is cleaned up? Note that the OP appears to be the author of nanobind.
What is potentially confusing is that there are two different kinds of atexit()
at play in Python.
-
nanobind
's exit handler is installed viaPy_AtExit()
. This is a list of up to 32 function pointers that are run after the interpreter is already fully shut down. Most Python API calls including garbage collection are no longer possible at this point.Because nanobind maintains some internal data structures for every bound type outside of Python, it knows if its types have all been fully garbage collected. This has been invaluable in tracking down bugs in bindings, and the library is therefore intentionally very noisy upon noticing leaks.
-
Then, there is the Python
atexit
module that can be registered from within Python. Those run while the interpreter is still alive. The hack/workaround I suggested above uses that principle to clear the LRU caches intyping.py
.
In principle, I agree with @JelleZijlstra's sentiment: the references should be known to Python's cyclic GC, which can then clean things up (including LRU caches related to type annotations). But somehow this is not happening.
There really does seem to be something broken specifically with the LRU cache and garbage collection.
Python's codebase contains several places where code must manually clean the caches to avoid leaks when running the tests in refleak-hunting mode.
https://github.com/python/cpython/blob/main/Lib/test/libregrtest/utils.py#L211 https://github.com/python/cpython/blob/main/Lib/test/test_typing.py#L74 https://github.com/python/cpython/blob/main/Lib/test/test_typing.py#L5584
Some more context on these changes is given in the 2016 bugtracker issue: https://bugs.python.org/issue28649.
@wjakob Do your observations persist when using the pure Python version of the lru_cache instead of the C version?
The C version fully participates in GC and should behave like any other container (i.e. we don't normally clear every list, dict, and set prior to shutdown). If the C version is suspected to be buggy, here are some leads that we can follow:
https://mail.python.org/archives/list/[email protected]/thread/C4ILXGPKBJQYUN5YDMTJOEOX7RHOD4S3/ https://bugs.python.org/issue42536 https://bugs.python.org/issue35780
@rhettinger: I am not very familiar with the implementation of the LRU cache but tried the following -- please let me know if that's wrong. Specifically, I went into functools.py
and commented out the lines
try:
from _functools import _lru_cache_wrapper
except ImportError:
pass
under the assumption that this is what it takes for the cache to switch over to the pure Python version. I observe that it then uses the Python version of the _lru_cache_wrapper
function defined in the same file. However, this did not fix the leak:
nanobind: leaked 1 types!
- leaked type "typing_repro.typing_repro_ext.A"
nanobind: leaked 2 functions!
- leaked function "add"
- leaked function "__init__"
nanobind: this is likely caused by a reference counting issue in the binding code.
The issue should be very easy to reproduce with the 5-LOC script in the first post. Does it happen on your end?
I can reproduce the nanobind message with your sample script, but the leak apparently no longer happens when I comment out import pandas
. pandas is a big pile of code, so perhaps whatever leak nanobind detects is actually in pandas?
It happens with other big packages as well (for example, try PyTorch or Tensorflow). What I can see is that the LRU caches are filled with lots of data when pandas
is imported, and that seems to play a role. I could see how pandas maybe has some reference counting issues internally, but I don't see how that would cause my own types to no longer be deallocated by typing.py
.
The caches are definitely implicated in some form -- adding the following code to the reproducer fixes the issue for example:
import atexit
def cleanup():
import typing
for cleanup_func in typing._cleanups:
cleanup_func()
atexit.register(cleanup)
I tried to narrow it down by removing parts of pandas's __init__.py
, and found that the reported leak goes away when the pandas._libs.json
module is no longer imported. That's a Cython module with no apparent relevance to types; I think its code is here.
Ok, I think I found a smoking gun.
Here is another, much smaller extension, that also produces a type leak:
from typing_repro import A
import markupsafe
import typing
def test(t: typing.Optional[A] = None):
print(t)
produces the dreaded
nanobind: leaked 1 types!
- leaked type "typing_repro.typing_repro_ext.A"
nanobind: leaked 2 functions!
- leaked function "__init__"
- leaked function "add"
nanobind: this is likely caused by a reference counting issue in the binding code.
The markupsafe
extension contains a pure Python portion defining a markupsafe.Markup
class, which includes type annotations that presumably trigger the LRU cache. But there is also a native C extension markupsafe._speedups
, which is tiny, self-contained, and doesn't even use Cython (phew!).
The problem can be tied down to a function that is called as part of the module initialization:
static PyObject* markup;
static int
init_constants(void)
{
PyObject *module;
/* import markup type so that we can mark the return value */
module = PyImport_ImportModule("markupsafe");
if (!module)
return 0;
markup = PyObject_GetAttrString(module, "Markup");
Py_DECREF(module);
return 1;
}
/* removed lots of stuff ... */
PyMODINIT_FUNC
PyInit__speedups(void)
{
if (!init_constants())
return NULL;
return PyModule_Create(&module_definition);
}
What's the problem? This extension module internally stashes a reference to the Markup
class so that it can be used for internal purposes later on. It also implicitly increases the reference count via the PyObject_GetAttrString
method. Adding Py_DECREF(markup);
causes the warning to go away. So this is at the end of the day a simple reference leak within the markupsafe
package.
HOWEVER: I don't think it is reasonable that this also causes other heap types to leak all across the Python interpreter. And it is specifically typing
and the builtin LRU cache that are to blame for this.
I see are potential workarounds. First, the LRU cache used by typing
could potentially be modified so that it holds weak references that don't cause leaks across the program. However, I am not sure how practical that is.
The second one is the band-aid I suggested. The following could be added somewhere in typing.py
and would take care of the leak at shutdown time.
def _cleanup_handler():
for f in _cleanups:
f()
import atexit as _atexit
_atexit.register(_cleanup_handler)
Good catch! So the leak chain is markupsafe.Markup
-> one of its method has an Optional[str]
annotation -> Optional[str]
resolves to an instance of typing._UnionGenericAlias
-> typing._UnionGenericAlias.__getitem__
holds on to its LRU cache -> that cache contains a reference to typing_repro.A
.
I suppose we could put in the cleanup you recommend, but wouldn't it be better to fix the third-party packages that have this bug?
I suppose we could put in the cleanup you recommend, but wouldn't it be better to fix the third-party packages that have this bug?
It would not fix the issue in general – as my tests above showed (try, e.g. replacing markupsafe
by pandas
, tensorflow
, pytorch
, etc.) this pattern of extension modules holding a reference to a type is very common. It's a benign kind of leak that does not bother anyone else.
What the LRU cache does is to spread this benign leak into a web of leaks involving anything else that also uses typing
. That, I think is the gist of the issue.
Here is a proposed change to typing.py
that fixes this behavior, using yet another approach. The idea is to break the reference chain at the typing._UnionGenericAlias.__getitem__
-> lru_cache
segment, by using the function as a key into a global dict storing all of the used LRU caches.
--- typing.py 2022-10-21 08:36:47.000000000 +0200
+++ typing.py 2022-10-21 08:36:54.000000000 +0200
@@ -293,6 +293,7 @@
_cleanups = []
+_caches = { }
def _tp_cache(func=None, /, *, typed=False):
@@ -300,13 +301,15 @@
original function for non-hashable arguments.
"""
def decorator(func):
- cached = functools.lru_cache(typed=typed)(func)
- _cleanups.append(cached.cache_clear)
+ cache = functools.lru_cache(typed=typed)(func)
+ _caches[func] = cache
+ _cleanups.append(cache.cache_clear)
+ del cache
@functools.wraps(func)
def inner(*args, **kwds):
try:
- return cached(*args, **kwds)
+ return _caches[func](*args, **kwds)
except TypeError:
pass # All real errors (not unhashable args) are raised below.
return func(*args, **kwds)
I tested this change, and it fixes the refleak issue on my end.
Just to reiterate so that I don't get lost in the weeds here:
- typing.py's LRU cache itself doesn't leak references. It just holds onto other things which may leak references or is held by said other things.
- Those other leaky things may hold a circular reference to typing.py's LRU cache through indirect means that are naturally found in typing.py.
- So it's proposed to break the reference cycle on typing.py's end, instead of getting the leaky extension to do it.
Is this correct?
Yes, that summarizes things well. I would add that typing
is an elementary part of the Python ecosystem, which includes many compiled extension with individually benign type leaks that now cause issues through the interaction with typing
. Expecting them to be fixed instead would seem to me like an impossible game of whack-a-mole.
Note that the repro no longer works at the moment because nanobind 0.0.8 contains a workaround for this leak. I proposed wjakob/typing_repro#3 to "fix" it. In the meantime, to use the repro, first install nanobind 0.0.7 and the package's other build requirements, then install typing_repro
as in the OP above but with --no-build-isolation
.
Thanks for the PR and for being patient, @wjakob
Thanks for accepting my patch :-)