Astroid is leaking memory
Astroid is leaking memory when it is run multiple times over the same file using the API.
Steps to reproduce
- Pick a larger file, e.g. a file from the pylint project: pylint/checkers/variables.py
- Run the following script:
from pylint.lint import Run
from pylint.lint import pylinter

while True:
    Run(["pylint/checkers/variables.py"], exit=False)
    pylinter.MANAGER.clear_cache()
Current behavior
After ~12 iterations, memory goes over 1 GB and keeps growing.
Expected behavior
Allocated memory should not grow indefinitely.
python -c "from astroid import __pkginfo__; print(__pkginfo__.version)"
output
2.13.0-dev0
I did some investigation: I profiled memory usage with muppy:
from pylint.lint import Run
from pylint.lint import pylinter
from pympler import muppy
from pympler import summary

Run(["pylint/checkers/variables.py"], exit=False)
pylinter.MANAGER.clear_cache()

all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
summary.print_(sum1)

for _ in range(15):
    Run(["pylint/checkers/variables.py"], exit=False)
    pylinter.MANAGER.clear_cache()

all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
summary.print_(sum1)
I got the following result:
types | # objects | total size
======================================= | =========== | ============
dict | 221794 | 60.35 MB
list | 212895 | 14.44 MB
str | 67663 | 5.82 MB
int | 185941 | 4.97 MB
tuple | 52200 | 2.78 MB
astroid.nodes.node_classes.Name | 39949 | 1.83 MB
astroid.nodes.node_classes.Const | 36755 | 1.68 MB
code | 8153 | 1.56 MB
tokenize.TokenInfo | 16652 | 1.40 MB
type | 1311 | 1.18 MB
method | 13220 | 826.25 KB
astroid.nodes.node_classes.Attribute | 12100 | 567.19 KB
astroid.nodes.node_classes.AssignName | 11793 | 552.80 KB
astroid.nodes.node_classes.Call | 9234 | 432.84 KB
set | 555 | 389.32 KB
types | # objects | total size
===================================================== | =========== | ============
dict | 2599736 | 764.26 MB
list | 2394537 | 171.48 MB
int | 2635012 | 70.37 MB
str | 387148 | 32.07 MB
astroid.nodes.node_classes.Name | 522253 | 23.91 MB
astroid.nodes.node_classes.Const | 375519 | 17.19 MB
method | 201034 | 12.27 MB
tuple | 231507 | 12.12 MB
astroid.nodes.node_classes.Attribute | 175345 | 8.03 MB
astroid.nodes.node_classes.AssignName | 161694 | 7.40 MB
astroid.nodes.node_classes.Call | 127355 | 5.83 MB
astroid.nodes.node_classes.Arguments | 85926 | 3.93 MB
astroid.nodes.scoped_nodes.scoped_nodes.FunctionDef | 84896 | 3.89 MB
FunctionWrapper | 37788 | 3.46 MB
tokenize.TokenInfo | 39786 | 3.34 MB
Clearly, the majority of the allocated data is dictionaries and lists...
It's hard to know what is inside the dicts/lists without being able to explore the data. But it's probably parts of the AST that are not cleaned up properly, right?
> It's hard to know what is inside the dict/list without being able to explore the data.
I am able to get to that information. Run the following script:
from pylint.lint import Run
from pylint.lint import pylinter
from pympler import muppy
from pympler import summary
from pympler import asizeof

Run(["pylint/checkers/variables.py"], exit=False)
pylinter.MANAGER.clear_cache()

all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
summary.print_(sum1)

obj = []
for o in all_objects:
    if not isinstance(o, dict):
        continue
    try:
        # some dictionary items raise an exception when accessed
        _ = str(o)
    except Exception:
        continue
    obj.append(o)
Note that the cache was cleared, so in theory the memory should not contain any extra astroid information about the AST.
You get a lot of dictionaries like the following:
>>> pprint(obj[-5:])
[{'col_offset': 29,
'end_col_offset': 33,
'end_lineno': 1674,
'fromlineno': 1674,
'lineno': 1674,
'name': 'node',
'parent': <Attribute.name l.1674 at 0x1125b3490>,
'position': None},
{'col_offset': 37,
'end_col_offset': 42,
'end_lineno': 1436,
'fromlineno': 1436,
'lineno': 1436,
'name': 'nodes',
'parent': <Attribute.FunctionDef l.1436 at 0x112589fd0>,
'position': None},
{'col_offset': 46,
'end_col_offset': 59,
'end_lineno': 1876,
'fromlineno': 1876,
'lineno': 1876,
'name': 'default_names',
'parent': <Call l.1876 at 0x111e92fd0>,
'position': None},
{'col_offset': 64,
'end_col_offset': 68,
'end_lineno': 2105,
'fromlineno': 2105,
'lineno': 2105,
'name': 'node',
'parent': <Attribute.node_ancestors l.2105 at 0x11260cfa0>,
'position': None},
{'col_offset': 31,
'end_col_offset': 39,
'end_lineno': 2105,
'fromlineno': 2105,
'lineno': 2105,
'name': 'ref_node',
'parent': <Attribute.parent l.2105 at 0x11260ce80>,
'position': None}]
>>> len(obj)
216564
>>> len([i for i in obj if 'position' in i])
160789
>>> len([i for i in obj if 'parent' in i])
161191
Could you help me understand where in the code base these dictionaries can be created?
It does look like the AST, so probably astroid.parse (?) and/or possibly astroid.NodeNG.infer (?)
Those dictionaries are the arguments that get passed to postinit functions, I think. At least, that is where their reference is according to the gc module.
I tried to do some digging, but I really have no experience here. The only thing I could come up with is that something is sustaining a reference to a number of nodes.NodeNG objects, which prevents them from being garbage collected.
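One way to chase that suspicion is gc.get_referrers, which lists every object that directly references a given one. Here's a minimal sketch using a hypothetical stand-in class (not astroid's actual NodeNG):

```python
import gc

class Node:
    """Hypothetical stand-in for an AST node."""
    def __init__(self, parent=None):
        self.parent = parent

root = Node()
child = Node(parent=root)

# Every object that directly references `root`. Instance attributes are
# reported via the owning object's __dict__, so a leaked node typically
# shows up here through its children's instance dicts.
referrers = gc.get_referrers(root)
assert any(r is child.__dict__ for r in referrers)
```

Applying this to a leaked NodeNG should reveal whatever is keeping it alive, though in practice the referrer chain has to be walked several levels up before reaching the true owner.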
import gc
from io import StringIO

from pylint.lint import Run, pylinter
from pylint.reporters.text import TextReporter
from pympler import muppy, summary

from astroid import nodes

prev_nodes = 0
prev_dicts = 0
for _ in range(5):
    count_nodes = 0
    count_dicts = 0
    Run(
        ["astroid/nodes/node_classes.py"], exit=False, reporter=TextReporter(StringIO())
    )
    pylinter.MANAGER.clear_cache()
    gc.collect()
    all_objects = muppy.get_objects()
    for o in muppy.get_objects():
        if isinstance(o, nodes.NodeNG):
            count_nodes += 1
        elif isinstance(o, dict):
            count_dicts += 1
    print()
    print("Nodes INC:", count_nodes - prev_nodes)
    prev_nodes = count_nodes
    print("Nodes NEW:", prev_nodes)
    print("Dicts INC:", count_dicts - prev_dicts)
    prev_dicts = count_dicts
    print("Dicts NEW:", prev_dicts)
This shows that the increase in both node and dict objects is constant after initial startup.
❯ python test2.py
Nodes INC: 137841
Nodes NEW: 137841
Dicts INC: 68924
Dicts NEW: 68924
Nodes INC: 112234
Nodes NEW: 250075
Dicts INC: 51192
Dicts NEW: 120116
Nodes INC: 112234
Nodes NEW: 362309
Dicts INC: 51191
Dicts NEW: 171307
Nodes INC: 112234
Nodes NEW: 474543
Dicts INC: 51191
Dicts NEW: 222498
Nodes INC: 112234
Nodes NEW: 586777
Dicts INC: 51191
Dicts NEW: 273689
As far as my understanding of garbage collection goes, this shows that the persistence of these objects does not come from garbage-collection randomness but from actual references to these objects where they probably shouldn't be referenced.
My first thought would be to make .parent a weakref to test whether that would have an impact, but that isn't really feasible... Perhaps somebody has a good strategy for understanding why the gc doesn't collect certain objects?
> My first thought would be to make .parent a weakref to test whether that would have an impact, but that isn't really feasible...
I think that's exactly what's going on. The child nodes are still holding references to the module in .parent, so when the linter's refcount goes to zero, the module's refcount never has a chance to go to zero as well. Seems like a classic use case for weakrefs.
That would be a massive breaking change though... so I'm not sure it is feasible. I would definitely want to help explore it.
Yep, 3.0 or nothing! ;)
I think I ran into some issues with the WeakRef PR though, as it breaks some use cases. Not sure it would actually work...
music21 has a very similar use case, with a set of weakref tools here, and then (in some cases) properties for syntactic sugar when setting/retrieving parents.
One track to explore: we could make NodeNG.parent a property, and have its getter and setter call utilities similar to those I linked above. The BC implications would be so small we may not even need to wait for 3.0: calling .parent would still give a NodeNG or None.
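As a rough sketch of that idea (a minimal toy class, assuming nothing about astroid's real NodeNG internals), the setter could store a weak reference and the getter could dereference it, so callers keep seeing a node or None:

```python
import gc
import weakref

class Node:
    """Toy stand-in for NodeNG; stores its parent as a weak reference."""

    def __init__(self, parent=None):
        self.parent = parent

    @property
    def parent(self):
        # Dereference on access: returns the parent node, or None once
        # the parent has been garbage collected (or was never set).
        return self._parent() if self._parent is not None else None

    @parent.setter
    def parent(self, node):
        self._parent = weakref.ref(node) if node is not None else None

module = Node()
name = Node(parent=module)
assert name.parent is module   # callers still see a Node or None

del module                     # the child no longer keeps it alive
gc.collect()
assert name.parent is None     # the weakref now dereferences to None
```

The design trade-off is visible in the last line: a caller that stashes a child node but drops its tree could suddenly observe .parent becoming None, which is the kind of use-case breakage mentioned above for the WeakRef PR.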
Sounds like it is worth exploring. Is that something you would want to build?
Yep, I'm already seeing some promising results. Looking to do a sequence of small PRs.
Here's a version of Daniel's script from https://github.com/PyCQA/astroid/issues/1780#issuecomment-1245592744 that takes care to delete the references created in the measurements:
import gc

from pylint.lint import Run
from pympler import muppy

from astroid import nodes

prev_nodes = 0
prev_dicts = 0
for _ in range(5):
    count_nodes = 0
    count_dicts = 0
    Run(
        ["astroid/nodes/node_classes.py", "--clear-cache-post-run=y"], exit=False
    )
    gc.collect()
    all_objects = muppy.get_objects()
    for o in muppy.get_objects():
        if isinstance(o, nodes.NodeNG):
            count_nodes += 1
        elif isinstance(o, dict):
            count_dicts += 1
    print()
    print("Nodes INC:", count_nodes - prev_nodes)
    prev_nodes = count_nodes
    print("Nodes NEW:", prev_nodes)
    print("Dicts INC:", count_dicts - prev_dicts)
    prev_dicts = count_dicts
    print("Dicts NEW:", prev_dicts)
    del all_objects
    del o
Looks like we've solved the node leaking; just dicts are left:
$ python3 memory.py
--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)
Nodes INC: 102907
Nodes NEW: 102907
Dicts INC: 65091
Dicts NEW: 65091
--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)
Nodes INC: 0
Nodes NEW: 102907
Dicts INC: 11472
Dicts NEW: 76563
--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)
Nodes INC: 0
Nodes NEW: 102907
Dicts INC: 11460
Dicts NEW: 88023
--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)
Nodes INC: 0
Nodes NEW: 102907
Dicts INC: 11460
Dicts NEW: 99483
--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)
Nodes INC: 0
Nodes NEW: 102907
Dicts INC: 11460
Dicts NEW: 110943
The remaining uses seem to derive from the custom caching decorator in astroid. Looking into converting those uses to just cached_property.
If we want to close this out, here are some next steps for 3.0.
The two remaining offenders are _get_assign_nodes() and slots(). We can convert them to cached_property, but we'll need to change how they're called, i.e. from slots() to slots, unless we create a wrapper around cached_property, but that got us into this trouble in the first place.
Unfortunately, there's some hacking around slots that may not make this possible, so we need to decide/benchmark whether caching slots is ultimately even worth it (and we can avoid a deprecation in that case):
https://github.com/PyCQA/astroid/blob/cd0f896672049493b2e3cfc87a327c871f8dd329/astroid/brain/brain_typing.py#L169-L173
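For illustration, the conversion discussed above might look like this minimal sketch (the class and attribute names here are hypothetical stand-ins, not astroid's real implementation):

```python
from functools import cached_property

class FunctionDef:
    """Hypothetical stand-in for an astroid node, not the real class."""

    # Before: a method wrapped in a custom caching decorator that kept
    # results in an external cache, keeping the node alive after use.
    # After: functools.cached_property stores the computed value in the
    # instance's own __dict__, so it is freed together with the node.
    @cached_property
    def assign_nodes(self):
        return ["<expensive traversal result>"]

node = FunctionDef()
first = node.assign_nodes    # computed once, cached on the instance
second = node.assign_nodes   # served straight from node.__dict__
assert first is second
assert "assign_nodes" in node.__dict__
```

Call sites would change from node.assign_nodes() to node.assign_nodes, which is the breaking change mentioned above.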
So, I suggest:
- Deprecate astroid.decorators.cached
- Remove astroid._cache.CacheManager
- Benchmark whether caching slots() is even worth it
- Convert _get_assign_nodes() to cached_property
We don't need to deprecate _cache. All else sounds good; I don't mind breaking APIs as long as pylint benefits from it.
Oh, right, and _get_assign_nodes is private, duh! Edited the to-do list.