Astroid is leaking memory
Astroid is leaking memory when it is run multiple times over the same file using the API.
Steps to reproduce
- Pick a larger file, e.g. a file from the pylint project: pylint/checkers/variables.py
- Run the following script:
from pylint.lint import Run
from pylint.lint import pylinter

while True:
    Run(["pylint/checkers/variables.py"], exit=False)
    pylinter.MANAGER.clear_cache()
Current behavior
After ~12 iterations, memory goes over 1 GB and keeps growing.
Expected behavior
Allocated memory should not grow indefinitely.
python -c "from astroid import __pkginfo__; print(__pkginfo__.version)"
output
2.13.0-dev0
I did some investigation: I profiled memory usage with muppy:
from pylint.lint import Run
from pylint.lint import pylinter
from pympler import muppy
from pympler import summary

Run(["pylint/checkers/variables.py"], exit=False)
pylinter.MANAGER.clear_cache()

all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
summary.print_(sum1)

for _ in range(15):
    Run(["pylint/checkers/variables.py"], exit=False)
    pylinter.MANAGER.clear_cache()

all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
summary.print_(sum1)
I got the following result:
types | # objects | total size
======================================= | =========== | ============
dict | 221794 | 60.35 MB
list | 212895 | 14.44 MB
str | 67663 | 5.82 MB
int | 185941 | 4.97 MB
tuple | 52200 | 2.78 MB
astroid.nodes.node_classes.Name | 39949 | 1.83 MB
astroid.nodes.node_classes.Const | 36755 | 1.68 MB
code | 8153 | 1.56 MB
tokenize.TokenInfo | 16652 | 1.40 MB
type | 1311 | 1.18 MB
method | 13220 | 826.25 KB
astroid.nodes.node_classes.Attribute | 12100 | 567.19 KB
astroid.nodes.node_classes.AssignName | 11793 | 552.80 KB
astroid.nodes.node_classes.Call | 9234 | 432.84 KB
set | 555 | 389.32 KB
types | # objects | total size
===================================================== | =========== | ============
dict | 2599736 | 764.26 MB
list | 2394537 | 171.48 MB
int | 2635012 | 70.37 MB
str | 387148 | 32.07 MB
astroid.nodes.node_classes.Name | 522253 | 23.91 MB
astroid.nodes.node_classes.Const | 375519 | 17.19 MB
method | 201034 | 12.27 MB
tuple | 231507 | 12.12 MB
astroid.nodes.node_classes.Attribute | 175345 | 8.03 MB
astroid.nodes.node_classes.AssignName | 161694 | 7.40 MB
astroid.nodes.node_classes.Call | 127355 | 5.83 MB
astroid.nodes.node_classes.Arguments | 85926 | 3.93 MB
astroid.nodes.scoped_nodes.scoped_nodes.FunctionDef | 84896 | 3.89 MB
FunctionWrapper | 37788 | 3.46 MB
tokenize.TokenInfo | 39786 | 3.34 MB
Clearly, the majority of the allocated data is dictionaries and lists...
It's hard to know what is inside the dicts/lists without being able to explore the data. But it's probably parts of the AST that are not cleaned up properly, right?
> It's hard to know what is inside the dict/list without being able to explore the data.
I am able to get to that information. Run the following script:
from pylint.lint import Run
from pylint.lint import pylinter
from pympler import muppy
from pympler import summary
from pympler import asizeof

Run(["pylint/checkers/variables.py"], exit=False)
pylinter.MANAGER.clear_cache()

all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
summary.print_(sum1)

obj = []
for o in all_objects:
    if not isinstance(o, dict):
        continue
    try:
        # some dictionary items raise an exception when accessed
        _ = str(o)
    except Exception:
        continue
    obj.append(o)
Note that the cache was cleared, so in theory the memory should not contain any extra astroid information about the AST.
You get a lot of dictionaries like the following:
>>> pprint(obj[-5:])
[{'col_offset': 29,
'end_col_offset': 33,
'end_lineno': 1674,
'fromlineno': 1674,
'lineno': 1674,
'name': 'node',
'parent': <Attribute.name l.1674 at 0x1125b3490>,
'position': None},
{'col_offset': 37,
'end_col_offset': 42,
'end_lineno': 1436,
'fromlineno': 1436,
'lineno': 1436,
'name': 'nodes',
'parent': <Attribute.FunctionDef l.1436 at 0x112589fd0>,
'position': None},
{'col_offset': 46,
'end_col_offset': 59,
'end_lineno': 1876,
'fromlineno': 1876,
'lineno': 1876,
'name': 'default_names',
'parent': <Call l.1876 at 0x111e92fd0>,
'position': None},
{'col_offset': 64,
'end_col_offset': 68,
'end_lineno': 2105,
'fromlineno': 2105,
'lineno': 2105,
'name': 'node',
'parent': <Attribute.node_ancestors l.2105 at 0x11260cfa0>,
'position': None},
{'col_offset': 31,
'end_col_offset': 39,
'end_lineno': 2105,
'fromlineno': 2105,
'lineno': 2105,
'name': 'ref_node',
'parent': <Attribute.parent l.2105 at 0x11260ce80>,
'position': None}]
>>> len(obj)
216564
>>> len([i for i in obj if 'position' in i])
160789
>>> len([i for i in obj if 'parent' in i])
161191
Could you help me understand where in the code base these dictionaries can be created?
It does look like the AST, so probably astroid.parse (?) and/or possibly astroid.NodeNG.infer (?)
Those dictionaries are the arguments that get passed to postinit functions, I think. At least, that is where their reference is according to the gc module.
I tried to do some digging, but I really have no experience here. The only thing I could come up with is that something is sustaining a reference to a number of nodes.NodeNG objects, which prevents them from being garbage collected.
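One way to chase that suspicion is gc.get_referrers, which lists every object that directly references a given one. Here's a minimal sketch using a hypothetical stand-in class (not astroid's actual NodeNG):

```python
import gc

class Node:
    """Hypothetical stand-in for an AST node."""
    def __init__(self, parent=None):
        self.parent = parent

root = Node()
child = Node(parent=root)

# Every object that directly references `root`. Instance attributes are
# reported via the owning object's __dict__, so a leaked node typically
# shows up here through its children's instance dicts.
referrers = gc.get_referrers(root)
assert any(r is child.__dict__ for r in referrers)
```

Applying this to a leaked NodeNG should reveal whatever is keeping it alive, though in practice the referrer chain has to be walked several levels up before reaching the true owner.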
import gc
from io import StringIO

from pylint.lint import Run, pylinter
from pylint.reporters.text import TextReporter
from pympler import muppy, summary

from astroid import nodes

prev_nodes = 0
prev_dicts = 0
for _ in range(5):
    count_nodes = 0
    count_dicts = 0
    Run(
        ["astroid/nodes/node_classes.py"], exit=False, reporter=TextReporter(StringIO())
    )
    pylinter.MANAGER.clear_cache()
    gc.collect()
    all_objects = muppy.get_objects()
    for o in muppy.get_objects():
        if isinstance(o, nodes.NodeNG):
            count_nodes += 1
        elif isinstance(o, dict):
            count_dicts += 1
    print()
    print("Nodes INC:", count_nodes - prev_nodes)
    prev_nodes = count_nodes
    print("Nodes NEW:", prev_nodes)
    print("Dicts INC:", count_dicts - prev_dicts)
    prev_dicts = count_dicts
    print("Dicts NEW:", prev_dicts)
This shows that the increase in both node and dict objects is constant after initial startup.
❯ python test2.py
Nodes INC: 137841
Nodes NEW: 137841
Dicts INC: 68924
Dicts NEW: 68924
Nodes INC: 112234
Nodes NEW: 250075
Dicts INC: 51192
Dicts NEW: 120116
Nodes INC: 112234
Nodes NEW: 362309
Dicts INC: 51191
Dicts NEW: 171307
Nodes INC: 112234
Nodes NEW: 474543
Dicts INC: 51191
Dicts NEW: 222498
Nodes INC: 112234
Nodes NEW: 586777
Dicts INC: 51191
Dicts NEW: 273689
As far as my understanding of garbage collection goes, this shows that the persistence of these objects does not come from garbage-collection randomness but from actual references to these objects where they probably shouldn't be referenced.
My first thought would be to make .parent a weakref to test whether that would have an impact, but that isn't really feasible... Perhaps somebody has a good strategy for understanding why the gc doesn't collect certain objects?
> My first thought would be to make .parent a weakref to test whether that would have an impact, but that isn't really feasible...
I think that's exactly what's going on. The child nodes are still holding references to the module in .parent, so when the linter's refcount goes to zero, the module's refcount never has a chance to go to zero as well. Seems like a classic use case for weakrefs.
That would be a massive breaking change though... so I'm not sure it is feasible. I would definitely want to help explore it.
Yep, 3.0 or nothing! ;)
I think I ran into some issues with the WeakRef PR though, as it breaks some use cases. Not sure it would actually work...
music21 has a very similar use case, with a set of weakref tools here, and then (in some cases) properties for syntactic sugar when setting/retrieving parents.
One track to explore: we could make NodeNG.parent a property, and have its getter and setter call utilities similar to those I linked above. The BC implications would be so small we may not even need to wait for 3.0: calling .parent would still give a NodeNG or None.
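As a rough sketch of that idea (a minimal toy class, assuming nothing about astroid's real NodeNG internals), the setter could store a weak reference and the getter could dereference it, so callers keep seeing a node or None:

```python
import gc
import weakref

class Node:
    """Toy stand-in for NodeNG; stores its parent as a weak reference."""

    def __init__(self, parent=None):
        self.parent = parent

    @property
    def parent(self):
        # Dereference on access: returns the parent node, or None once
        # the parent has been garbage collected (or was never set).
        return self._parent() if self._parent is not None else None

    @parent.setter
    def parent(self, node):
        self._parent = weakref.ref(node) if node is not None else None

module = Node()
name = Node(parent=module)
assert name.parent is module   # callers still see a Node or None

del module                     # the child no longer keeps it alive
gc.collect()
assert name.parent is None     # the weakref now dereferences to None
```

The design trade-off is visible in the last line: a caller that stashes a child node but drops its tree could suddenly observe .parent becoming None, which is the kind of use-case breakage mentioned above for the WeakRef PR.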
Sounds like it is worth exploring. Is that something you would want to build?
Yep, I'm already seeing some promising results. Looking to do a sequence of small PRs.
Here's a version of Daniel's script from https://github.com/PyCQA/astroid/issues/1780#issuecomment-1245592744 that takes care to delete the references created in the measurements:
import gc

from pylint.lint import Run
from pympler import muppy

from astroid import nodes

prev_nodes = 0
prev_dicts = 0
for _ in range(5):
    count_nodes = 0
    count_dicts = 0
    Run(
        ["astroid/nodes/node_classes.py", "--clear-cache-post-run=y"], exit=False
    )
    gc.collect()
    all_objects = muppy.get_objects()
    for o in muppy.get_objects():
        if isinstance(o, nodes.NodeNG):
            count_nodes += 1
        elif isinstance(o, dict):
            count_dicts += 1
    print()
    print("Nodes INC:", count_nodes - prev_nodes)
    prev_nodes = count_nodes
    print("Nodes NEW:", prev_nodes)
    print("Dicts INC:", count_dicts - prev_dicts)
    prev_dicts = count_dicts
    print("Dicts NEW:", prev_dicts)
    del all_objects
    del o
Looks like we've solved the node leaking; just dicts are left:
$ python3 memory.py
--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)
Nodes INC: 102907
Nodes NEW: 102907
Dicts INC: 65091
Dicts NEW: 65091
--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)
Nodes INC: 0
Nodes NEW: 102907
Dicts INC: 11472
Dicts NEW: 76563
--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)
Nodes INC: 0
Nodes NEW: 102907
Dicts INC: 11460
Dicts NEW: 88023
--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)
Nodes INC: 0
Nodes NEW: 102907
Dicts INC: 11460
Dicts NEW: 99483
--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)
Nodes INC: 0
Nodes NEW: 102907
Dicts INC: 11460
Dicts NEW: 110943
The remaining uses seem to derive from the custom caching decorator in astroid. Looking into converting those uses to just cached_property.
If we want to close this out, here are some next steps for 3.0.
The two remaining offenders are _get_assign_nodes() and slots(). We can convert them to cached_property, but we'll need to change how they're called, i.e. from slots() to slots, unless we create a wrapper around cached_property, but that got us into this trouble in the first place.
Unfortunately, there's some hacking around slots that may not make this possible, so we need to decide/benchmark whether caching slots is ultimately even worth it (and we can avoid a deprecation in that case):
https://github.com/PyCQA/astroid/blob/cd0f896672049493b2e3cfc87a327c871f8dd329/astroid/brain/brain_typing.py#L169-L173
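For illustration, the conversion discussed above might look like this minimal sketch (the class and attribute names here are hypothetical stand-ins, not astroid's real implementation):

```python
from functools import cached_property

class FunctionDef:
    """Hypothetical stand-in for an astroid node, not the real class."""

    # Before: a method wrapped in a custom caching decorator that kept
    # results in an external cache, keeping the node alive after use.
    # After: functools.cached_property stores the computed value in the
    # instance's own __dict__, so it is freed together with the node.
    @cached_property
    def assign_nodes(self):
        return ["<expensive traversal result>"]

node = FunctionDef()
first = node.assign_nodes    # computed once, cached on the instance
second = node.assign_nodes   # served straight from node.__dict__
assert first is second
assert "assign_nodes" in node.__dict__
```

Call sites would change from node.assign_nodes() to node.assign_nodes, which is the breaking change mentioned above.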
So, I suggest:
- Deprecate astroid.decorators.cached
- Remove astroid._cache.CacheManager
- Benchmark whether caching slots() is even worth it
- Convert _get_assign_nodes() to cached_property
We don't need to deprecate _cache. All else sounds good; I don't mind breaking APIs as long as pylint benefits from it.
Oh, right, and _get_assign_nodes is private, duh! Edited the to-do list.