astroid icon indicating copy to clipboard operation
astroid copied to clipboard

Astroid is leaking memory

Open matusvalo opened this issue 2 years ago • 5 comments

Astroid is leaking memory when is run multiple times over same file using API

Steps to reproduce

  1. Pick larger file, e.g. file from pylint project: pylint/checkers/variables.py
  2. Run following script:
from pylint.lint import Run
from pylint.lint import pylinter
while True:
    Run(["pylint/checkers/variables.py"], exit=False)
    pylinter.MANAGER.clear_cache()

Current behavior

After ~12 iteration memory goes over 1 GB and keeps growing image

Expected behavior

Allocated memory should not rise indefinitelly

python -c "from astroid import __pkginfo__; print(__pkginfo__.version)" output

2.13.0-dev0

matusvalo avatar Sep 13 '22 06:09 matusvalo

I did some investigation - I profiled memory using muppy:

from pylint.lint import Run
from pylint.lint import pylinter
from pympler import muppy
from pympler import summary

Run(["pylint/checkers/variables.py"], exit=False)
pylinter.MANAGER.clear_cache()

all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
summary.print_(sum1)

for _ in range(15):
    Run(["pylint/checkers/variables.py"], exit=False)
    pylinter.MANAGER.clear_cache()

all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
summary.print_(sum1)

I got following result:

                                  types |   # objects |   total size
======================================= | =========== | ============
                                   dict |      221794 |     60.35 MB
                                   list |      212895 |     14.44 MB
                                    str |       67663 |      5.82 MB
                                    int |      185941 |      4.97 MB
                                  tuple |       52200 |      2.78 MB
        astroid.nodes.node_classes.Name |       39949 |      1.83 MB
       astroid.nodes.node_classes.Const |       36755 |      1.68 MB
                                   code |        8153 |      1.56 MB
                     tokenize.TokenInfo |       16652 |      1.40 MB
                                   type |        1311 |      1.18 MB
                                 method |       13220 |    826.25 KB
   astroid.nodes.node_classes.Attribute |       12100 |    567.19 KB
  astroid.nodes.node_classes.AssignName |       11793 |    552.80 KB
        astroid.nodes.node_classes.Call |        9234 |    432.84 KB
                                    set |         555 |    389.32 KB

                                                types |   # objects |   total size
===================================================== | =========== | ============
                                                 dict |     2599736 |    764.26 MB
                                                 list |     2394537 |    171.48 MB
                                                  int |     2635012 |     70.37 MB
                                                  str |      387148 |     32.07 MB
                      astroid.nodes.node_classes.Name |      522253 |     23.91 MB
                     astroid.nodes.node_classes.Const |      375519 |     17.19 MB
                                               method |      201034 |     12.27 MB
                                                tuple |      231507 |     12.12 MB
                 astroid.nodes.node_classes.Attribute |      175345 |      8.03 MB
                astroid.nodes.node_classes.AssignName |      161694 |      7.40 MB
                      astroid.nodes.node_classes.Call |      127355 |      5.83 MB
                 astroid.nodes.node_classes.Arguments |       85926 |      3.93 MB
  astroid.nodes.scoped_nodes.scoped_nodes.FunctionDef |       84896 |      3.89 MB
                                      FunctionWrapper |       37788 |      3.46 MB
                                   tokenize.TokenInfo |       39786 |      3.34 MB

Clearly majority of allocated data are dictionaries and lists...

matusvalo avatar Sep 13 '22 07:09 matusvalo

It's hard to know what is inside the dict/list without being able to explore the data. But it's probably part of the AST that are not cleaned up properly, right ?

Pierre-Sassoulas avatar Sep 13 '22 08:09 Pierre-Sassoulas

It's hard to know what is inside the dict/list without being able to explore the data.

I am able to get to that information. If you run following script:

from pylint.lint import Run
from pylint.lint import pylinter
from pympler import muppy
from pympler import summary
from pympler import asizeof

Run(["pylint/checkers/variables.py"], exit=False)
pylinter.MANAGER.clear_cache()

all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
summary.print_(sum1)

obj = []
for o in all_objects:
    if not isinstance(o, dict):
        continue
    try:
        # some dictionary items are raising exception when accessed
        _ = str(o)
    except Exception:
        continue
    obj.append(o)

Please notify that cache was cleaned so, in theory the memory should not contain any extra astroid information about AST.

You get a lot of following dictionaries:

>>> pprint(obj[-5:])
[{'col_offset': 29,
  'end_col_offset': 33,
  'end_lineno': 1674,
  'fromlineno': 1674,
  'lineno': 1674,
  'name': 'node',
  'parent': <Attribute.name l.1674 at 0x1125b3490>,
  'position': None},
 {'col_offset': 37,
  'end_col_offset': 42,
  'end_lineno': 1436,
  'fromlineno': 1436,
  'lineno': 1436,
  'name': 'nodes',
  'parent': <Attribute.FunctionDef l.1436 at 0x112589fd0>,
  'position': None},
 {'col_offset': 46,
  'end_col_offset': 59,
  'end_lineno': 1876,
  'fromlineno': 1876,
  'lineno': 1876,
  'name': 'default_names',
  'parent': <Call l.1876 at 0x111e92fd0>,
  'position': None},
 {'col_offset': 64,
  'end_col_offset': 68,
  'end_lineno': 2105,
  'fromlineno': 2105,
  'lineno': 2105,
  'name': 'node',
  'parent': <Attribute.node_ancestors l.2105 at 0x11260cfa0>,
  'position': None},
 {'col_offset': 31,
  'end_col_offset': 39,
  'end_lineno': 2105,
  'fromlineno': 2105,
  'lineno': 2105,
  'name': 'ref_node',
  'parent': <Attribute.parent l.2105 at 0x11260ce80>,
  'position': None}]
>>> len(obj)
216564
>>> len([i for i in obj if 'position' in i])
160789
>>> len([i for i in obj if 'parent' in i])
161191

Could you help me understand where in the code base these dictionaries can be created?

matusvalo avatar Sep 13 '22 08:09 matusvalo

It does look like the AST, so probably astroid.parse (?) and/or possibly astroid.NodeNG.infer (?)

Pierre-Sassoulas avatar Sep 13 '22 09:09 Pierre-Sassoulas

Those dictionaries are the arguments that get passed to postinit functions I think. At least, that is where their reference is according to the gc module.

I tried to do some digging, but I really have no experience here. The only thing I could come up with is that something is sustaining a reference to a number of nodes.NodeNG objects which prevent them from being garbage collected.

import gc
from io import StringIO

from pylint.lint import Run, pylinter
from pylint.reporters.text import TextReporter
from pympler import muppy, summary

from astroid import nodes

prev_nodes = 0
prev_dicts = 0

for _ in range(5):
    count_nodes = 0
    count_dicts = 0
    Run(
        ["astroid/nodes/node_classes.py"], exit=False, reporter=TextReporter(StringIO())
    )
    pylinter.MANAGER.clear_cache()
    gc.collect()

    all_objects = muppy.get_objects()
    for o in muppy.get_objects():
        if isinstance(o, nodes.NodeNG):
            count_nodes += 1
        elif isinstance(o, dict):
            count_dicts += 1
    print()
    print("Nodes INC:", count_nodes - prev_nodes)
    prev_nodes = count_nodes
    print("Nodes NEW:", prev_nodes)

    print("Dicts INC:", count_dicts - prev_dicts)
    prev_dicts = count_dicts
    print("Dicts NEW:", prev_dicts)

This shows that the increase in both node and dict objects is constant after initial startup.

❯ python test2.py

Nodes INC: 137841
Nodes NEW: 137841
Dicts INC: 68924
Dicts NEW: 68924

Nodes INC: 112234
Nodes NEW: 250075
Dicts INC: 51192
Dicts NEW: 120116

Nodes INC: 112234
Nodes NEW: 362309
Dicts INC: 51191
Dicts NEW: 171307

Nodes INC: 112234
Nodes NEW: 474543
Dicts INC: 51191
Dicts NEW: 222498

Nodes INC: 112234
Nodes NEW: 586777
Dicts INC: 51191
Dicts NEW: 273689

As far as my understanding of garbage collections goes this show that the persistence of these objects does not come from the garbage collection randomness but from actual references to these objects where they probably shouldn't be referenced.

My first thought would be to make .parent a weakref to test whether that would have an impact, but that isn't really feasible... Perhaps somebody has a good strategy to understanding why the gc doesn't collect certain objects?

DanielNoord avatar Sep 13 '22 15:09 DanielNoord

My first thought would be to make .parent a weakref to test whether that would have an impact, but that isn't really feasible...

I think that's exactly what's going on. The child nodes are still holding references to the module in .parent, so when the linter's refcount goes to zero, the module's refcount never has a chance to go to zero also. Seems like a classic use case for weakrefs.

jacobtylerwalls avatar Mar 04 '23 15:03 jacobtylerwalls

That would be a massive breaking change though... So not sure it is feasible. I would definitely want to help explore it though.

DanielNoord avatar Mar 04 '23 15:03 DanielNoord

Yep, 3.0 or nothing! ;)

jacobtylerwalls avatar Mar 04 '23 15:03 jacobtylerwalls

I think I ran into some issues with the WeakRef PR though as it breaks some use cases. Not sure it would actually work..

DanielNoord avatar Mar 04 '23 15:03 DanielNoord

music21 has a very similar use case, with a set of weakref tools here, and then (in some cases) properties for syntactic sugar when setting / retrieving parents.

One track to explore: we could make NodeNG.parent a property, and have its getter and setter call similar utilities as I linked above. The BC implications would be so small we may not even need to wait for 3.0: calling .parent would still give NodeNG or None.

jacobtylerwalls avatar Mar 04 '23 15:03 jacobtylerwalls

Sounds like it is worth exploring. Is that something you would want to build?

DanielNoord avatar Mar 04 '23 16:03 DanielNoord

Yep, I'm already seeing some promising results. Looking to do a sequence of small PRs.

jacobtylerwalls avatar Mar 04 '23 19:03 jacobtylerwalls

Here's a version of Daniel's script from https://github.com/PyCQA/astroid/issues/1780#issuecomment-1245592744 that takes care to delete the references created in the measurements:

import gc

from pylint.lint import Run
from pympler import muppy

from astroid import nodes

prev_nodes = 0
prev_dicts = 0

for _ in range(5):
    count_nodes = 0
    count_dicts = 0
    Run(
        ["astroid/nodes/node_classes.py", '--clear-cache-post-run=y'], exit=False
    )
    gc.collect()

    all_objects = muppy.get_objects()
    for o in muppy.get_objects():
        if isinstance(o, nodes.NodeNG):
            count_nodes += 1
        elif isinstance(o, dict):
            count_dicts += 1
    print()
    print("Nodes INC:", count_nodes - prev_nodes)
    prev_nodes = count_nodes
    print("Nodes NEW:", prev_nodes)

    print("Dicts INC:", count_dicts - prev_dicts)
    prev_dicts = count_dicts
    print("Dicts NEW:", prev_dicts)
    del all_objects
    del o

jacobtylerwalls avatar Mar 12 '23 12:03 jacobtylerwalls

Looks like we've solved the node leaking, just dicts are left:

$ python3 memory.py

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)


Nodes INC: 102907
Nodes NEW: 102907
Dicts INC: 65091
Dicts NEW: 65091

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)


Nodes INC: 0
Nodes NEW: 102907
Dicts INC: 11472
Dicts NEW: 76563

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)


Nodes INC: 0
Nodes NEW: 102907
Dicts INC: 11460
Dicts NEW: 88023

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)


Nodes INC: 0
Nodes NEW: 102907
Dicts INC: 11460
Dicts NEW: 99483

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)


Nodes INC: 0
Nodes NEW: 102907
Dicts INC: 11460
Dicts NEW: 110943

jacobtylerwalls avatar Mar 12 '23 12:03 jacobtylerwalls

The remaining uses seem to derive from the custom caching decorator in astroid. Looking into converting those uses to just cached_property.

jacobtylerwalls avatar Mar 12 '23 13:03 jacobtylerwalls

If we want to close this out, here are some next steps for 3.0.

The two remaining offenders are _get_assign_nodes() and slots(). We can covert them to cached_property, but we'll need to change how they're called, i.e. from slots() to slots, unless we create a wrapper around cachedproperty, but that got us into this trouble in the first place.

Unfortunately, there's some hacking around slots that may not make this possible, so we need to decide/benchmark if caching slots is ultimately even worth it (and we can avoid a deprecation in that case):

https://github.com/PyCQA/astroid/blob/cd0f896672049493b2e3cfc87a327c871f8dd329/astroid/brain/brain_typing.py#L169-L173

So, I suggest:

  • deprecate astroid.decorators.cached
  • remove astroid._cache.CacheManager
  • Benchmark if caching slots() is even worth it
  • Convert _get_assign_nodes() to cached_property

jacobtylerwalls avatar Mar 12 '23 14:03 jacobtylerwalls

We don't need to deprecate _cache. All else sounds good, I don't mind breaking APIs as long as pylint benefits from it.

DanielNoord avatar Mar 12 '23 14:03 DanielNoord

Oh, right, and _get_assign_nodes is private, duh! Edited to-do list.

jacobtylerwalls avatar Mar 12 '23 14:03 jacobtylerwalls