
Feature: Persistent caching of inference results

Open joshua-cannon-techlabs opened this issue 3 years ago • 8 comments

Steps to reproduce

For fairly large repositories, using inference can be incredibly slow. Running pylint with a rule which leverages inference can take multiple seconds, while running a rule without inference is almost immediate.

From profiling, most of the time is spent traversing the file's dependencies and parsing them. This information is cached (see astroid.manager.AstroidManager.brain), but the cache is in-memory and therefore doesn't persist between runs. This is likely because astroid classes can't be easily serialized (pickling them is unfortunately not a drop-in solution).

Ideally, there would be an option to enable a persistent, file-timestamp-aware cache. If a particular module hasn't changed, astroid could pull its parsed form from the cache (much like Python's caching of compiled bytecode in __pycache__).
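For illustration, a minimal sketch of the kind of timestamp-aware disk cache being proposed. CACHE_DIR and cached_parse are hypothetical names, and, as noted above, astroid nodes don't pickle cleanly, so a real implementation would need a custom serializer:

```python
import os
import pickle
from pathlib import Path

CACHE_DIR = Path(".astroid_cache")  # hypothetical location

def cached_parse(source_path, parse_fn):
    """Run parse_fn(source_path), reusing a disk cache entry as long as
    the source file's mtime hasn't changed since the entry was written."""
    CACHE_DIR.mkdir(exist_ok=True)
    entry = CACHE_DIR / (source_path.replace(os.sep, "_") + ".pkl")
    mtime = os.path.getmtime(source_path)

    if entry.exists():
        with entry.open("rb") as f:
            cached_mtime, payload = pickle.load(f)
        if cached_mtime == mtime:  # source unchanged: cache hit
            return payload

    payload = parse_fn(source_path)  # miss (or stale): do the real work
    with entry.open("wb") as f:
        # Assumes payload is picklable, which astroid modules currently
        # are not -- the serialization problem mentioned above.
        pickle.dump((mtime, payload), f)
    return payload
```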

Of course, a lot of thought has to be put into caching, but the benefits would be on the order of dev-years if not dev-decades IMO.

Current behavior

  • Run pylint on a single file in a large repo with assert-on-tuple as the only rule: maybe ~0.5s on a bad day (example invocations below)
  • Run pylint on the same file with abstract-class-instantiated as the only rule: roughly 7.4s
  • Run pylint twice in the same Python process with abstract-class-instantiated as the only rule: also roughly 7.4s
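For reference, invocations along these lines reproduce the measurements (paths are hypothetical; --disable=all plus --enable selects a single rule):

```shell
# Hypothetical file; enable exactly one rule per run.
pylint --disable=all --enable=assert-on-tuple path/to/file.py
pylint --disable=all --enable=abstract-class-instantiated path/to/file.py
```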

Expected behavior

Running pylint multiple times on the same file ideally doesn't take the same amount of time each time.

python -c "from astroid import __pkginfo__; print(__pkginfo__.version)" output

2.4.2

joshua-cannon-techlabs, Aug 23 '21 14:08

FWIW I'm fully aware that this is no easy undertaking. However, it should at least be documented as a feature request and talked about :smile:

joshua-cannon-techlabs, Aug 23 '21 15:08

I'll add too: AFAICT mypy runs speedily on the same codebase. I assume, since it has a --cache-dir flag, that it is caching post-parse information.

joshua-cannon-techlabs, Aug 23 '21 15:08

Thank you for opening the issue; I agree with everything you said. I would add that making astroid's performance better would save gigawatts of electricity. What I'm saying is that AWS should sponsor this issue :smile:

Pierre-Sassoulas, Aug 23 '21 15:08

In the meantime I think I'll play around with running pylint as part of a daemon in order to get the performance improvement on dev boxes (since it'll keep the manager brain in memory). Then, in the daemon process, evict modules from the brain when files change.
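A rough sketch of the eviction half of that idea, assuming astroid's process-wide MANAGER and its astroid_cache dict (attribute names vary across astroid versions); the polling arrangement is illustrative only:

```python
import os

from astroid import MANAGER  # the process-wide AstroidManager

def evict_stale_modules(mtimes):
    """Drop cached modules whose source file changed on disk.
    mtimes maps module name -> last seen mtime."""
    for modname, module in list(MANAGER.astroid_cache.items()):
        path = getattr(module, "file", None)
        if not path or not os.path.exists(path):
            continue
        current = os.path.getmtime(path)
        if modname in mtimes and mtimes[modname] != current:
            MANAGER.astroid_cache.pop(modname, None)  # re-parsed on next use
        mtimes[modname] = current

# The daemon would call evict_stale_modules(mtimes) between lint
# requests, either on a timer or from a file-watcher callback.
```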

joshua-cannon-techlabs, Aug 23 '21 15:08

I've been playing around with this in my free time and am getting close to a really good proof-of-concept/MVP.

Hopefully by next weekend I'll have a PR to show!

thejcannon, Sep 26 '21 16:09

Cross-posting it here to preserve it for later discussions.

I'm not convinced we'll see any noticeable improvement with this approach. As far as I'm aware, most time is spent during the actual inference. Parsing the AST, by contrast, is fairly quick. So much so that I'd bet we won't see measurable changes when constructing it from cache instead.

A more useful approach IMO would be to try to cache the inference results for each file, although it still needs to be investigated to what extent that is possible.

Originally posted by @cdce8p in https://github.com/PyCQA/astroid/pull/1194#pullrequestreview-787496871

cdce8p, Nov 06 '21 13:11
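A minimal sketch of what per-file inference-result caching could look like, purely as a shape for discussion: none of these names are astroid API, and keying results by node position is an assumption, not a worked-out design.

```python
import os

# path -> (mtime, {(lineno, col_offset): [inference results]})
_inference_cache = {}

def cached_infer(node, infer_fn):
    """Return cached inference results for node, recomputing via
    infer_fn on a miss or when the node's source file has changed."""
    path = node.root().file  # may be None for built-in/compiled modules
    mtime = os.path.getmtime(path)
    stamp, results = _inference_cache.get(path, (None, {}))
    if stamp != mtime:
        results = {}  # file changed: drop everything inferred from it
        _inference_cache[path] = (mtime, results)
    key = (node.lineno, node.col_offset)
    if key not in results:
        results[key] = list(infer_fn(node))  # inference yields; materialize
    return results[key]
```

Even this in-memory version hand-waves the hard part: an inference result can depend on other modules, so invalidating by a single file's mtime alone is not actually sound.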

(Perhaps the title should be edited to reflect the actual work needing done)

Cross-posting my reply:

I agree the AST-parsing caching by itself wouldn't be sufficient (which is why there isn't any code that actually persists the AST to disk yet). I assumed this would be the first step that would eventually lead to inference caching (since the inference results would need to be persistable).

thejcannon, Nov 06 '21 14:11

@joshua-cannon-techlabs said:

From profiling, most of the time is spent traversing the file's dependencies and parsing them.

@cdce8p said:

As far as I'm aware, most time is spent during the actual inference. Parsing the AST, by contrast, is fairly quick.

This sounds like we need some quantification, which tox -e profile_against_external or tox -e benchmark in pylint should give us.

My experience (when introducing https://github.com/PyCQA/pylint/pull/3498 and #3473) was more in line with @joshua-cannon-techlabs, in that there's a lot of time spent in up-front file work, which seems to be multiplied by the number of jobs being run - suggesting that under -j N astroid does the same work N times instead of just once. I'm running tox -e profile_against_external and will confirm. Update: the profile from running pylint against numpy is attached (profile-numpy-2021-11-22.zip) - I've not yet looked at it myself.

So the implication is that a potential quick win is to generate the cache/AST on the main thread and distribute it (as explicitly read-only) to the sub-processes. An even faster win might be to move the multiprocessing to the astroid loop in check_files(), but that's probably out of scope for now.
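One way to prototype that "parse once on the main thread, share read-only" idea on Unix is to rely on the fork start method, so workers inherit the parent's warm in-memory cache via copy-on-write and no pickling is needed. A sketch, assuming MANAGER.ast_from_file behaves as in current astroid:

```python
import multiprocessing as mp

from astroid import MANAGER  # the process-wide AstroidManager

def warm_cache(paths):
    """Parse every file once in the parent so the in-memory astroid
    cache is populated before any worker is forked."""
    for path in paths:
        MANAGER.ast_from_file(path)

def module_name(path):
    # In a forked child this hits the inherited, already-warm cache
    # instead of re-parsing (copy-on-write; treated as read-only).
    return MANAGER.ast_from_file(path).name

if __name__ == "__main__":
    files = ["a.py", "b.py"]  # hypothetical inputs
    warm_cache(files)
    with mp.get_context("fork").Pool(2) as pool:  # "fork" is Unix-only
        print(pool.map(module_name, files))
```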

That said, I very much agree with @cdce8p that a simple cache isn't the be-all and end-all. The cache has to match the use case, which here means answering both of the following:

  1. how do we reduce the amount of inference work [re]done?
  2. how do we get that inference work done faster?

My AST know-how is rusty, but IIRC you should be able to generate a dependency tree that can be dirtied based on changes to the tree. Then a partial solution to (1.) is that the work to be done is contingent on that dirty state. The majority of checkers could adhere to that dirty state, but some wouldn't, I imagine (edge cases etc.).

You should then also get, for free, the ability to better multi-thread tasks that operate on that tree, which partially solves (2.).
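A toy sketch of that dirty-propagation idea (names are illustrative, not astroid API): each module records its dependents, and dirtying one module transitively dirties everything that imports it, so only dirty modules need re-inference.

```python
from collections import defaultdict

class DepGraph:
    def __init__(self):
        self.dependents = defaultdict(set)  # module -> modules importing it
        self.dirty = set()

    def add_import(self, importer, imported):
        self.dependents[imported].add(importer)

    def mark_dirty(self, module):
        """Mark module and everything that (transitively) imports it."""
        stack = [module]
        while stack:
            mod = stack.pop()
            if mod in self.dirty:
                continue
            self.dirty.add(mod)
            stack.extend(self.dependents[mod])

# Usage: dirtying a leaf dependency dirties its whole import chain.
g = DepGraph()
g.add_import("pkg.app", "pkg.utils")
g.add_import("pkg.utils", "pkg.core")
g.mark_dirty("pkg.core")
assert g.dirty == {"pkg.core", "pkg.utils", "pkg.app"}
```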

Considering the similarities checker, we could/should use the cache directly rather than creating yet another copy of the data in memory.

doublethefish, Nov 22 '21 13:11