Feature: Persistent caching of inference results
Steps to reproduce
For fairly large repositories, using inference can be incredibly slow. Running pylint with a rule which leverages inference can take multiple seconds, while running a rule without inference is almost immediate.
From profiling, most of the time is spent traversing the file's dependencies and parsing them as well. This information is cached (see astroid.manager.AstroidManager.brain), however the cache is in-memory and therefore doesn't last between runs. This is likely due to the fact that astroid classes can't be easily serialized (pickling them is unfortunately not a drop-in solution).
Ideally, there would be an option to enable persistent caching which is file-timestamp aware. If a particular module hasn't been changed, astroid could pull from the cache (exactly like Python's caching of bytecode).
Of course, a lot of thought has to be put into caching, but the benefits would be on the order of dev-years if not dev-decades IMO.
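To make the idea concrete, here is a minimal sketch of the timestamp check this would involve, modelled on how CPython invalidates .pyc files. The cache directory and the build/loads/dumps callables are hypothetical placeholders; how to actually serialize astroid trees is exactly the open problem described above.

```python
# Sketch only: mtime-based invalidation, analogous to CPython's .pyc handling.
# CACHE_DIR, build, loads and dumps are hypothetical stand-ins, not astroid API.
import os
import pathlib

CACHE_DIR = pathlib.Path(".astroid_cache")

def cached_tree(source_path, build, loads, dumps):
    """Return a cached tree if the source file is unchanged, else rebuild and store it."""
    mtime = os.path.getmtime(source_path)
    cache_file = CACHE_DIR / (pathlib.Path(source_path).name + ".cache")
    if cache_file.exists():
        stored_mtime, payload = loads(cache_file.read_bytes())
        if stored_mtime == mtime:      # same check as .pyc: unchanged source -> reuse
            return payload
    tree = build(source_path)          # e.g. parse the module the usual way
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file.write_bytes(dumps((mtime, tree)))
    return tree
```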
Current behavior
- Run pylint on a single file in a large repo with assert-on-tuple as the only rule: maybe ~0.5s on a bad day
- Run pylint on the same file with abstract-class-instantiated as the only rule: roughly 7.4s
- Run pylint twice in the same Python process with abstract-class-instantiated as the only rule: also roughly 7.4s (a rough way to reproduce these timings is sketched below)
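For reference, one way to measure timings like these, assuming pylint is installed and using a placeholder file name (some_module.py) inside the large repo:

```python
# Time a pylint run with a single rule enabled, via the CLI in a subprocess.
# --disable=all / --enable=<rule> is standard pylint usage; the target file is a placeholder.
import subprocess
import sys
import time

def time_pylint(rule: str, target: str) -> float:
    start = time.perf_counter()
    subprocess.run(
        [sys.executable, "-m", "pylint", "--disable=all", f"--enable={rule}", target],
        capture_output=True,
    )
    return time.perf_counter() - start

print("assert-on-tuple:            ", time_pylint("assert-on-tuple", "some_module.py"))
print("abstract-class-instantiated:", time_pylint("abstract-class-instantiated", "some_module.py"))
```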
Expected behavior
Running pylint multiple times on the same file ideally doesn't take the same amount of time each time.
python -c "from astroid import __pkginfo__; print(__pkginfo__.version)"
output
2.4.2
FWIW I'm fully aware that this is no easy undertaking. However it should at least be documented as a feature request and talked about :smile:
I'll add too, AFAICT mypy runs speedily on the same codebase. I assume since it has a cache-dir flag it is caching post-parsed information.
Thank you for opening the issue, I agree with everything you said. I would add that making astroid's performance better would save gigawatts of electricity. What I'm saying is that AWS should sponsor this issue :smile:
In the meantime I think I'll play around with running pylint as part of a daemon in order to get the performance improvement on dev boxes (since it'll keep the manager brain in-memory). Then in the daemon process, eject modules from the brain when files change.
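A rough sketch of that daemon idea, assuming astroid 2.x where the manager's brain is exposed as the astroid_cache dict keyed by module name; the WATCHED mapping, the module names, and the polling loop are illustrative only:

```python
# Sketch: long-lived process that evicts changed modules from the in-memory brain
# so later runs re-parse only what changed. WATCHED and the loop are hypothetical.
import os
import time
import astroid

WATCHED = {"mypkg.mod": "mypkg/mod.py"}  # module name -> source path (placeholder)

def evict_stale(last_seen: dict) -> None:
    for modname, path in WATCHED.items():
        mtime = os.path.getmtime(path)
        if mtime > last_seen.get(modname, 0.0):
            # Drop the stale entry; astroid will rebuild it on next access.
            astroid.MANAGER.astroid_cache.pop(modname, None)
            last_seen[modname] = mtime

def serve_forever() -> None:
    last_seen: dict = {}
    while True:
        evict_stale(last_seen)
        # ...accept a lint request here and run pylint against the warm brain...
        time.sleep(1.0)
```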
I've been playing around with this in my free time and am getting close to a really good proof-of-concept/MVP.
Hopefully by next weekend I'll have a PR to show!
Cross-posting it here to preserve it for later discussions.
-- I'm not convinced we'll see any noticeable improvement with this approach. As far as I'm aware, most time is spent during the actual inference. Parsing the AST, by contrast, is fairly quick. So much so that I might bet we won't see measurable changes when constructing it from a cache instead.
A more useful approach IMO would be to try to cache the inference results for each file, although it still needs to be investigated to what extent that is possible.
Originally posted by @cdce8p in https://github.com/PyCQA/astroid/pull/1194#pullrequestreview-787496871
(Perhaps the title should be edited to reflect the actual work that needs to be done)
Cross-posting my reply:
I agree that caching the AST parsing itself wouldn't be sufficient (hence there isn't any code that actually persists the AST to disk yet). I assumed this would be the first step that would eventually lead to inference caching (since the inference results would need to be persistable).
@joshua-cannon-techlabs said:
From profiling, most of the time is spent traversing the file's dependencies and parsing them as well.
@cdce8p said:
As far as I'm aware, most time is spent during the actual inference. Parsing the AST, by contrast, is fairly quick.
This sounds like we need some quantification, which tox -e profile_against_external or tox -e benchmark in pylint should give us. Here is a profile graph from running pylint against numpy: profile-numpy-2021-11-22.zip
My experience (when introducing https://github.com/PyCQA/pylint/pull/3498 and #3473) was more in line with @joshua-cannon-techlabs, in that there's a lot of time spent in upfront file-work, which seems to be multiplied by the number of jobs being run - which suggests that under -j N astroid does the same work N times instead of just once. I'm running the tox -e profile_against_external and will confirm. Update: the SVG from the numpy profile is attached - I've not yet looked at it myself. profile-numpy-2021-11-22.zip
So, the implication is that a potential fast win is to generate the cache/AST on the main thread and distribute that (as explicitly read-only) to the sub-procs. An even faster win might be to move the multiprocessing to the astroid loop in check_files(), but that's probably out of scope for now.
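A minimal sketch of that "build once, share with workers" idea, assuming a POSIX platform where the fork start method is available and an astroid 2.x manager. This is not how pylint's -j implementation actually works, and the file paths are placeholders.

```python
# Warm the astroid brain in the parent, then let fork()ed workers inherit it
# copy-on-write, so each worker's ast_from_file() call is mostly a cache hit.
import multiprocessing as mp
import astroid

FILES = ["pkg/a.py", "pkg/b.py"]  # placeholder targets

def check_file(path: str) -> str:
    module = astroid.MANAGER.ast_from_file(path)  # inherited cache -> cheap
    return f"{path}: {len(module.body)} top-level statements"

if __name__ == "__main__":
    for path in FILES:               # do the expensive parsing exactly once
        astroid.MANAGER.ast_from_file(path)

    ctx = mp.get_context("fork")
    with ctx.Pool(processes=2) as pool:
        for line in pool.map(check_file, FILES):
            print(line)
```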
That said, I very much agree with @cdce8p that a simple cache isn't the be-all and end-all. The cache has to match the use-case, which in this case would be both of the following:
1. How do we reduce the amount of inference work [re]done?
2. How do we get that inference work done faster?
My AST know-how is rusty, but IIRC you should be able to generate a dependency tree that can be dirtied based on changes to the tree. Then a partial solution to (1) is that the work to be done is contingent on that dirty state. The majority of checkers could adhere to that dirty state, but some wouldn't, I imagine (edge cases etc.).
You should then also get, for free, the ability to better multi-thread tasks that operate on that tree, which partially solves (2).
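As a toy illustration of that dirty-able dependency tree (names like DepGraph and mark_changed are invented for the sketch, not astroid API):

```python
# Track which modules depend on which, and dirty all dependents when a file changes,
# so only the dirtied modules need their inference redone.
from collections import defaultdict

class DepGraph:
    def __init__(self) -> None:
        self.dependents = defaultdict(set)  # imported module -> modules importing it
        self.dirty = set()

    def add_dependency(self, module: str, imported: str) -> None:
        self.dependents[imported].add(module)

    def mark_changed(self, module: str) -> None:
        stack = [module]
        while stack:
            mod = stack.pop()
            if mod in self.dirty:
                continue
            self.dirty.add(mod)
            stack.extend(self.dependents[mod])

graph = DepGraph()
graph.add_dependency("app.views", "app.models")
graph.add_dependency("app.models", "app.db")
graph.mark_changed("app.db")
print(sorted(graph.dirty))  # ['app.db', 'app.models', 'app.views'] need re-inference
```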
Considering similar-checker, we could/should use the cache directly rather than create yet another copy of the data in memory.