formatting-stack icon indicating copy to clipboard operation
formatting-stack copied to clipboard

Idea: don't re-analyze files depending on their SHA

Open vemv opened this issue 4 years ago • 2 comments

Problem statement

Any given linter (particularly Eastwood) can be slow.

For the use case of "running branch-formatter integrated with the repl", one can observe that one might be linting again files that haven't been modified.

For example:

  • I'm working in a long-lived branch that has touched 30 files
    • accordingly, the branch-formatter strategy will dictate that 30 files must be analyzed per lint! invocation
  • But I'm only actively working in a smaller subset, like 1 or 2 files
    • The other files touched by the branch may have been already analyzed, which makes re-analyzing them redundant.

Proposal

Implement a strategy that removes files if all these conditions are met:

  • the given file has already been successfully analyzed
    • e.g. has warnings or empty warnings, but no exception was thrown
  • the SHA of the file's contents hasn't changed
    • (File#lastModified can also work)
  • no linter warnings have been emitted for this particular file
    • this is important - otherwise one would show linter warnings just once, and then elide them in subsequent runs because the file hadn't changed.

Caveats

It's plausible that fixing a given file's linting result does not depend entirely on the file itself: fixing it might be accomplished by modifying a different file.

That could cause stale caches or such.

However I'm not immediately aware if such a case. Still worth a think, on a per-linter basis.

Applicability

This optimization is only possibly useful in a local dev repl, so out of caution, I would not add the proposed strategy to the stack if (System/getenv "CI").

Other

I think that the cache should be keyed per git-branch and possibly deleted on each detected git branch change anyway. This is a basic caution against the mentioned Caveats.

Alternatives and comparison

Cache linters' reports. e.g. (caching-linter/new (linters.kondo/new {}))

Might perform better.

cc/ @thumbnail

vemv avatar Feb 19 '21 13:02 vemv

@thumbnail I'm tempted to give a shot to the alternative you suggested. I have the vague feeling that it can be something thin to implement.

IOW, this wouldn't conflict with architectural redesigns - we could always scrap my work, but keep the integration tests.

Assuming the resulting PR is indeed thin, does this plan SGTY?

vemv avatar Mar 19 '21 12:03 vemv

I think a wrapper is the most viable option. Simply using lastModified as a cache key to prevent files from being linted, and merge previous/cached reports in the end result.

I think the branch/project formatter shouldn't use the cached versions by default, that way CI doesn't either.

thumbnail avatar Mar 19 '21 17:03 thumbnail