nogo: improve cache friendliness
Context
See #4326. I'm collecting out-of-band discussion with @linzhp, @tyler-french, and @fmeum.
In short, Uber is using Nogo at scale, and a recent correctness fix has caused Nogo to be much slower. They have about 150 analyzers (a mix of the standard vet analyzers, staticcheck, and some custom packages). They also have a large nogo.json file that changes frequently.
We'd prefer not to revert the correctness fix (#4268), but we do want to find ways to ensure Nogo actions hit the cache more frequently.
Background
- A Nogo analyzer is an `analysis.Analyzer` from the `golang.org/x/tools/go/analysis` framework (its `Run` function receives an `analysis.Pass`).
- All the analyzers for a repo get compiled into the same binary. This means changing the set of analyzers necessarily invalidates the cache.
- Nogo actions are run separately from Go compilation. Currently, a single action runs all analyzers on a single package.
- A Nogo analyzer can produce facts, which are data about a package that can be consumed by analyzers run on importing packages. For example, the `printf` analyzer records which functions defined in a package accept printf format strings. (A minimal sketch of a fact-producing analyzer follows this list.)
- We currently collect facts from all analyzers into a single output file per package.
- Each Nogo action takes facts files from all transitively imported packages as inputs. This probably isn't necessary: we should take facts only from directly imported packages.
- `nogo.json` determines which files are included in and excluded from analysis. This mainly controls whether we report findings. We generally need to run Nogo on all transitively imported packages to generate facts, whether or not they're excluded.
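To make the facts mechanism concrete, here is a minimal sketch of a fact-producing analyzer, loosely modeled on the `printf` example above. The package name, fact type, and heuristic are invented for illustration; this is not code from rules_go or `golang.org/x/tools`.

```go
// wrapcheck is an illustrative analyzer that exports a fact for every
// package-level function whose signature looks like a printf wrapper.
// Analyzers run on importing packages can consume that fact instead of
// re-analyzing this package's source.
package wrapcheck

import (
	"go/types"

	"golang.org/x/tools/go/analysis"
)

// isPrintfWrapper is the fact attached to function objects. Fact types must
// be gob-encodable and implement analysis.Fact; they end up in the
// per-package facts file discussed above.
type isPrintfWrapper struct{}

func (*isPrintfWrapper) AFact() {}

var Analyzer = &analysis.Analyzer{
	Name:      "wrapcheck",
	Doc:       "records which functions accept printf-style format strings",
	FactTypes: []analysis.Fact{&isPrintfWrapper{}},
	Run:       run,
}

func run(pass *analysis.Pass) (interface{}, error) {
	scope := pass.Pkg.Scope()
	for _, name := range scope.Names() {
		fn, ok := scope.Lookup(name).(*types.Func)
		if !ok {
			continue
		}
		sig := fn.Type().(*types.Signature)
		params := sig.Params()
		// Very rough heuristic: a variadic function whose second-to-last
		// parameter is a string is treated as a printf wrapper.
		if sig.Variadic() && params.Len() >= 2 &&
			types.Identical(params.At(params.Len()-2).Type(), types.Typ[types.String]) {
			pass.ExportObjectFact(fn, &isPrintfWrapper{})
		}
	}
	return nil, nil
}
```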
Changes proposed
These are all suggestions, not final recommendations. Each needs some investigation.
- Hoist `nogo.json` into Starlark, moving it from execution time to analysis time. Instead of passing it as an input file to each action, we can pass the appropriate arguments to the nogo binary (a rough sketch of such a flag interface follows this list). This should reduce cache invalidation when the configuration changes: if we change the set of excluded files in package A, that shouldn't invalidate results in package B.
- Only pass facts files for direct imports as inputs; don't include facts for transitive imports. This should reduce cache invalidation when an irrelevant transitively imported package changes. For example, say package A imports B, and B imports `fmt`, but A does not. A does not need facts from `fmt`: if B defines a Printf wrapper, that will be reflected in B's facts, and if B does not define a Printf wrapper, A doesn't care what happens in `fmt`.
- Link analyzers into separate binaries; run analyzers as separate actions; write facts to separate files. This may reduce cache invalidation when the set of analyzers changes, or when facts change in a way that transitively affects a large subset of packages.
  - Caveat: this needs more thought and careful analysis. A major expansion of the action graph would increase Bazel server memory requirements. There's also overhead for each action executed, and these would be very small actions.
  - This might make sense when there are a large number of analyzers that could be sharded into a smaller number of binaries / actions / files.
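For the first bullet, one way the "arguments instead of an input file" idea could look from the nogo binary's side is sketched below. The flag names and their format are purely hypothetical (today the driver reads `nogo.json`); the point is that only entries relevant to the package being analyzed would appear on the command line.

```go
// Hypothetical flag surface for the nogo binary, for illustration only.
// Only the exclusions that apply to the current package would be passed,
// so unrelated edits to the repo-wide configuration would not change this
// action's command line.
package main

import (
	"flag"
	"fmt"
	"strings"
)

// repeatedFlag collects a flag that may be passed multiple times, e.g.
//   -skip_file printf=foo/generated.pb.go -skip_analyzer unusedresult
type repeatedFlag []string

func (f *repeatedFlag) String() string     { return strings.Join(*f, ",") }
func (f *repeatedFlag) Set(v string) error { *f = append(*f, v); return nil }

func main() {
	var skipFiles, skipAnalyzers repeatedFlag
	flag.Var(&skipFiles, "skip_file", "analyzer=path pair to suppress findings for (repeatable)")
	flag.Var(&skipAnalyzers, "skip_analyzer", "analyzer to skip entirely for this package (repeatable)")
	flag.Parse()

	// Build the per-analyzer suppression sets the driver would consult before
	// reporting a diagnostic. Facts would still be produced regardless.
	excluded := make(map[string]map[string]bool)
	for _, pair := range skipFiles {
		analyzer, path, ok := strings.Cut(pair, "=")
		if !ok {
			continue
		}
		if excluded[analyzer] == nil {
			excluded[analyzer] = make(map[string]bool)
		}
		excluded[analyzer][path] = true
	}
	fmt.Println("skipped analyzers:", skipAnalyzers)
	fmt.Println("per-analyzer excluded files:", excluded)
}
```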
One more idea you could consider: keep the nogo.json file as-is (to avoid forcing a migration on users), but add an intermediate action for each package that removes annotations from the config file that don't apply to the current package. That should be a cheap action, and then you have a higher chance of getting a cache hit on the nogo action itself.
An extra action is a possibility, but we may also be able to do something clever with a module extension or repository rule, translating an existing nogo.json file into Starlark. It would be good to preserve existing configuration files if we can.
> add an intermediate action for each package that removes annotations from the config file that don't apply to the current package.
Determining whether a given regex can match a suffix of a given string is unfortunately not so easy. We would need to run an action per Go target, which doesn't seem unreasonable.
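As a rough illustration of what such a per-target action could do, the sketch below drops every `exclude_files` pattern that does not match any of the target's source files. It assumes the documented `nogo.json` layout (analyzer-name keys with `only_files`/`exclude_files` maps from regex to description) and ignores other per-analyzer fields; the tool name and command-line interface are invented. `only_files` is deliberately left untouched, since dropping an entry there would widen an analyzer's scope rather than narrow it.

```go
// trimconfig is a sketch of the per-target config-trimming action: given the
// repo-wide nogo.json and the current target's source files, keep only the
// exclude_files regexes that actually match one of those files.
//
// Hypothetical usage: trimconfig nogo.json trimmed.json src1.go src2.go ...
package main

import (
	"encoding/json"
	"os"
	"regexp"
)

// analyzerConfig models only the fields this sketch cares about; real
// configs may carry additional fields per analyzer.
type analyzerConfig struct {
	OnlyFiles    map[string]string `json:"only_files,omitempty"`
	ExcludeFiles map[string]string `json:"exclude_files,omitempty"`
}

func trim(cfg map[string]analyzerConfig, srcs []string) map[string]analyzerConfig {
	matchesAny := func(pattern string) bool {
		re, err := regexp.Compile(pattern)
		if err != nil {
			return true // keep patterns we cannot interpret
		}
		for _, src := range srcs {
			if re.MatchString(src) {
				return true
			}
		}
		return false
	}
	out := make(map[string]analyzerConfig)
	for name, ac := range cfg {
		trimmed := analyzerConfig{
			// only_files is copied verbatim: removing an entry would widen
			// the analyzer's scope, so it needs more care than this sketch.
			OnlyFiles:    ac.OnlyFiles,
			ExcludeFiles: map[string]string{},
		}
		for pat, why := range ac.ExcludeFiles {
			if matchesAny(pat) {
				trimmed.ExcludeFiles[pat] = why
			}
		}
		out[name] = trimmed
	}
	return out
}

func main() {
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	var cfg map[string]analyzerConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		panic(err)
	}
	trimmed, err := json.MarshalIndent(trim(cfg, os.Args[3:]), "", "  ")
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile(os.Args[2], trimmed, 0o644); err != nil {
		panic(err)
	}
}
```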
> but we may also be able to do something clever with a module extension or repository rule, translating an existing nogo.json file into Starlark
This runs into the problem that there is no way to match regular expressions in Starlark. While we could probably implement a subset that's sufficiently useful, it would be difficult to document what works and what doesn't.
I'm thinking of the following compromise:
- Add `per_analyzer_included_files` and `per_analyzer_excluded_files` attributes (names very much TBD) to the `nogo` rule that are of type `string_list_dict`, matching `only_files` and `exclude_files` from the JSON format. Also add `analyzer_flags` so that we can get rid of the JSON file entirely, but that's a trivial change we can ignore for now.
- Make the patterns for included and excluded files globs, not regexes. Globs are reasonably powerful and pretty simple to match efficiently in Starlark (there is even an implementation in skylib that we could reuse).
- Crucially, apply the filters at analysis time and only communicate the relevant result to `nogo` as additional arguments (say, list only the files to skip per analyzer, or entire analyzers to skip).
- Provide a migration tool that consumes a `nogo.json` and emits copy-pastable Starlark (a rough sketch follows this list). If any of your regexes are not easily converted to globs, the tool would leave fixing that to you.
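Such a migration tool could be fairly small. The sketch below is illustrative only: it assumes the same `nogo.json` layout as in the earlier sketch and prints dict literals for the proposed (still hypothetical) `per_analyzer_included_files`/`per_analyzer_excluded_files` attributes, copying regexes verbatim and leaving the regex-to-glob conversion to the user.

```go
// nogo2starlark is a sketch of the proposed migration tool: read an existing
// nogo.json and print Starlark dict literals suitable for the proposed
// per_analyzer_included_files / per_analyzer_excluded_files attributes.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"sort"
)

type analyzerConfig struct {
	OnlyFiles    map[string]string `json:"only_files,omitempty"`
	ExcludeFiles map[string]string `json:"exclude_files,omitempty"`
}

// printDict emits one string_list_dict attribute, with analyzers and
// patterns sorted for stable, reviewable output.
func printDict(attr string, pick func(analyzerConfig) map[string]string, cfg map[string]analyzerConfig) {
	fmt.Printf("    %s = {\n", attr)
	names := make([]string, 0, len(cfg))
	for name := range cfg {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		patterns := pick(cfg[name])
		if len(patterns) == 0 {
			continue
		}
		keys := make([]string, 0, len(patterns))
		for pat := range patterns {
			keys = append(keys, pat)
		}
		sort.Strings(keys)
		fmt.Printf("        %q: [\n", name)
		for _, pat := range keys {
			fmt.Printf("            %q,  # TODO: convert regex to glob\n", pat)
		}
		fmt.Printf("        ],\n")
	}
	fmt.Printf("    },\n")
}

func main() {
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	var cfg map[string]analyzerConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		panic(err)
	}
	printDict("per_analyzer_included_files", func(a analyzerConfig) map[string]string { return a.OnlyFiles }, cfg)
	printDict("per_analyzer_excluded_files", func(a analyzerConfig) map[string]string { return a.ExcludeFiles }, cfg)
}
```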
This would also improve usability - compared to JSON, Starlark has comments and you can share common values.
We should also think about whether we can resolve https://github.com/bazel-contrib/rules_go/issues/4284 along the way. We may be able to work with `@my_external_repo//style/**/gl*obs` patterns and resolve them to the corresponding `external/my_external_repo+canonical+name/style/**/gl*obs` patterns.
👋 I think we might be hitting this issue.
Take one of our fairly small builds (596 targets): we're seeing a huge performance hit with nogo enabled on rules_go > v0.53.0.
This build is a mix of Go, Protobuf, custom code generation, and OCI image rules. When performing a build without any caches, here are some numbers:
| rules_go | without nogo | with nogo |
|---|---|---|
| v0.53 | 402s (8705 actions) | 415s (10738 actions) |
| v0.55 | 388s (8705 actions) | 1366s (16299 actions) |
We run ~200 analyzers (from `golang.org/x/tools`, `honnef.co/go/tools/staticcheck`, and some custom ones). As you can see, the slowdown with nogo after upgrading is pretty dramatic. This causes significant developer frustration, since it slows down builds and has some poor interactions with the gopls + VS Code integration.
I'm happy to open another issue to talk about this specific issue if desired.
@Geethree Does this only affect the first, full build after the update or also subsequent builds? We are running nogo on more targets now, but most of those runs should be cached after the first build unless you update dependencies.