bazel-gazelle

Some go_repository gazelle generations are slow, could be pre-computed or cached

Open alexeagle opened this issue 2 years ago • 18 comments

A common pathological case is @com_github_docker_docker which can spend hundreds of seconds generating BUILD files. Users are stuck waiting for this to run anytime the repository rule is invalidated.

@siggisim had the clever solution of checking in the resulting generated BUILD files rather than computing on-the-fly, using a script that converts to http_archive like https://github.com/buildbuddy-io/buildbuddy/blob/14ac4f6d9e3df578e31c129add910d176a63d075/deps.bzl#L876-L884

On Slack, @sluongng proposes that the results are cacheable: https://bazelbuild.slack.com/archives/CDBP88Z0D/p1647012317774269

Naively speaking, a Gazelle cache is fairly easy to implement:

  • Enable/disable by setting an environment variable pointing to a path
  • Only cache if the go_repository sha256 is set; use it as the cache key
  • Under $HOME/.cache/gazelle, store the patch file under ./<sha256>.patch
  • If the cache entry is not found, generate the patch, then store it in the cache

alexeagle avatar Mar 11 '22 17:03 alexeagle

The problem is compounded by the fact that Bazel reports this time as "Fetching", e.g. Fetching @com_github_docker_docker; fetching 400s which sends users down a completely wrong path for how to remediate it.

https://github.com/bazelbuild/bazel-gazelle/pull/1191 fixes that by adding progress reporting, thanks @sluongng for the idea.

alexeagle avatar Mar 11 '22 17:03 alexeagle

Under $HOME/.cache/gazelle, store the patch file under ./<sha256>.patch

Thinking on this a bit more, it's worth noting that this design solves the problem only for Go, not for other languages / extensions. So perhaps a more appropriate cache path would be $HOME/.cache/gazelle/go/<sha256>.patch, or alternatively $HOME/.cache/gazelle/go/h1:<sha256-base64>.patch if we are using the sum value from the go.sum file.

So the name of the environment variable flag would probably need to be something Go specific as well.

Other Gazelle extensions / languages may have their own mechanisms for generating external dependencies, so it's not a good idea to try to cover those use cases here. Instead, we can document that external dependencies can be large and that file-walking can be an expensive operation; extension authors are therefore encouraged to implement their own language-specific BUILD file generation cache for such cases. 🤔

sluongng avatar Mar 11 '22 17:03 sluongng

So if I am understanding the proposal correctly, it is to cache a patch file of Gazelle generations for each repository, with sha256/sum being the key.

Aren't the majority of changes to a go_repository rule caused by a version change (and corresponding sha256/sum change)? Meaning this cache would only be helpful when a go_repository rule is invalidated due to other (hopefully less frequent) reasons, such as a Bazel/gazelle upgrade or a change in the repo config (WORKSPACE dependency is added or removed).

How frequently are you seeing go_repository rules being re-evaluated outside of sha256/sum changes? Maybe there is a bug or something about your repo's configuration that is causing them to be re-evaluated more often than they should be?

blico avatar Mar 11 '22 17:03 blico

@blico You raised a good point here.

In most cases today, such a BUILD file generation cache would not be useful at all. That's because Gazelle's file walk is relatively fast for most Go repositories.

However, for cases such as https://github.com/docker/docker, https://github.com/googleapis/google-api-go-client, or a small internal project depending on a big internal monorepo, the file walk is expensive. If we could eliminate the need to walk files for these external dependencies, it would be a lot better.

Here is an example setup: generate these cache files in CI, then for each GitHub Codespaces / GitPod dev environment, download the cache from CI and pre-seed it. Using this cache in combination with GO_REPOSITORY_USE_HOST_CACHE would cut the time for go_repository down to almost instant. This is especially relevant when you have to cater to an audience who don't fully trust Bazel's cache and tend to run bazel clean --expunge && rm -rf ~/.cache/bazel when they run into issues.

Similar to GO_REPOSITORY_USE_HOST_CACHE, I think this BUILD gen cache is best left as an optional, opt-in feature.

sluongng avatar Mar 11 '22 18:03 sluongng

Heh, coincidentally I was just looking at hellish Docker dependencies. I found the best way to deal with this is just to use the vendor folder, which can be achieved by downloading an archive rather than using go mod download.

go_repository(
    name = "com_github_docker_cli",
    build_file_proto_mode = "disable_global",
    importpath = "github.com/docker/cli",
    sha256 = "5aa34f34b2c6f21ade2227edd08b803d972842d1abd5e70d7e4e758f966e4e3c",
    strip_prefix = "cli-20.10.13",
    urls = ["https://github.com/docker/cli/archive/refs/tags/v20.10.13.tar.gz"],
)  # keep

go_repository(
    name = "com_github_docker_docker",
    build_file_proto_mode = "disable_global",
    importpath = "github.com/docker/docker",
    sha256 = "bd82205bbb39e835c4e9c691c9d5bb8dea40ace847f80b030c736faba95b0b67",
    strip_prefix = "moby-20.10.13",
    urls = ["https://github.com/moby/moby/archive/refs/tags/v20.10.13.tar.gz"],
)  # keep

I wonder if it could also make sense to rename the files vendor.mod and vendor.sum to go.mod and go.sum respectively? I don't know how well Gazelle would deal with that, since they would not be included in update-repos...

Anyway, I hope the above snippet is helpful for some.

uhthomas avatar Mar 11 '22 19:03 uhthomas

Thinking a bit further on this topic:

An obvious downside of implementing such a cache at the Starlark level is that changes in Gazelle's build config (or repo config) would not be recognized; e.g., updating a WORKSPACE directive such as # gazelle:repository_macro would be really hard to detect.

To make the build config part of the cache key computation, we would need to implement the cache in the Go layer of Gazelle, when it's called with gazelle -go_repository_mode, and perhaps add a flag such as -enable_build_file_cache <cache-dir>.

This would definitely increase the complexity of the implementation significantly, but the accuracy may be desirable, as updating a Gazelle directive should invalidate the cache.

sluongng avatar Mar 13 '22 13:03 sluongng

@uhthomas could you explain a bit more how that helps to avoid using go mod download? Doesn't gazelle still spend just as long generating the BUILD files? Or is that instead a workaround for the repository invalidation?

alexeagle avatar Mar 13 '22 14:03 alexeagle

I'm actually not too sure how long Gazelle spends generating those build files but I can't imagine it's very long. Here's a benchmark comparing github.com/moby/moby with go mod download versus vendor.

Rules

go_repository(
    name = "com_github_docker_docker",
    build_file_proto_mode = "disable_global",
    importpath = "github.com/docker/docker",
    sha256 = "bd82205bbb39e835c4e9c691c9d5bb8dea40ace847f80b030c736faba95b0b67",
    strip_prefix = "moby-20.10.13",
    urls = ["https://github.com/moby/moby/archive/refs/tags/v20.10.13.tar.gz"],
)  # keep

go_repository(
    name = "com_github_docker_docker_mod",
    build_file_proto_mode = "disable_global",
    importpath = "github.com/docker/docker",
    replace = "github.com/moby/moby",
    sum = "h1:EzlPdbJmYtz9QwFkLNvnEF8LoVFddaBzBsJjBH9eMHc=",
    version = "v20.10.13+incompatible",
)

Benchmark

❯ hyperfine -p 'bazel clean --expunge' -w 1 'bazel build @com_github_docker_docker//api' 'bazel build @com_github_docker_docker_mod//api' --export-markdown out.md
Benchmark 1: bazel build @com_github_docker_docker//api
  Time (mean ± σ):     79.276 s ±  5.773 s    [User: 0.070 s, System: 0.116 s]
  Range (min … max):   71.379 s … 84.786 s    10 runs
 
Benchmark 2: bazel build @com_github_docker_docker_mod//api
  Time (mean ± σ):     367.550 s ± 85.166 s    [User: 0.079 s, System: 0.585 s]
  Range (min … max):   292.416 s … 574.268 s    10 runs
 
Summary
  'bazel build @com_github_docker_docker//api' ran
    4.64 ± 1.13 times faster than 'bazel build @com_github_docker_docker_mod//api'
Command                                           Mean [s]          Min [s]  Max [s]  Relative
bazel build @com_github_docker_docker//api        79.276 ± 5.773    71.379   84.786   1.00
bazel build @com_github_docker_docker_mod//api    367.550 ± 85.166  292.416  574.268  4.64 ± 1.13

The times are inclusive of fetching and building both the Go SDK and Gazelle, so 70s is really not bad for vendor.

It seems like the Go toolchain just gets really confused when presented with github.com/docker/docker and spends ages trying to resolve dependencies. The output will typically spam thousands of lines of go: downloading .... See https://gist.github.com/uhthomas/21dba34284bc9365178aeb78543d8702.

uhthomas avatar Mar 13 '22 20:03 uhthomas

so 70s is really not bad for vendor

I think this is subjective: (1) how significant 70s is, and (2) the actual value can vary depending on file system performance.

For (1), I would like to note that in my case, where I try to keep the P50 of a CI pipeline below 5 minutes, a constant cost of 70s is quite significant (a 23% increase). I can imagine how folks optimizing for P90 might look at a 70s increment with 🙄 .


For (2), here are some additional stats:

I used

http_archive(
    name = "com_github_docker_docker_with_patch",
    patch_args = [
        "-s",
        "-p0",
    ],
    patches = ["@//:com_github_docker_docker.patch"],
    sha256 = "358b4cdf0e2f11b7a99087e349b41d0f5a0a5b9da83b0ef0609ad08160bcd5f0",
    strip_prefix = "moby-363e9a88a11be517d9e8c65c998ff56f774eb4dc",
    urls = ["https://github.com/docker/docker/archive/363e9a88a11be517d9e8c65c998ff56f774eb4dc.zip"],
)

load("@bazel_gazelle//:deps.bzl", "go_repository")

go_repository(
    name = "com_github_docker_docker",
    build_file_proto_mode = "disable_global",
    importpath = "github.com/docker/docker",
    sha256 = "358b4cdf0e2f11b7a99087e349b41d0f5a0a5b9da83b0ef0609ad08160bcd5f0",
    strip_prefix = "moby-363e9a88a11be517d9e8c65c998ff56f774eb4dc",
    urls = ["https://github.com/docker/docker/archive/363e9a88a11be517d9e8c65c998ff56f774eb4dc.zip"],
)  # keep


> bazel build --nobuild '@com_github_docker_docker//...' --profile=profile.json

Both targets should use Bazel's native downloader and cache the zip in repository_cache. The patch file is copy-pasted from BuildBuddy's repository.

Here is what the profiling looks like (screenshot of the Bazel profile trace omitted).

Note that the purple stack on the left is com_github_docker_docker_with_patch (a second screenshot zooms in on it).

Some stats:

//external:com_github_docker_docker_with_patch (duration 1s 932ms 240us)
  • download_and_extract: 1s 250ms 845us
  • patch: 649ms 364us

//external:com_github_docker_docker (duration 36s 252ms 314us)
  • download_and_extract: 1s 166ms 577us
  • env_execute: 35s 33ms 298us

So Gazelle scales linearly with the complexity of the work tree: the more files and directories, the more syscalls Gazelle and related extensions need to make to the file system, and thus the more expensive generation becomes. On network file systems with limited bandwidth, such as EBS or NFS, this slowness can be amplified significantly.

# Directory count
> fd --type d '.*' $(bazel info output_base 2>/dev/null)/external/com_github_docker_docker | wc -l
    1304

# Bazel package count    
> for i in $(seq 1 10); do 
  echo "depth: $i"; 
  fd --min-depth $i --max-depth $i BUILD.bazel $(bazel info output_base 2>/dev/null)/external/com_github_docker_docker |\
    wc -l; 
done

depth: 1
       0
depth: 2
      25
depth: 3
     119
depth: 4
      89
depth: 5
     135
depth: 6
     246
depth: 7
     256
depth: 8
      91
depth: 9
      30
depth: 10
       0

Caching the result in a patch file essentially makes the generation step constant-time, which can be desirable for some use cases.

sluongng avatar Mar 14 '22 10:03 sluongng

@sluongng You mention you are optimizing for your p50 CI times here, do your CI nodes keep a Bazel output_base cache between runs, or does every run use a "fresh" cache? To add to my earlier post, Gazelle should only have to generate build files on fresh nodes, or when a rule has been invalidated, meaning it ideally shouldn't contribute to your p50 latency.

To optimize go_repository rules on "fresh CI nodes" I see value in pre-warming them with the "patch cache" that you are proposing. Although, I do wonder if a more general approach would be more suitable here. For example, all of Bazel's output_base/external directory could be pre-loaded.

@uhthomas Since the majority of the time spent for github.com/docker/docker is trying to resolve dependencies, I think https://github.com/bazelbuild/bazel-gazelle/issues/602 would be a good optimization that I can prioritize adding. tldr; We can make it so go_repository rules do not need to make network calls when resolving dependencies.

blico avatar Mar 14 '22 18:03 blico

Invalidations can be pretty frequent if you're keeping gazelle and rules_go up-to-date in your repo. Then engineers switch back and forth between branches where one is rebased after such an update and the other isn't. Then they see this Fetching @com_github_docker_docker blocking their work a good portion of the time.

alexeagle avatar Mar 14 '22 18:03 alexeagle

I tried to mimic the buildbuddy vendor.sh script, however I found that no matter how precisely I matched up the version, it caused different code to run than the go_repository had fetched.

So I made a different recipe for the workaround documented here: https://github.com/bazelbuild/bazel-gazelle/pull/1198 which is basically just to hack the existing spot where Gazelle generation is run and have it dump the result into a patch file you can copy back out.

alexeagle avatar Mar 14 '22 23:03 alexeagle

With #1201 it took me only 16s to cleanly fetch github.com/docker/docker. Although not as fast as the patch file approach, the PR provides an immediate speed-up for all go_repository rules without requiring any additional overhead (checking in patch files, etc.).

@alexeagle @uhthomas @sluongng do you want to give that PR a try?

blico avatar Mar 16 '22 00:03 blico

@blico that's fantastic, it did seem like just a needless de-optimization happening.

I tested your PR, get this timing locally:

# invalidate the go_repository, e.g. add `build_file_name = "BUILD.bazel,BUILD,invalidate1"`
% time bazel fetch @com_github_docker_docker//api
INFO: All external dependencies fetched successfully.
Loading: 20 packages loaded
bazel fetch @com_github_docker_docker//api  0.05s user 0.07s system 0% cpu 18.656 total

compared with the original network access de-opt

% time bazel fetch @com_github_docker_docker//api
Loading: 0 packages loaded
    Fetching @com_github_docker_docker; Running gazelle to generate BUILD files. Too slow? See https://github.com/bazelbuild/bazel-gazelle/issues/1190 159s
INFO: All external dependencies fetched successfully.
Loading: 1 packages loaded
bazel fetch @com_github_docker_docker//api  0.05s user 0.07s system 0% cpu 2:10.30 total

and compared with the checked-in patch file

% time bazel fetch @com_github_docker_docker//api
INFO: All external dependencies fetched successfully.
Loading: 1 packages loaded
bazel fetch @com_github_docker_docker//api  0.04s user 0.05s system 0% cpu 17.775 total

So I think you've totally solved this issue 🥳

alexeagle avatar Mar 16 '22 21:03 alexeagle

I have investigated this Docker fetch issue. As I understand it, the situation is as follows:

  • my project uses only a small part of the github.com/docker/docker package;
  • because github.com/docker/docker has no go.mod file, go mod commands don't see the whole module dependency graph;
  • go mod tidy on my project adds to go.mod and go.sum only the transitive dependencies of the small part of github.com/docker/docker that it uses;
  • gazelle generates the WORKSPACE based on go build info (provided by the go mod download -json command) and so has an incomplete github.com/docker/docker module dependency list (just enough to build my project), but gazelle generates BUILD files for the whole github.com/docker/docker module;
  • for every unknown package, gazelle calls the go get -d command and spends a lot of time (in my case, ~20 min in total).

As a result, gazelle spends a lot of time resolving transitive dependencies of unused third-party code.

My workaround:

  • collect the go get -d packages via #1170;
  • add these "unknown" dependencies to a file guarded by a never-built go build tag (these constructs add the missing dependencies to the go.mod file):
//go:build tools
// +build tools

package tools

import (
	_ "github.com/moby/buildkit"
	_ "github.com/olivere/env"
	_ "rsc.io/letsencrypt"
	...
)

bozaro avatar Mar 28 '22 05:03 bozaro

Theoretically, Gazelle could simply ignore unknown dependencies without executing the go get -d command (these dependencies are not listed in go.mod because they are located in unused packages). As I understand it, this should be correct when the dependency list is based on a go.mod file, but it may break the behavior in other exotic cases.

bozaro avatar Mar 28 '22 05:03 bozaro

Theoretically, Gazelle could simply ignore unknown dependencies without executing the go get -d command (these dependencies are not listed in go.mod because they are located in unused packages).

Today I added a patch that replaces the defaultModInfo function with a stub:

// Never consult the network to map an import path to a module;
// unknown imports are simply left unresolved.
func defaultModInfo(rc *RemoteCache, importPath string) (modPath string, err error) {
	return "", nil
}

It works perfectly on our go.mod-based project.

bozaro avatar Apr 14 '22 14:04 bozaro

I tried to understand when it is useful to resolve an unknown import package, and couldn't come up with a case:

  • if the code with the unknown import is not used, then it does not matter;
  • if the code with the unknown import is used, then compilation will still fail anyway, since this repository is not in the WORKSPACE.

bozaro avatar Apr 14 '22 19:04 bozaro