rules_go icon indicating copy to clipboard operation
rules_go copied to clipboard

building with race detection (`@io_bazel_rules_go//go/config:race`) is very slow especially if using Go binaries to generate code

Open rickystewart opened this issue 2 years ago • 10 comments

Over at Cockroach we are having this issue where building a go_test with --@io_bazel_rules_go//go/config:race takes a very long time. The fact that the stdlib needs to be recompiled with race detection is a problem, but more relevant seems to be that building a target with race detection invalidates the cache for all go_binarys used to produce that target, including go_binarys that are built as exec_tools for a genrule, and causes all of them to be re-built with race detection. If you use Go code to generate other Go code (as we do), the problem is now twofold:

  • previous builds of the go_binary for code generation cannot be reused, so the binary (and presumably all or most of its dependencies) needs to be rebuilt with race detection enabled (a slower build process in and of itself).
  • because of the overhead imposed by the race detector, running the binary itself becomes much slower (potentially for no reason as I probably don't care about races in code generators, and probably only care about the code under test).

Since our code generators produce identical output regardless of whether race detection is enabled, it's very wasteful to re-build the code generators and then run a much slower version of the binary as part of the race build.

Presumably this behavior is OK by default, but there should be a workaround. I have tried updating the go_binary declaration for these code generating binaries to have race = "off" but it appears to have no effect -- setting the value on the command line overrides everything in BUILD files. It would be sufficient if I could have a way to specify that certain targets really do not need to be built with race detection despite the command-line flags.

What version of rules_go are you using?

0.32

What version of gazelle are you using?

0.25

What version of Bazel are you using?

5.1.0

Does this issue reproduce with the latest releases of all the above?

Yes

What operating system and processor architecture are you using?

Darwin ARM64

Any other potentially useful information about your toolchain?

(None)

What did you do?

Tried setting race = "off" for code-generating go_binary target when building a target that depends on that target as an exec_tool.

What did you expect to see?

Ideally the code-generating binary would not be built with race-detection especially if I set race = "off" manually.

What did you see instead?

Code-generating binary is built with race detection, invalidating the cache and slowing the entire build down.

rickystewart avatar Jun 28 '22 16:06 rickystewart

Just for clarity, the linked issue is https://github.com/cockroachdb/cockroach/issues/81314 as the link in the OP text is broken.

uhthomas avatar Jun 29 '22 14:06 uhthomas

race = "off" not having an effect sounds like a bug in itself. Overall, however, I think that this FR is better served by a different mechanism: We could introduce @io_bazel_rules_go//go/config:tool_race to govern race detection for targets build in the exec configuration.

@rickystewart Given that you are experiencing the pain of the issue, would you be interested in submitting a PR to fix the race = "off" issue and/or add the tool_race flag? If you are interested, I could provide pointers and guidance along the way.

fmeum avatar Jun 29 '22 17:06 fmeum

We encountered this issue in the Figma monorepo.

The fact that the stdlib needs to be recompiled with race detection is a problem

For us, this is the primary issue, as building the stdlib causes Bazel to get OOM-killed on our CI executor (4GB RAM), even with --local_{ram,cpu}_resources configured appropriately. For some reason it seems to get built twice in parallel, once to bazel-out/k8-opt-exec-2B5CBBC6 and once to bazel-out/k8-opt-exec-2B5CBBC6-ST-8f6eb6207b9d.

go test -race does not have this issue; presumably it uses a stdlib prebuilt with race detection. Could rules_go do the same?

jfirebaugh avatar Aug 24 '22 01:08 jfirebaugh

Bazel now has an API for adding resource requirements to actions. It is still experimental, but we could optionally use it for stdlib actions.

I am working on fixing the issue that the stdlib is rebuilt more often than necessary, but that may also require flags.

fmeum avatar Nov 30 '22 07:11 fmeum

We ran into this as well recently. The speed is not an issue thanks to RBE. However, we got this error when race is enabled via a flag.

ERROR: /private/var/tmp/_bazel_sluongng/06e573a93bc2d6a9cad4ad41f00b4310/external/io_bazel_rules_go/BUILD.bazel:42:7: in stdlib rule @@io_bazel_rules_go//:stdlib:
Traceback (most recent call last):
        File "/private/var/tmp/_bazel_sluongng/06e573a93bc2d6a9cad4ad41f00b4310/external/io_bazel_rules_go/go/private/rules/stdlib.bzl", line 33, column 20, in _stdlib_impl
                go = go_context(ctx)
        File "/private/var/tmp/_bazel_sluongng/06e573a93bc2d6a9cad4ad41f00b4310/external/io_bazel_rules_go/go/private/context.bzl", line 458, column 20, in go_context
                mode = get_mode(ctx, toolchain, cgo_context_info, go_config_info)
        File "/private/var/tmp/_bazel_sluongng/06e573a93bc2d6a9cad4ad41f00b4310/external/io_bazel_rules_go/go/private/mode.bzl", line 107, column 13, in get_mode
                fail("race instrumentation can't be enabled when cgo is disabled. Check that pure is not set to \"off\" and a C/C++ toolchain is configured.")
Error in fail: race instrumentation can't be enabled when cgo is disabled. Check that pure is not set to "off" and a C/C++ toolchain is configured.
ERROR: /private/var/tmp/_bazel_sluongng/06e573a93bc2d6a9cad4ad41f00b4310/external/io_bazel_rules_go/BUILD.bazel:42:7: Analysis of target '@@io_bazel_rules_go//:stdlib' failed

The issue goes away if I were to remove the go_binary included in //... that has pure = "on" enabled explicitly.

Setting race = "off" directly on the target does not have any effect. I think the flag --@io_bazel_rules_go//go/config:race would override the setting on the target.


I think a surefire way to fix this is to provide a mechanism to enforce race setting on a binary, regardless of manual flag set. It's probably gonna be a transition on top of the go_binary target to enforce the setting.

I was looking into transition.bzl as well as go_cross_binary for this purpose, but I don't think these rules can enforce config:race to a certain value for my use case. @fmeum do you know if a new transition is required for this purpose? Or we could reuse go_cross_binary somehow?

sluongng avatar Jan 24 '24 12:01 sluongng

I created a quick POC of the transition mentioned above ☝️ here https://github.com/buildbuddy-io/buildbuddy/pull/5748 if anyone wants a quick solution.

sluongng avatar Jan 24 '24 13:01 sluongng

@sluongng I would support a PR that simply ignores the value of race if pure is set to on explicitly instead of failing. Would that allow you to get rid of the additional transition?

fmeum avatar Jan 24 '24 13:01 fmeum

Regarding the original issue (and @rickystewart): I now think that we should probably just disable race for targets built in the exec configuration instead of making this comfortable.

More generally, we would want to revisit and specify how our build settings should change under the exec transition. For example, linkmode can currently affect (that is, break) aspect tools, which is something that @tyler-french ran into.

I see two ways to achieve this today:

  • In go_context, check ctx.bin_dir for the substring -exec- and set other defaults/overrides depending on this.
  • In go_transition, inspect the value of //command_line_option:is exec configuration (no typo, there really are spaces in this name) as a transition input (if this can be used) and modify the transition behavior based on that information.

Contributions would be very welcome.

fmeum avatar Jan 24 '24 13:01 fmeum

@sluongng I would support a PR that simply ignores the value of race if pure is set to on explicitly instead of failing. Would that allow you to get rid of the additional transition?

Would ignoring race value work? or do we need a "corrective" transition to disable race when pure is explicitly set to on? 🤔

Looking at this https://github.com/bazelbuild/rules_go/blob/30099a6add3c43706b4eec82b773b78310874935/go/private/actions/stdlib.bzl#L127-L131

It seems like the diff would be whether we gonna go install runtime/cgo with/without -race flag added. I blindly ran

> GOCACHE=/tmp/ go install -x -race runtime/cgo`

and it seems to just work? 🤔

sluongng avatar Jan 24 '24 14:01 sluongng

oh nvm, I misread it, the diff is when we have both pure and race, the runtime/cgo package will be missing from go stdlib, which is a requirement for race detection tests.

So the question would be, are there any use cases for such a combination: i.e. setting up a race-detection test with pure=on? or could we assume that pure=on will always be set in a non-test binary?

My guess is it's fine to "ignore the value of race if pureis set toonexplicitly" as Fabian suggested. But there could still be the concern of adding-race` to the compile/link actions without needing race detection feature at runtime.

So I think the additional transition would still be a surefire approach to get the "expected" result in this case.

sluongng avatar Jan 24 '24 14:01 sluongng