bazel icon indicating copy to clipboard operation
bazel copied to clipboard

incompatible_enable_cc_toolchain_resolution: Turn on toolchain resolution for cc rules

Open katre opened this issue 6 years ago β€’ 57 comments

Flag: --incompatible_enable_cc_toolchain_resolution Available since: 0.23 Will be flipped in: 6.0.0 Tracking issue: #6516

Current status

C++ rules are to our best knowledge ready to be migrated to platforms. Fat APK binary of Android rules and Multiarch binary for Apple rules are still using the legacy C++ toolchain selection. We expect those to be fixed in Bazel 0.26.

Motivation Currently, cc rules access the cc_toolchain by a combination of the --crosstool_top, --cpu, and --compiler flags. This system is complex, lacks flexibility, and will be removed in favor of using the toolchain resolution system added to Bazel.

When this flag is enabled, the legacy system for accessing the cc toolchain will be disabled, and the cc rules will instead use the toolchain resolution system. Eventually, the legacy code will be removed.

Migration

If you're not using C++, or custom --crosstool_top, --cpu, or --compiler, you don't use select on these options, you can stop reading now, there is nothing to migrate for you.

Build configured by the current C++ toolchain

You will eventually have to migrate selects and conditional logic in Starlark rules to toolchain/platforms. There is no generally applicable migration tool.

But to make incremental migration possible, support for PLATFORM_MAPPING file was added to Bazel. See https://github.com/bazelbuild/bazel/issues/6426 for up to date infomation and documentation.

C++ Toolchain owners

For users who are using --crosstool_top to define custom cc_toolchain targets, it will be necessary to add the corresponding toolchain targets (see the very rudimentary example in Bazel's autoconfigured toolchain), and to register those toolchains (Bazel's autoconfigured toolchain example).

For users who are cross-compiling by using --cpu, it will be necessary to add appropriate platform targets and to use the --platforms flag to ensure targets are built correctly. Until the --cpuflag can be fully deprecated or a migration path can be in place, it will be necessary to set both --cpu and --platforms.

Please use constraint_settings and constraint_values from the canonical Bazel Platforms Repository (don't hesitate to propose missing targets!). If there is a need for C++ specific constraints, feel free to upload a PR to the rules_cc repository. It is extremely important that the whole ecosystem uses the same constraints, we can only reuse libraries and toolchains when we speak the same language.

Starlark rules owners

toolchain_type

For Starlark rules owners who depend on C++ toolchain it will be necessary to declare dependency on C++ toolchain type.

Before:

foo = rule(
    implementation = _foo_impl,
    attrs = {
        "_cc_toolchain": attr.label(default = Label("@bazel_tools//tools/cpp:current_cc_toolchain")),
    },
)

After:

foo = rule(
    implementation = _foo_impl,
    attrs = {
        "_cc_toolchain": attr.label(default = Label("@bazel_tools//tools/cpp:current_cc_toolchain")),
    },
    toolchains = ["@bazel_tools//tools/cpp:toolchain_type"],
  )

In case you're already using Bazel 0.27, save yourself a migration and depend on rules_cc:

    toolchains = ["@rules_cc//cc:toolchain_type"],
find_cpp_toolchain

See the docs and use @rules_cc//cc:find_cc_toolchain.bzl (if using Bazel >= 0.27) or @bazel_tools//tools/cpp:toolchain_utils.bzl to locate current C++ toolchain (otherwise). Also see examples for general usage.

katre avatar Jan 25 '19 17:01 katre

Note that this flag will supercede and replace the previous --enabled_toolchain_types flag, which was introduced before the incompatible change policy was formulated, and which was only ever used for the cc rules, anyway.

katre avatar Jan 25 '19 17:01 katre

People who are interested: @hlopko, @lberki, @nlopezgi.

katre avatar Jan 25 '19 17:01 katre

Will native transitions like https://github.com/bazelbuild/bazel/blob/4b00ab1b97e5711c2fd65eb8150923effac0dae7/src/main/java/com/google/devtools/build/lib/rules/android/AndroidRuleClasses.java#L252 also need attention?

gregestren avatar Jan 29 '19 19:01 gregestren

Yes, @aragos and I have been discussing a plan. We will probably not be able to flip this flag until we have a solution, so I am removing the "Will be flipped in" tag.

katre avatar Jan 29 '19 20:01 katre

Please do not assign issues to more than one team

dslomov avatar Feb 15 '19 15:02 dslomov

When testing with this flag on in rules_go at dd527c7d with Bazel 0.26.0rc8 on macOS, I'm seeing this error:

$ bazel build --incompatible_enable_cc_toolchain_resolution tests/legacy/examples/cgo:sub
> > Loading: 
Loading: 0 packages loaded
Analyzing: target //tests/legacy/examples/cgo:sub (0 packages loaded, 0 targets configured)
INFO: Analyzed target //tests/legacy/examples/cgo:sub (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
bazel: Entering directory `/private/var/tmp/_bazel_jayconrod/95575c789b578fe26b1744a32454af42/execroot/io_bazel_rules_go/'
[0 / 2] [Prepa] BazelWorkspaceStatusAction stable-status.txt
ERROR: /Users/jayconrod/go/src/github.com/bazelbuild/rules_go/tests/legacy/examples/cgo/BUILD.bazel:29:1: GoCompilePkg tests/legacy/examples/cgo/darwin_amd64_stripped/sub%/github.com/bazelbuild/rules_go/examples/cgo/sub.a failed (Exit 1) builder failed: error executing command bazel-out/host/bin/external/go_sdk/builder compilepkg -sdk external/go_sdk -installsuffix darwin_amd64 -src tests/legacy/examples/cgo/sub/floor.go -importpath ... (remaining 19 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox
ld: framework not found UIKit
clang: error: linker command failed with exit code 1 (use -v to see invocation)
compilepkg: error running subcommand: exit status 1
bazel: Leaving directory `/private/var/tmp/_bazel_jayconrod/95575c789b578fe26b1744a32454af42/execroot/io_bazel_rules_go/'
Target //tests/legacy/examples/cgo:sub failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 0.488s, Critical Path: 0.29s
INFO: 0 processes.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully

It looks like -framework UIKit is being added to the C++ link flags. This target doesn't set link flags on its own, so I think this is coming from cc_common. Is that expected?

jayconrod avatar May 14 '19 19:05 jayconrod

I'm investigating the UIKit issue.

hlopko avatar May 15 '19 07:05 hlopko

Yay for surprises :) So the osx C++ toolchain doesn't yet work with platforms because of this TODO. @katre kindly volunteered to fix it :)

hlopko avatar May 15 '19 12:05 hlopko

--action_env=BAZEL_DO_NOT_DETECT_CPP_TOOLCHAIN=1 cannot be used with --incompatible_enable_cc_toolchain_resolution. I was running some additional tests with --incompatible_enable_cc_toolchain_resolution with RBE (although this does not have to do with RBE, I think), and I was unsure if --incompatible_enable_cc_toolchain_resolution was working as I expected it to work (i.e. specifically that it was using the remote cc_toolchain that I was indicating instead of the local one) I tried to add BAZEL_DO_NOT_DETECT_CPP_TOOLCHAIN=1 but my build failed with the following error:

INFO: ToolchainResolution: Selected execution platform @rbe_default//config:platform, type @bazel_tools//tools/jdk:toolchain_type -> toolchain @bazel_tools//tools/jdk:dummy_toolchain
ERROR: /root/.cache/bazel/_bazel_root/7e958634aed2e0b9513fa7cce861a282/external/local_config_cc/BUILD:28:1: in cc_toolchain_suite rule @local_config_cc//:toolchain: cc_toolchain_suite '@local_config_cc//:toolchain' does not contain a toolchain for cpu 'k8'

Note that without the BAZEL_DO_NOT_DETECT_CPP_TOOLCHAIN flag my builds are running fine, and even when I check with --toolchain_resolution_debug it does seem to be the case that my remote builds are picking the toolchain that I want them to pick and not the local one. But perhaps this means that its time to remove support for BAZEL_DO_NOT_DETECT_CPP_TOOLCHAIN and perhaps provide some other way to not register the local_cc_config (i.e., I really want to be able to run builds with the local_cc_config disabled altogether so that I can be super sure all actions can only run remotely with the toolchain config I explicitly set).

nlopezgi avatar May 15 '19 14:05 nlopezgi

@laurentlb I think the migration-0.25 tag should be removed from this issue because of this issue that was cherry picked into 0.26 https://github.com/bazelbuild/bazel/issues/8330

keith avatar May 20 '19 21:05 keith

@nlopezgi In the error message I see cc_toolchain_suite, that shouldn't be used at all with --incompatible_enable_cc_toolchain_resolution, is it possible that you were building the cc_toolchain_target explicitly? The expected error message is:

ERROR: While resolving toolchains for target //:too: no matching toolchains found for types @bazel_tools//tools/cpp:toolchain_type

Can you describe a repro case if there's still a bug?

I just uploaded https://github.com/bazelbuild/bazel/pull/8459, which AFAIK removes the need for BAZEL_DO_NOT_DETECT_CPP_TOOLCHAIN variable (but I kept it around for cases like you mention - where you want to be super sure local_config_cc doesn't register any toolchain. But the original use case - preventing Bazel from autoconfiguring local C++ toolchain when there are remote toolchains fully configured, is fixed by https://github.com/bazelbuild/bazel/pull/8459 (toolchain targets will still be generated, but system will only be inspected after the C++ toolchain is already selected).

hlopko avatar May 24 '19 12:05 hlopko

Hi Marcel, You can repro this error using as base this GCB build script which tests a very simple cc target (//examples/remotebuildexecution/hello_world/cc:say_hello_test, i.e., I am not building the cc_toolchain_target) The GCB yaml file shows all the flags needed to repro. You will need to change the --remote_instance_name to point to some instance of RBE you have access to (or maybe do a local build inside the bazel container, it probably also can be repro'd there?).

The error reproduces when building with --action_env=BAZEL_DO_NOT_DETECT_CPP_TOOLCHAIN=1 (build log) but does not happen when the --action_env=BAZEL_DO_NOT_DETECT_CPP_TOOLCHAIN=1 is ommitted (build log. See also my comment in #8459 about how this will impact remote execution.

nlopezgi avatar May 24 '19 13:05 nlopezgi

Hi @nlopezgi, the problem with BAZEL_DO_NOT_DETECT_CPP_TOOLCHAIN was fixed by https://github.com/bazelbuild/bazel/pull/8459. Thanks for reporting!

To be explicit - I don't think we need to postpone flipping this flag because of BAZEL_DO_NOT_DETECT_CPP_TOOLCHAIN bug - users can still migrate incrementally in 0.26 (with --incompatible_enable_cc_toolchain_resolution disabled), and they will get the correct behavior in 0.27 (with --incompatible_enable_cc_toolchain_resolution enabled).

hlopko avatar May 28 '19 06:05 hlopko

We found a case with master / 0.27.0 where this doesn't work for iOS builds https://github.com/bazelbuild/bazel/issues/8716

keith avatar Jun 25 '19 22:06 keith

Removing the bazel 1.0 label and extending the migration window.

katre avatar Aug 02 '19 14:08 katre

@katre Hello :) Is this still happening? Seems like 2.x is no longer a migration window for this flag, is this expected?

I actually don't care (or know) so much about this flag itself, but I just noticed that we have a special case for this flag in the bazel_bootstrap_distfile_test, which causes the bootstrap to run twice: https://github.com/bazelbuild/bazel/blob/master/src/test/shell/bazel/bazel_bootstrap_distfile_test.sh#L76

This is too expensive for CI (it takes over 10 minutes on Windows). Is there a cheaper way to test this rather than running a full bootstrap (which builds Bazel twice, serially)?

philwo avatar Jan 26 '20 16:01 philwo

I'll investigate.

katre avatar Jan 27 '20 12:01 katre

So're in a tricky situation:

  1. The Configurability team absolutely wants to flip this flag.
  2. The Rules-C++ team member who was championing this is no longer on the team.
  3. We didn't flip this for Bazel 1.0 because of the follow-on problems with Android and iOS rules.
  4. No one is championing those either.

I'm working on a plan to get these addressed and flipped, but it won't be near-term. So to alleviate CI pain, is it possible to disable the test normally, and run it once per day or so, just to catch any regressions?

(CC @aiuto and @lberki in case I am mis-stating the case for flipping this flag)

katre avatar Jan 27 '20 15:01 katre

As a note, we don't appear to be testing this with the migration test pipeline, is that intentional? See https://buildkite.com/bazel/bazelisk-plus-incompatible-flags/builds/393

katre avatar Jan 27 '20 15:01 katre

We're close to making this possible technically (i.e. all the supporting C++ work is done). The big challenge is 3) above: how to not inadvertently break depending projects that aren't in C++ and don't themselves understand platforms.

We have an interim solution. But as John says we need someone to guide the process so projects that might be affected don't get caught offguard and don't know what to do. Ideally this just involves defining a couple of strategic platform mappings in a common place.

gregestren avatar Jan 27 '20 16:01 gregestren

This test is basically the only reason now why our presubmits are 17 instead of 12 minutes and I don't see any way how to make it faster.

I read everything I could find about this flag and I still don't understand why we have to run a full Bazel bootstrap using compile.sh as part of our tests to verify that this flag works.

Couldn't we alternatively build e.g. a helloworld.cc with this flag enabled to ensure that it works?

philwo avatar Feb 10 '20 11:02 philwo

@philwo : how does that test make our presubmits 5 minutes slower? My mental model is that we need to do at least one bootstrap test and if we do one, doing two in parallel takes the same amount of wall time.

lberki avatar Feb 10 '20 13:02 lberki

@katre : in order to test this with incompatible flags, the migration-* tag needs to be on the bug. I think we just simply forgot to add them.

lberki avatar Feb 10 '20 13:02 lberki

@katre: remind me again, what's the problem with (3)? I thought platform mapping makes this a no-op for Android and iOS and the only thing that needs to be done after the flag flip is to change their select() statements.

That said, we should totally have well-known platform definitions to replace the well-known --cpu values.

lberki avatar Feb 10 '20 13:02 lberki

@lberki We are hardware-limited on CI, especially on macOS. Anything we can do to cause the tests to use less hardware resources has massive impact. For example, with my "fix our test framework to not extract 500 JDKs per run" change, I moved us from being so insanely I/O bound that the only way to run our tests was in a RAM disk to being no longer I/O bound at all. Thus we were able to get rid of RAM disks (= switch from highmem to cheaper standard machines) and get rid of the local SSDs (save money and complexity, allows use of machines with faster CPUs that do not have SSDs available).

Another little two line fix then got our postsubmit on Linux down from 50 to 35 minutes on Linux. This high impact is all only possible, because our code and tests eat resources like there's no tomorrow.

This test is the top 1 on my list of "slowest tests on the CI".

A Bazel bootstrap easily occupies the entire machine's capacity (all cores, lots of memory, much I/O). While it does so, the processing power spent on this is not available for other tests. Thus, by running two bootstraps, you occupy the machine's entire capacity twice for the duration of the test. There's nothing gained by running them in parallel - a Bazel bootstrap already is "parallel", because it's Bazel (which is quite good at parallel execution ^^) building itself.

I think we should just remove this test_bootstrap_with_cc_rules_using_platforms test. I talked with Marcel about it, and this test is not to test that the flag works, but it's to test that our code base can still be built with that flag on. That's nice, but that's exactly why we have the Bazelisk migration pipeline on CI, which runs in the night, when using a lot of capacity is fine.

WDYT? If you agree, I'll go ahead and remove the test, so that we just rely on the Bazelisk pipeline. If not, I'd disable it for presubmit, but note that postsubmits aren't free either - we only have so many Macs and while they're running a postsubmit test, they can't run a presubmit.

philwo avatar Feb 10 '20 13:02 philwo

I was thinking we could run the two bootstraps in two different machines and take less wall time that way.

As long as bootstrapping Bazel with the flag set is tested by the incompatible flag pipeline, I'm fine with removing the test. Does it run the bootstrap with the incompatible flags set, though? I'm not sure what happens with the "inner" Bazel instances (the ones that are being tested, not the one that runs the tests)

lberki avatar Feb 10 '20 14:02 lberki

We cannot just run the two bootstraps on different machines, because we simply do not have enough Mac machines. πŸ’ΈπŸ˜€

If this flag is marked as migration ready, then it will be included in the "Bazelisk + Incompatible flags" pipeline. I just verified what that actually does. In that case, it would run a bazel build --incompatible_enable_cc_toolchain_resolution //src:bazel where bazel is the latest release and the source code it tries to build is Bazel HEAD. Which means, it verifies that the latest Bazel release with that flag flipped can still build Bazel from scratch.

Unless there's some specific reason why we have to test this flag using ./compile.sh, it seems like this is exactly what we want to test. WDYT?

philwo avatar Feb 10 '20 14:02 philwo

SGTM. As long as the code path is being tested regularly to prevent rot, I am fine with this.

katre avatar Feb 10 '20 14:02 katre

Sounds like a plan. AFAIU the bootstrapping process relies on Bazel itself to build C++ code and the only part that's special-cased is Java compilation, so it should be all fine.

lberki avatar Feb 10 '20 15:02 lberki

The most recent status here is that the flag is ready to flip, except that Android and Apple rules don't handle platforms well, so there's no clear way to manage those transitions. Configurability is working with the Android rule owners to fix those rules, and then we'll look into Apple rules when that's in place.

katre avatar May 11 '20 16:05 katre