Bazel CI: rules_docker still failing with Bazel@HEAD

Open meteorcloudy opened this issue 2 years ago • 18 comments

https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/2290#7782e19b-d082-4fa0-9ca6-ba4c6413740b

(04:03:10) ERROR: /var/lib/buildkite-agent/builds/bk-docker-9znq/bazel-downstream-projects/rules_docker/tests/container/BUILD:817:16: While resolving toolchains for target //tests/container:alpine_arch_ppc64le: no matching toolchains found for types //toolchains/docker:toolchain_type
(04:03:10) ERROR: Analysis of target '//tests/container:architecture_test' failed; build aborted:

A bisect shows the breaking change is: https://github.com/bazelbuild/bazel/commit/98d376faeb206f14838156ce4cb305ddbfce08fa

I suspect it has something to do with how platform transition is defined here: https://github.com/bazelbuild/rules_docker/blob/4c6d61c356aed293ea18c30d7dc50cfa609a3cf6/platforms/BUILD#L60-L84

meteorcloudy avatar Dec 22 '21 12:12 meteorcloudy

/cc @brandjon Can you help advise how to fix this?

meteorcloudy avatar Dec 22 '21 12:12 meteorcloudy

FYI @uhthomas

meteorcloudy avatar Jan 05 '22 08:01 meteorcloudy

@brandjon ping, do you have any idea what's happening here?

meteorcloudy avatar Jan 14 '22 09:01 meteorcloudy

Since https://github.com/bazelbuild/bazel/commit/98d376faeb206f14838156ce4cb305ddbfce08fa is in Bazel 5.0, this means rules_docker has to fix this issue to be able to work with the next Bazel release.

meteorcloudy avatar Jan 14 '22 09:01 meteorcloudy

I've taken a quick look.

❯ git switch -d 76c708fc979c1bfb65b4db300c654be08f096874
❯ USE_BAZEL_VERSION=98d376faeb206f14838156ce4cb305ddbfce08fa bazel test //... --toolchain_resolution_debug
...
INFO: ToolchainResolution:     Type //toolchains/docker:toolchain_type: target platform @io_bazel_rules_docker//platforms:image_transition: Rejected toolchain @docker_config//:toolchain; mismatching values: linux
INFO: ToolchainResolution:     Type //toolchains/docker:toolchain_type: target platform @io_bazel_rules_docker//platforms:image_transition: Rejected toolchain @docker_config//:toolchain; mismatching values: windows
...
INFO: ToolchainResolution:     Type //toolchains/docker:toolchain_type: target platform @io_bazel_rules_docker//platforms:image_transition: Rejected toolchain @docker_config//:toolchain; mismatching values: osx
...
INFO: ToolchainResolution:   Type //toolchains/docker:toolchain_type: target platform @io_bazel_rules_docker//platforms:image_transition: No toolchains found.
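
The debug output above is long, so a small filter can reduce it to just the rejected toolchains and the constraint values that did not match. This is a minimal sketch: `summarize_rejections` is a hypothetical helper name, and the pattern assumes the log lines keep the exact "Rejected toolchain ...; mismatching values: ..." wording shown above.

```shell
# Hypothetical helper (not part of Bazel or rules_docker).
# Reads toolchain resolution debug output on stdin and prints one
# line per distinct (toolchain, mismatching value) pair.
summarize_rejections() {
  sed -n 's/.*Rejected toolchain \([^;]*\); mismatching values: \(.*\)$/\1 mismatches: \2/p' \
    | sort -u
}
```

Since Bazel prints these INFO lines to stderr, the command above would be piped as `bazel test //... --toolchain_resolution_debug 2>&1 | summarize_rejections`.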

I'll take a deeper look later today to understand what's happening.

uhthomas avatar Jan 14 '22 11:01 uhthomas

I'm confused.

The linked commit (https://github.com/bazelbuild/bazel/commit/98d376faeb206f14838156ce4cb305ddbfce08fa) is from January 2021, over a year ago. Has it only recently been merged? If not, what has now caused this problem?

Whilst debugging I found that unrelated, seemingly random, targets and tests fail. For example:

ERROR: /home/thomas/code/github.com/uhthomas/rules_docker/tests/contrib/BUILD:137:17: in container_bundle_ rule //tests/contrib:create_empty_bundle:
Traceback (most recent call last):
	File "/home/thomas/code/github.com/uhthomas/rules_docker/container/bundle.bzl", line 67, column 15, in _container_bundle_impl
		_incr_load(
	File "/home/thomas/code/github.com/uhthomas/rules_docker/container/layer_tools.bzl", line 232, column 28, in incremental_load
		run_tag = images.keys()[0]
Error: index out of range (index is 0, but sequence has 0 elements)
ERROR: Analysis of target '//tests/contrib:create_empty_bundle' failed; build aborted: Analysis of target '//tests/contrib:create_empty_bundle' failed

In regard to Docker toolchain resolution, I believe the //toolchains/docker:toolchain_type toolchains should use exec_compatible_with rather than target_compatible_with. This solves the original issue, but raises new ones like https://github.com/bazelbuild/bazel/issues/8751.

I suspect that we should make a patch to disable transitioning by default as it appears that Bazel just isn't ready for it.

uhthomas avatar Jan 14 '22 16:01 uhthomas

> The linked commit (bazelbuild/bazel@98d376f) is from January 2021, over a year ago. Has it only recently been merged? If not, what has now caused this problem?

Yes, the commit is very old, but it was not included in the Bazel 4.x release (our first LTS release). 5.0 is coming out very soon and will contain this change. rules_docker has been broken by this commit with Bazel@HEAD for a long time, but it was only reported here recently.

/cc @katre @gregestren Can you help with this issue?

meteorcloudy avatar Jan 17 '22 12:01 meteorcloudy

Do you have a minimal build that demonstrates the failure, aside from what's shown above?

A quick look tells me that if https://github.com/bazelbuild/bazel/commit/98d376f is causing this, it'd have to be some combination of a Starlark transition being applied and a user-defined build flag that might not be defined in the same repo as where the build is happening.

Do the failures involve any flags?

The next diagnostic step I'd try is to run bazel cquery deps(//:target_im_building) before and after https://github.com/bazelbuild/bazel/commit/98d376f. Identify the failing target's configuration hash and run bazel config <that hash>. See if any flag values are different as a result of https://github.com/bazelbuild/bazel/commit/98d376f. That could help identify whether https://github.com/bazelbuild/bazel/commit/98d376f actually changes any configurations anywhere. If not, I don't see how https://github.com/bazelbuild/bazel/commit/98d376f could cause toolchain resolution errors.

But happy to work with you to diagnose better.
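
The cquery and config steps described above can be sketched as a pair of shell functions. This is an illustration under assumptions, not a verified recipe: `config_hash_for` and `show_target_config` are hypothetical names, and the parsing assumes cquery's default output of one `<label> (<config hash>)` pair per line.

```shell
# Print the configuration hash that cquery reports for one label.
# $1 = label to look up; stdin = default `bazel cquery` output.
config_hash_for() {
  awk -v label="$1" '$1 == label { gsub(/[()]/, "", $2); print $2 }'
}

# Hypothetical driver, defined but not invoked here because it needs
# a Bazel workspace. If the label appears under several
# configurations, only the first hash is used.
show_target_config() {
  target="$1"
  hash=$(bazel cquery "deps(${target})" 2>/dev/null \
    | config_hash_for "$target" | head -n 1)
  bazel config "$hash"
}
```

Running `show_target_config` for the same target once with each Bazel version and comparing the two outputs would show whether any flag values differ between them.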

gregestren avatar Jan 17 '22 16:01 gregestren

> Whilst debugging I found that unrelated, seemingly random, targets and tests fail. For example:

> In regard to Docker toolchain resolution, I believe the //toolchains/docker:toolchain_type toolchains should use exec_compatible_with rather than target_compatible_with. This solves the original issue, but raises new ones like bazelbuild/bazel#8751.

@uhthomas Are the other errors caused by the same issue? I'm seeing the same error with Bazel 4.2.2:

$ bazel version
Build label: 4.2.2

$ bazel build //tests/contrib:create_empty_bundle
ERROR: /usr/home/greg/bazel/rules_docker/tests/contrib/BUILD:137:17: in container_bundle_ rule //tests/contrib:create_empty_bundle:
Traceback (most recent call last):
	File "/usr/home/greg/bazel/rules_docker/container/bundle.bzl", line 67, column 15, in _container_bundle_impl
		_incr_load(
	File "/usr/local/home/greg/bazel/rules_docker/container/layer_tools.bzl", line 232, column 28, in incremental_load
		run_tag = images.keys()[0]
Error: index out of range (index is 0, but sequence has 0 elements)
ERROR: Analysis of target '//tests/contrib:create_empty_bundle' failed; build aborted: Analysis of target '//tests/contrib:create_empty_bundle' failed

Should I expect that to work?

(I'm trying to replicate the CI command from https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/2290#7782e19b-d082-4fa0-9ca6-ba4c6413740b but haven't yet gotten docker properly set up on my machine to work with any version)

gregestren avatar Jan 18 '22 21:01 gregestren

@gregestren To reproduce:

docker run -it --init gcr.io/bazel-public/ubuntu1804-java11
root@4f4a89faff2e:/# mkdir workdir
root@4f4a89faff2e:/# cd workdir/
root@4f4a89faff2e:/workdir# git clone https://github.com/bazelbuild/rules_docker.git
root@4f4a89faff2e:/workdir# cd rules_docker/
root@4f4a89faff2e:/workdir/rules_docker# export USE_BAZEL_VERSION=98d376faeb206f14838156ce4cb305ddbfce08fa
root@4f4a89faff2e:/workdir/rules_docker# bazel build //tests/container:alpine_arch_ppc64le
2022/01/19 10:50:30 Using unreleased version at commit 98d376faeb206f14838156ce4cb305ddbfce08fa
2022/01/19 10:50:30 Downloading https://storage.googleapis.com/bazel-builds/artifacts/ubuntu1404/98d376faeb206f14838156ce4cb305ddbfce08fa/bazel...
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
DEBUG: /root/.cache/bazel/_bazel_root/d0f48cfc39bf7313c85e758b7dac1933/external/bazel_toolchains/rules/rbe_repo/version_check.bzl:59:14:
Current running Bazel is not a release version and one was not defined explicitly in rbe_autoconfig target. Falling back to '4.0.0'
ERROR: While resolving toolchains for target //tests/container:alpine_arch_ppc64le: no matching toolchains found for types //toolchains/docker:toolchain_type
ERROR: Analysis of target '//tests/container:alpine_arch_ppc64le' failed; build aborted: no matching toolchains found for types //toolchains/docker:toolchain_type
INFO: Elapsed time: 16.516s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (43 packages loaded, 205 targets configured)

If you set USE_BAZEL_VERSION to https://github.com/bazelbuild/bazel/commit/98d376f 's parent commit, it works:

root@4f4a89faff2e:/workdir/rules_docker# export USE_BAZEL_VERSION=6d8f0671cb0c9456d2a95d8a54fcd0453854b255
root@4f4a89faff2e:/workdir/rules_docker# bazel build //tests/container:alpine_arch_ppc64le
2022/01/19 10:51:26 Using unreleased version at commit 6d8f0671cb0c9456d2a95d8a54fcd0453854b255
2022/01/19 10:51:26 Downloading https://storage.googleapis.com/bazel-builds/artifacts/ubuntu1404/6d8f0671cb0c9456d2a95d8a54fcd0453854b255/bazel...
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
DEBUG: /root/.cache/bazel/_bazel_root/d0f48cfc39bf7313c85e758b7dac1933/external/bazel_toolchains/rules/rbe_repo/version_check.bzl:59:14:
Current running Bazel is not a release version and one was not defined explicitly in rbe_autoconfig target. Falling back to '4.0.0'
INFO: Analyzed target //tests/container:alpine_arch_ppc64le (113 packages loaded, 7302 targets configured).
INFO: Found 1 target...
Target //tests/container:alpine_arch_ppc64le up-to-date:
  bazel-out/k8-fastbuild-ST-15abc339c81c/bin/tests/container/alpine_arch_ppc64le-layer.tar
INFO: Elapsed time: 24.352s, Critical Path: 1.66s
INFO: 49 processes: 17 internal, 32 processwrapper-sandbox.
INFO: Build completed successfully, 49 total actions
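
The two runs above can be wrapped in a small function to make the comparison repeatable. This is a minimal sketch: `build_at_commit` is a hypothetical helper, and it assumes the `bazel` on PATH is Bazelisk (as in the container above), which is what honors `USE_BAZEL_VERSION`.

```shell
# Command to invoke; override BAZEL to point at a different launcher.
BAZEL="${BAZEL:-bazel}"

# Hypothetical helper: build one target with Bazel at a given commit
# and report only whether the build succeeded.
# $1 = Bazel commit hash, $2 = target label
build_at_commit() {
  if USE_BAZEL_VERSION="$1" "$BAZEL" build "$2" >/dev/null 2>&1; then
    echo "$1 OK"
  else
    echo "$1 FAILED"
  fi
}

# Example calls, using the commits and target from this thread:
# build_at_commit 98d376faeb206f14838156ce4cb305ddbfce08fa //tests/container:alpine_arch_ppc64le
# build_at_commit 6d8f0671cb0c9456d2a95d8a54fcd0453854b255 //tests/container:alpine_arch_ppc64le
```

The function discards build output on purpose; rerun the failing commit by hand to see the full error.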

meteorcloudy avatar Jan 19 '22 10:01 meteorcloudy

Thanks @meteorcloudy. I'm wondering if we need to do more than

> In regard to Docker toolchain resolution, I believe the //toolchains/docker:toolchain_type toolchains should use exec_compatible_with rather than target_compatible_with. This solves the original issue

from https://github.com/bazelbuild/rules_docker/issues/1988#issuecomment-1013294961? Or at least if the remaining problems are caused by the same code?

gregestren avatar Jan 19 '22 13:01 gregestren

Is this related?

https://github.com/tweag/rules_haskell/issues/1657

It seems to be the same error message.

uhthomas avatar Jan 28 '22 12:01 uhthomas

The symptom is similar, but I'm not sure about the root cause.

meteorcloudy avatar Jan 28 '22 13:01 meteorcloudy

I haven't heard loud complaints yet, but I think this issue is preventing users from using rules_docker with Bazel 5.0.

meteorcloudy avatar Apr 25 '22 08:04 meteorcloudy

This issue has been automatically marked as stale because it has not had any activity for 180 days. It will be closed if no further activity occurs in 30 days. Collaborators can add an assignee to keep this open indefinitely. Thanks for your contributions to rules_docker!

github-actions[bot] avatar Oct 23 '22 03:10 github-actions[bot]

This issue was automatically closed because it went 30 days without a reply since it was labeled "Can Close?"

github-actions[bot] avatar Nov 23 '22 02:11 github-actions[bot]

Should this be reopened?

mostynb avatar Dec 16 '22 10:12 mostynb

/reopen

farcop avatar Feb 13 '23 08:02 farcop