continuous-integration icon indicating copy to clipboard operation
continuous-integration copied to clipboard

rules_go fails with bazel@HEAD, but not tied to a commit

Open c-parsons opened this issue 6 years ago • 16 comments

rules_go fails with bazel@HEAD, but it doesn't seem tied to any particular bazel commit. The failing tests are coverage-related, and seem to be that the coverage file is not found

https://buildkite.com/bazel/culprit-finder/builds/171#863978e5-2c65-4b87-8f9f-3aa1487bc77c

I also tried triggering the rules_go CI with release Bazel, and got partial same failures:

https://buildkite.com/bazel/rules-go-golang/builds/1196

There hasn't been a recent commit in rules_go, so I have to assume this is a CI issue.

c-parsons avatar Jun 20 '19 14:06 c-parsons

See also https://github.com/bazelbuild/rules_go/issues/2100

laurentlb avatar Jun 20 '19 19:06 laurentlb

@c-parsons I think this is caused by the release of Bazel 0.27.0.

We didn't change anything on CI itself in quite a while.

philwo avatar Jun 21 '19 14:06 philwo

Pardon my ignorance, but how would the release of Bazel 0.27.0 affect testing rules_go against Bazel@HEAD at various commits?

A Bazel release affecting rules_go CI for Bazel@Release would make sense, but I'd expect to be able to bisect the Bazel@HEAD failures and tie them to a Bazel commit...

c-parsons avatar Jun 21 '19 15:06 c-parsons

@c-parsons That's a good question, but rules_go is definitely green with 0.26.1 and red with 0.27.0: https://buildkite.com/bazel/rules-go-golang/builds/1203 (upper "Ubuntu 18.04" is Bazel 0.26.1, lower "Ubuntu 18.04" is with Bazel 0.27.0).

The "known good" commit that culprit finder ran with in your linked build above (a44ea875254c5a630000f1838764e525cdb864ce) is only 7 days old, so maybe it's just not old enough? Maybe the breakage was much earlier?

philwo avatar Jun 21 '19 16:06 philwo

I'm running another culprit finder here with good commit = 0.27.0 and bad commit = 0.26.1, let's see how that goes: https://buildkite.com/bazel/culprit-finder/builds/172

philwo avatar Jun 21 '19 16:06 philwo

Culprit finder is failing, because it can't find the Bazel binaries it needs on GCS. :(

I have to leave the office now - maybe you can just bisect this locally on your machine?

philwo avatar Jun 21 '19 16:06 philwo

So, this is failed not tied to a release because these tests uses some details about the host machine, particularly it will, in test, run /usr/bin/bazel. I bet that when we released Bazel 0.27.0, the CI machines permanently have 0.27.0 at that location, even for bazel@HEAD tests.

I think we might be able to alter which bazel is used in this test runner by setting the BAZEL environment variable...

Still, what frustrating nondeterminism...

c-parsons avatar Jun 21 '19 19:06 c-parsons

cc @jayconrod

Good catch! This explains why the target was failing for me with old commits and old Bazel versions.

laurentlb avatar Jun 21 '19 19:06 laurentlb

I opened bazelbuild/bazel#8670. I believe it's a bug in coverage.

jayconrod avatar Jun 21 '19 19:06 jayconrod

Oh, wow. Yes, due to the Bazel detection logic in rules_go's bazel_tests.bzl, the "Bazel@HEAD" tests never ran these tests with HEAD, instead they always used the latest release. That's why we didn't detect this breakage before 0.27.0 was released.

I have to think about how to prevent this from happening on CI next week.

philwo avatar Jun 21 '19 19:06 philwo

The test passes locally when I use 0.26.1 and I set the BAZEL variable:

BAZEL=~/.cache/bazelisk/bin/bazel-0.26.1-linux-x86_64 bazelisk test //tests/core/coverage:coverage_test_test 

So the breakage is caused by 0.27. But I cannot reproduce it with any incompatible flag, so it's a bug in Bazel.

laurentlb avatar Jun 21 '19 20:06 laurentlb

@laurentlb Did we find the bug in Bazel? Should we rollback the culprit?

meteorcloudy avatar Jun 27 '19 11:06 meteorcloudy

@meteorcloudy The issue was bazelbuild/bazel#8670. The fix is cherry-picked in bazel 0.27.1.rc1, and I've verified it fixes rules_go CI (log).

jayconrod avatar Jun 27 '19 17:06 jayconrod

@jayconrod Thanks for the notice!

meteorcloudy avatar Jun 28 '19 08:06 meteorcloudy

For the reference, https://github.com/bazelbuild/rules_go/blob/master/tests/bazel_tests.bzl#L331 is the place where rules_go try to find Bazel binary. Bazel CI currently doesn't set BAZEL environment variable so this logic is incorrect when running on Bazel CI.

@philwo @jayconrod do you have ideas how we can make this work? Can Bazel CI set BAZEL env var? Can we communicate Bazel binary in some other way? Should we do it at all?

hlopko avatar Jul 02 '19 09:07 hlopko

@hlopko The issue here was a regression in coverage support in 0.27.0. It should be fixed in 0.27.1. I think the tests were just picking up bazel from PATH.

I think the BAZEL environment variable was a vestige from Jenkins. It seems better to just get the target version of bazel from PATH though. Most tools and local tests will pick it up automatically without having to be customized for this environment.

jayconrod avatar Jul 02 '19 13:07 jayconrod