Darwin e2e test failure in CI
The "Run e2e test for darwin_amd64" job seems to be failing at the same spot fairly frequently in recent post-merge runs. It appears to fail on this scenario.
Running this locally, the test does take a while to run, so maybe the error is related to resource constraints? Searching for this error message suggests it may be related to the time it takes to instantiate the provider, rather than a spurious network failure, as you might assume from a quick read of the error message.
```
tofu test 19.12s user 4.15s system 123% cpu 18.898 total
```
Side note: the example code in the test directory (and really, a lot of the tf/hcl code in the codebase) is not formatted with `tofu fmt`.
### OpenTofu Version
latest
### OpenTofu Configuration Files
https://github.com/opentofu/opentofu/tree/main/internal/command/e2etest/testdata/overrides-in-tests
### Debug Output
See GHA runs
### Expected Behavior
The test should pass predictably
### Actual Behavior
```
=== RUN TestMocksAndOverrides
test_test.go:74: unexpected error on 'test': exit status 1
test_test.go:77: unexpected stderr output on 'test':
╷
│ Error: failed to read schema for aws_s3_bucket.test in registry.opentofu.org/hashicorp/aws: failed to instantiate provider "registry.opentofu.org/hashicorp/aws" to obtain schema: timeout while waiting for plugin to start
│
│
╵
```
### Steps to Reproduce
- merge code
### Additional Context
No response
### References
- Example https://github.com/opentofu/opentofu/actions/runs/11053024375/job/30706734815#logs
Our initial understanding is that this is due to the emulation needed to run arm64 on amd64, but it's worth investigating.
I don't have direct experience with this happening in the e2etests specifically, but this error message functionally means that the plugin client hit a timeout while it was waiting for the plugin to print the plugin handshake message to its stdout. The plugin is expected to print a single line that describes the location of the temporary Unix domain socket that the plugin server is listening on, and the protocol major version it's expecting to speak on that socket.
I don't have any immediate concrete theories about exactly why the plugin seems to be hanging on startup -- not exiting with an error, but also not printing anything. This symptom was once explained by some antivirus software interfering with the plugin process at the kernel level, causing a deadlock in the plugin that caused it to hang forever, but that was on Windows rather than macOS.
I think @cam72cam's idea about Rosetta emulation is definitely worth tugging on a little. Although I've typically heard good things about Terraform and providers running under Rosetta, it is still an additional variable that could potentially cause problems for an unsuspecting plugin server.
Ah, Ok. I had assumed it had worked at one point. Is it worth skipping that test for now (until a better solution is found) if it's reliably failing?
It does seem to sometimes succeed, I think?
To be clear, I was not meaning to assert that this test is expected to fail, although perhaps that was @cam72cam's intention. (I didn't read it that way!)
Given how the e2etest harness works (when you run it through go test), I would've expected it to:
- Build a `tofu` executable from the current source code into a temporary directory, using a command equivalent to `go build -o /tmp/something/tofu github.com/opentofu/opentofu/cmd/tofu`, giving you an executable built for your host platform, `darwin_arm64`.
- Run `tofu init` with that executable in a temporary copy of the fixture directory you indicated, which would therefore create a `.terraform/providers` directory and download the `darwin_arm64` builds of all of the needed plugins into it. (This particular test fixture is unusual in that it uses quite a few different plugins all at once.)
- Run `tofu test` with the same executable in the same directory, thereby causing the test harness to spin up all the mocked plugins to learn their schemas. That seems to be where your error occurred.
I don't know of any immediate reason why that should fail, but maybe you could try to reproduce it manually at your shell prompt by running similar commands and see if you get the same result.
In particular, I don't see any reason why Rosetta ought to be involved here, now that I actually look at the test.
Right, sorry, I also didn't mean "expected" to fail in that sense. Just that if it's failing most or all of the time, marking it as skipped might be better than having that particular test red all the time.
While I would normally think this a rather rude comment to make, I guess in this case it's actually useful information... :grinning:
It works just fine on my (linux_amd64) machine, so if there is a general problem here it's at least scoped to only a subset of platforms.
I think we should probably try to decide whether this problem is unique to your computer or is a general problem for all darwin_arm64 machines. I don't have such a machine with a Go toolchain handy, but I'd bet that there's someone else pretty nearby who does. :grinning:
Not sure I follow? I have not ever tried to run this test locally (I'm not sure if this e2e test will work properly locally). I'm just observing that it seems to fail very predictably in the GH Actions job that runs on merge to main.
The only (low-tech) thing I tried to do locally was to init in the same directory that the test uses on an arm64 machine.
And sorry if my wording in the initial description was confusing. When I said "running this locally", I was referring to running `tofu init` and `tofu test` on the set of configs the test uses, not to running the Go e2e test suite itself. That succeeded (that is, `tofu test` produced the expected output).
I'm sure you all would have a better idea about possible reasons for failures. I was just speculating, based on my experience with general test failures in CI and on what I found online about this timeout, that the flaky test could be due to resource constraints on the CI container. If I'm totally off base there, feel free to disregard.
Ahh, indeed, sorry I had totally misunderstood that your focus was on the automated runs in GitHub Actions, rather than on your own system. Though of course now I re-read it seems obvious in retrospect! Sorry for the diversion.
The GitHub Actions environment likely does add some extra variables into the mix. I don't have a great deal of experience with the details of that, but I'm pretty sure this isn't the only end-to-end test that installs and runs a plugin so I guess there must be something unusual/interesting about this test.
The fact that it depends on so many provider plugins at once might be it -- if the GitHub Actions macOS runner has tight RAM usage constraints, for example, then asking a bunch of providers for their schemas all in quick succession could potentially cause a peak memory usage limit to be crossed.
I'm not sure if you're debugging / re-running it or if this is just a coincidence, but the job for the last commit seems to have hung rather than succeeding or failing (at the same spot: TestMocksAndOverrides)?
https://github.com/opentofu/opentofu/actions/runs/11151350053/job/30994734276
The fact that it depends on so many provider plugins at once might be it -- if the GitHub Actions macOS runner has tight RAM usage constraints
Yes, that's exactly the kind of scenario I was thinking about, especially looking at other scenarios that caused that issue, and considering the fact that it seems flaky vs. failing consistently. I don't think GHA trivially provides a way to see that resource utilization, even for folks with privileges on a project, so it is a tricky kind of issue to track down or fix.
I'm not personally doing anything to the GitHub Actions jobs at this time.
And yes, I can see how it might have read that way. Updated the title above which hopefully helps make it more clear.
https://github.com/opentofu/opentofu/blob/main/.github/workflows/checks.yml#L55-L59
We are using QEMU under the hood to execute binaries from alternate architectures in CI. This ties into binfmt_misc to provide the emulation layer.
That emulation might be using more memory than expected or just be slower to execute certain operations on certain architectures.
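For illustration, registering those QEMU handlers in a GitHub Actions job typically looks something like the step below. This is a hedged sketch using the common `docker/setup-qemu-action`, not necessarily the exact contents of our checks.yml:

```yaml
# Illustrative only: register QEMU interpreters via binfmt_misc so the
# runner can exec binaries built for foreign architectures.
- name: Set up QEMU
  uses: docker/setup-qemu-action@v3
  with:
    platforms: arm64,arm
```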
One thought for a low-tech option: we could take a pass at splitting the tofu code the integration test runs into two different testdata directories (and turn it into two tests, maybe with one containing the tests that use the local provider, and the other the ones that use the random and aws providers?). The existing test is kind of hard to parse anyway, since there are so many layers of aliasing, mocking, etc. going on.
One other thing I observed: running the tofu code here using Terraform 1.9.6 instead of OpenTofu (again, via `terraform test`, not the Go integration test that runs it) results in 3 test failures. I opened an issue with them for what I think might be a bug that this turns up in at least one of those failures, unless I'm misunderstanding the way overrides are intended to work.
Also, am I right in assuming OpenTofu doesn't have access to macOS runners, so this could run without emulation?
I do see that the release task runs on the larger-runners resource - if those are available, maybe configuring this test to run on a larger runner might work?
```diff
diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index e21ed4436b..d2eb072e64 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -154,7 +154,7 @@ jobs:
 e2etest-build:
 name: Build e2etest for ${{ matrix.goos }}_${{ matrix.goarch }}
- runs-on: ubuntu-latest
+ runs-on: ${{ matrix.runson }}
 outputs:
 e2e-cache-key: ${{ steps.set-cache-values.outputs.e2e-cache-key }}
 e2e-cache-path: ${{ steps.set-cache-values.outputs.e2e-cache-path }}
@@ -164,14 +164,14 @@ jobs:
 strategy:
 matrix:
 include:
- - {goos: "darwin", goarch: "amd64"}
- - {goos: "darwin", goarch: "arm64"}
- - {goos: "windows", goarch: "amd64"}
- - {goos: "windows", goarch: "386"}
- - {goos: "linux", goarch: "386"}
- - {goos: "linux", goarch: "amd64"}
- - {goos: linux, goarch: "arm"}
- - {goos: linux, goarch: "arm64"}
+ - {goos: "darwin", goarch: "amd64", runson: larger-runners}
+ - {goos: "darwin", goarch: "arm64", runson: ubuntu-latest}
+ - {goos: "windows", goarch: "amd64", runson: ubuntu-latest}
+ - {goos: "windows", goarch: "386", runson: ubuntu-latest}
+ - {goos: "linux", goarch: "386", runson: ubuntu-latest}
+ - {goos: "linux", goarch: "amd64", runson: ubuntu-latest}
+ - {goos: linux, goarch: "arm", runson: ubuntu-latest}
+ - {goos: linux, goarch: "arm64", runson: ubuntu-latest}
 fail-fast: false
 env:
```
I can check with @Yantrio next week what our budget is for the larger runners, which would probably be the easiest solution here.
Normalization of deviance around test failures is not a good path to be on.
Side note: the reason the override tests behave differently with upstream Terraform is that they don't sandbox each run block, so unless you specifically tell it to replace the resource (via `plan_options { replace = [...] }`), the overrides won't take effect when the same resource name is used.
Not sure if the divergence from terraform's behavior here is a feature or a bug, and whether or not it's worth filing an issue about?
This could also be the reason why it's faster (and, for the random provider especially, probably less resource-intensive) with Terraform.
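For reference, the workaround described above would look roughly like this inside a test file's run block (a sketch only; the run block name and resource address are hypothetical placeholders):

```hcl
run "uses_override" {
  plan_options {
    # Force re-creation of the resource so the override takes effect even
    # when the same resource name was used by an earlier run block.
    replace = [aws_s3_bucket.test]
  }
}
```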
I stumbled across this issue while googling the error message, might not be the right place to post this, but I'm experiencing this as part of terragrunt plan on an M2 MBP.
```
12:14:09.108 STDERR tofu: │ Error: timeout while waiting for plugin to start
12:14:09.108 STDERR tofu: │
```
It's inconsistent, though. I'd say 4/5 times it doesn't work, sometimes I get lucky.
My `tofu --version` output is:
```
OpenTofu v1.6.2
on darwin_amd64
```
@lagerfeuer, that's very interesting. Do you mind if I ask what provider you are using?
I'm using the following:
```hcl
terraform {
  required_version = ">= 1.0.0, < 2.0.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    datadog = {
      source  = "DataDog/datadog"
      version = "~> 3.44"
    }
  }
}
```
I've also gotten a different error sporadically that may or may not be related:
```
11:45:23.668 STDERR tofu: │ Error: Failed to load plugin schemas
11:45:23.668 STDERR tofu: │
11:45:23.668 STDERR tofu: │ Error while loading schemas for plugin components: Failed to obtain
11:45:23.668 STDERR tofu: │ provider schema: Could not load the schema for provider
11:45:23.669 STDERR tofu: │ registry.opentofu.org/hashicorp/aws: failed to instantiate provider
11:45:23.669 STDERR tofu: │ "registry.opentofu.org/hashicorp/aws" to obtain schema: timeout while
11:45:23.669 STDERR tofu: │ waiting for plugin to start..
11:45:23.669 STDERR tofu: ╵
```
@lagerfeuer Can you switch from darwin_amd64 to darwin_arm64? The emulation is likely causing too long of a delay.
In case someone else finds this issue: I am using tofuenv to manage OpenTofu versions. It appears that the first time I installed v1.6.2, it was using darwin_amd64. I've since updated tofuenv, and reinstalling this version now correctly installs the darwin_arm64 build. I haven't done a lot of testing, but it seems to work now.
Thanks, and sorry for hijacking this issue @cam72cam.
Seems like the Darwin e2e tests are now passing, e.g., https://github.com/opentofu/opentofu/actions/runs/15639168433/job/44062069048?
I'm going to close this, but feel free to reopen if there's a need to.