
Darwin e2e test failure in CI

Open wyardley opened this issue 1 year ago • 22 comments

The "Run e2e test for darwin_amd64" job seems to be failing on the same spot fairly frequently in recent runs after merges. Seems to be failing on this scenario.

Running this locally, it does seem like this test takes a while to run, so maybe the error is related to resource constraints? Searching for this error message suggests it may be related to the time it takes to instantiate the provider, rather than a spurious network problem, as a quick read of the error message might imply.

tofu test  19.12s user 4.15s system 123% cpu 18.898 total

Side note: the example code in the test directory (and really, a lot of tf / hcl code in the codebase) is not formatted with tf fmt

OpenTofu Version

latest

OpenTofu Configuration Files

https://github.com/opentofu/opentofu/tree/main/internal/command/e2etest/testdata/overrides-in-tests

Debug Output

See GHA runs

Expected Behavior

The test should pass predictably

Actual Behavior

=== RUN   TestMocksAndOverrides
    test_test.go:74: unexpected error on 'test': exit status 1
    test_test.go:77: unexpected stderr output on 'test':
        ╷
        │ Error: failed to read schema for aws_s3_bucket.test in registry.opentofu.org/hashicorp/aws: failed to instantiate provider "registry.opentofu.org/hashicorp/aws" to obtain schema: timeout while waiting for plugin to start
        │ 
        │ 
        ╵

Steps to Reproduce

  1. merge code

Additional Context

No response

References

  • Example https://github.com/opentofu/opentofu/actions/runs/11053024375/job/30706734815#logs

wyardley avatar Oct 02 '24 21:10 wyardley

Our initial understanding is that this is due to the emulation needed to run arm64 on amd64, but it's worth investigating.

cam72cam avatar Oct 02 '24 22:10 cam72cam

I don't have direct experience with this happening in the e2etests specifically, but this error message functionally means that the plugin client hit a timeout while it was waiting for the plugin to print the plugin handshake message to its stdout. The plugin is expected to print a single line that describes the location of the temporary Unix domain socket that the plugin server is listening on, and the protocol major version it's expecting to speak on that socket.

I don't have any immediate concrete theories about exactly why the plugin seems to be hanging on startup -- not exiting with an error, but also not printing anything. This symptom was once explained by some antivirus software interfering with the plugin process at the kernel level, causing a deadlock in the plugin that caused it to hang forever, but that was on Windows rather than macOS.
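For context, the single line the client waits for follows go-plugin's handshake format: core protocol version, plugin protocol version, network type, address, and wire protocol, separated by pipes. A minimal sketch of parsing such a line (the socket path here is made up):

```shell
# A sample go-plugin handshake line, as a provider would print it on stdout.
# The client parses it to learn where the plugin server is listening.
line='1|5|unix|/tmp/plugin1098765432|grpc'
addr=$(printf '%s' "$line" | cut -d'|' -f4)   # Unix domain socket path
proto=$(printf '%s' "$line" | cut -d'|' -f5)  # wire protocol
echo "plugin listening on $addr via $proto"
```

If that line never arrives on stdout before the client's deadline, the client gives up with exactly the "timeout while waiting for plugin to start" error seen above.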

I think @cam72cam's idea about Rosetta emulation is definitely worth tugging on a little. Although I've typically heard good things about Terraform and providers running under Rosetta, it is still an additional variable that could potentially cause problems for an unsuspecting plugin server.

apparentlymart avatar Oct 02 '24 22:10 apparentlymart

Ah, Ok. I had assumed it had worked at one point. Is it worth skipping that test for now (until a better solution is found) if it's reliably failing?

It does seem to sometimes succeed, I think?

wyardley avatar Oct 02 '24 22:10 wyardley

To be clear, I was not meaning to assert that this test is expected to fail, although perhaps that was @cam72cam's intention. (I didn't read it that way!)

Given how the e2etest harness works (when you run it through go test), I would've expected it to:

  • Build a tofu executable from the current source code into a temporary directory, using a command equivalent to go build -o /tmp/something/tofu github.com/opentofu/opentofu/cmd/tofu, giving you an executable built for your host platform, darwin_arm64.
  • Run tofu init with that executable in a temporary copy of the fixture directory you indicated, which would therefore create a .terraform/providers directory and download the darwin_arm64 builds of all of the needed plugins into it. (This particular test fixture is unusual in that it uses quite a few different plugins all at once.)
  • Run tofu test with the same executable in the same directory, thereby causing the test harness to spin up all the mocked plugins to learn their schemas. That seems to be where your error occurred.

I don't know of any immediate reason why that should fail, but maybe you could try to reproduce it manually at your shell prompt by running similar commands and see if you get the same result.
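Done manually from the repo root, those steps might look roughly like this (the temporary directory name is an assumption, not what the harness actually uses):

```shell
# Build tofu from the working tree for the host platform
mkdir -p /tmp/tofu-e2e
go build -o /tmp/tofu-e2e/tofu ./cmd/tofu

# Run init and test in a copy of the fixture the e2e test uses
cp -R internal/command/e2etest/testdata/overrides-in-tests /tmp/tofu-e2e/fixture
cd /tmp/tofu-e2e/fixture
/tmp/tofu-e2e/tofu init
/tmp/tofu-e2e/tofu test
```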

In particular, I don't see any reason why Rosetta ought to be involved here, now that I actually look at the test.

apparentlymart avatar Oct 02 '24 23:10 apparentlymart

Right, sorry, I also didn't mean "expected" to fail in that sense. Just that if it's failing most or all of the time, marking it as skipped might be better than having that particular test red all the time.

wyardley avatar Oct 02 '24 23:10 wyardley

While I would normally think this a rather rude comment to make, I guess in this case it's actually useful information... :grinning:

It works just fine on my (linux_amd64) machine, so if there is a general problem here it's at least scoped to only a subset of platforms.

I think we should probably try to decide whether this problem is unique to your computer or is a general problem for all darwin_arm64 machines. I don't have such a machine with a Go toolchain handy, but I'd bet that there's someone else pretty nearby who does. :grinning:

apparentlymart avatar Oct 02 '24 23:10 apparentlymart

Not sure I follow? I have not ever tried to run this test locally (I'm not sure if this e2e test will work properly locally). I'm just observing that it seems to fail very predictably in the GH Actions job that runs on merge to main.

The only (low-tech) thing I tried to do locally was to init in the same directory that the test uses on an arm64 machine.

wyardley avatar Oct 02 '24 23:10 wyardley

And sorry if my wording in the initial description was confusing. When I said "running this locally", I was just referring to running init and [tofu] test on the set of configs the test uses, not to running the test itself. That succeeded: running tofu test produced the expected output. I was not running the go e2e test suite.

I'm sure you all would have a better idea about possible reasons for failures. I was just speculating, based on my experiences with general test failures in CI and on what I found online about this timeout, that the flaky test could be due to resource constraints on the CI container. If I'm totally off base there, feel free to disregard.

wyardley avatar Oct 03 '24 00:10 wyardley

Ahh, indeed, sorry; I had totally misunderstood and thought your focus was on your own system rather than on the automated runs in GitHub Actions. Of course, now that I re-read it, it seems obvious in retrospect! Sorry for the diversion.

The GitHub Actions environment likely does add some extra variables into the mix. I don't have a great deal of experience with the details of that, but I'm pretty sure this isn't the only end-to-end test that installs and runs a plugin so I guess there must be something unusual/interesting about this test.

The fact that it depends on so many provider plugins at once might be it -- if the GitHub Actions macOS runner has tight RAM usage constraints, for example, then asking a bunch of providers for their schemas all in quick succession could potentially cause a peak memory usage limit to be crossed.
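One low-tech way to test that theory locally is to watch peak memory while the schemas load; a sketch, assuming tofu is on PATH and you are in the fixture directory:

```shell
# macOS: /usr/bin/time -l reports "maximum resident set size" alongside timing
/usr/bin/time -l tofu test
# Linux (GNU time): -v prints the equivalent "Maximum resident set size"
/usr/bin/time -v tofu test
```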

apparentlymart avatar Oct 03 '24 00:10 apparentlymart

I'm not sure if you're debugging / re-running it or if this is just a coincidence, but the job for the last commit seems to have hung (at the same spot, TestMocksAndOverrides) rather than succeeding or failing outright?

https://github.com/opentofu/opentofu/actions/runs/11151350053/job/30994734276

The fact that it depends on so many provider plugins at once might be it -- if the GitHub Actions macOS runner has tight RAM usage constraints

Yes, that's exactly the kind of scenario I was thinking about, especially looking at other scenarios that caused that issue, and considering the fact that it seems flaky vs. failing consistently. I don't think GHA trivially provides a way to see that resource utilization, even for folks with privileges on a project, so it is a tricky kind of issue to track down or fix.

wyardley avatar Oct 03 '24 00:10 wyardley

I'm not personally doing anything to the GitHub Actions jobs at this time.

apparentlymart avatar Oct 03 '24 00:10 apparentlymart

And yes, I can see how it might have read that way. Updated the title above which hopefully helps make it more clear.

wyardley avatar Oct 03 '24 01:10 wyardley

https://github.com/opentofu/opentofu/blob/main/.github/workflows/checks.yml#L55-L59

We are using QEMU under the hood to execute binaries from alternate architectures in CI. This ties into binfmt_misc to provide the emulation layer.

That emulation might be using more memory than expected or just be slower to execute certain operations on certain architectures.
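On a Linux runner you can see which foreign architectures are wired up for emulation; a sketch (the exact entry names depend on how QEMU was registered):

```shell
# binfmt_misc registrations; qemu-* entries are emulated architectures
ls /proc/sys/fs/binfmt_misc/
# each entry shows the magic bytes matched and the interpreter invoked
cat /proc/sys/fs/binfmt_misc/qemu-aarch64
```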

cam72cam avatar Oct 03 '24 13:10 cam72cam

One thought for a low-tech option: we could take a pass at splitting the tofu code the integration test runs into two different testdata directories, turning it into two tests, maybe with one containing the tests that use the local provider, and the other the ones that use the random and aws providers? The existing test is kind of hard to parse anyway, since there are so many layers of aliasing, mocking, etc. going on.

One other thing I observed: running the tofu code here using Terraform 1.9.6 instead of OpenTofu (again, on the tf test, not the go integration test that runs it) results in 3 test failures. For at least one of those failures, I opened an issue with them about what I think might be a bug this turns up, unless I'm misunderstanding the way overrides are intended to work.

wyardley avatar Oct 03 '24 23:10 wyardley

Also, am I right in assuming that OpenTofu doesn't have access to macOS runners, which would let this run without emulation?

I do see that the release task runs on the larger-runners resource; if those are available, maybe configuring this test to run on a larger runner would work?

diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index e21ed4436b..d2eb072e64 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -154,7 +154,7 @@ jobs:
 
   e2etest-build:
     name: Build e2etest for ${{ matrix.goos }}_${{ matrix.goarch }}
-    runs-on: ubuntu-latest
+    runs-on: ${{ matrix.runson }}
     outputs:
       e2e-cache-key: ${{ steps.set-cache-values.outputs.e2e-cache-key }}
       e2e-cache-path: ${{ steps.set-cache-values.outputs.e2e-cache-path }}
@@ -164,14 +164,14 @@ jobs:
     strategy:
       matrix:
         include:
-          - {goos: "darwin", goarch: "amd64"}
-          - {goos: "darwin", goarch: "arm64"}
-          - {goos: "windows", goarch: "amd64"}
-          - {goos: "windows", goarch: "386"}
-          - {goos: "linux", goarch: "386"}
-          - {goos: "linux", goarch: "amd64"}
-          - {goos: linux, goarch: "arm"}
-          - {goos: linux, goarch: "arm64"}
+          - {goos: "darwin", goarch: "amd64", runson: larger-runners}
+          - {goos: "darwin", goarch: "arm64", runson: ubuntu-latest}
+          - {goos: "windows", goarch: "amd64", runson: ubuntu-latest}
+          - {goos: "windows", goarch: "386", runson: ubuntu-latest}
+          - {goos: "linux", goarch: "386", runson: ubuntu-latest}
+          - {goos: "linux", goarch: "amd64", runson: ubuntu-latest}
+          - {goos: linux, goarch: "arm", runson: ubuntu-latest}
+          - {goos: linux, goarch: "arm64", runson: ubuntu-latest}
       fail-fast: false
 
     env:

wyardley avatar Oct 04 '24 00:10 wyardley

I can check with @Yantrio next week what our budget is for the larger runners, which would probably be the easiest solution here.

Normalizing deviance around test failures is not a good path to be on.

cam72cam avatar Oct 04 '24 12:10 cam72cam

Side note: the reason the override tests behave differently with upstream Terraform is that it doesn't sandbox each run block, so unless you specifically tell it to replace the resource (via plan_options { replace = [...] }), the overrides won't take effect when the same resource name is reused.

Not sure if the divergence from terraform's behavior here is a feature or a bug, and whether or not it's worth filing an issue about?

This could also be the reason why it's faster (and, for the random provider especially, probably less resource intensive) with Terraform.

wyardley avatar Oct 07 '24 15:10 wyardley

I stumbled across this issue while googling the error message, might not be the right place to post this, but I'm experiencing this as part of terragrunt plan on an M2 MBP.

12:14:09.108 STDERR tofu: │ Error: timeout while waiting for plugin to start
12:14:09.108 STDERR tofu: │ 

It's inconsistent, though. I'd say 4/5 times it doesn't work, sometimes I get lucky.

My tofu --version output is:

OpenTofu v1.6.2
on darwin_amd64

lagerfeuer avatar Oct 23 '24 16:10 lagerfeuer

@lagerfeuer, that's very interesting. Do you mind if I ask what provider you are using?

cam72cam avatar Oct 23 '24 16:10 cam72cam

I'm using the following:

terraform {
  required_version = ">= 1.0.0, < 2.0.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    datadog = {
      source  = "DataDog/datadog"
      version = "~> 3.44"
    }
  }
}

I've also gotten a different error sporadically that may or may not be related:

11:45:23.668 STDERR tofu: │ Error: Failed to load plugin schemas
11:45:23.668 STDERR tofu: │ 
11:45:23.668 STDERR tofu: │ Error while loading schemas for plugin components: Failed to obtain
11:45:23.668 STDERR tofu: │ provider schema: Could not load the schema for provider
11:45:23.669 STDERR tofu: │ registry.opentofu.org/hashicorp/aws: failed to instantiate provider
11:45:23.669 STDERR tofu: │ "registry.opentofu.org/hashicorp/aws" to obtain schema: timeout while
11:45:23.669 STDERR tofu: │ waiting for plugin to start..
11:45:23.669 STDERR tofu: ╵

lagerfeuer avatar Oct 23 '24 16:10 lagerfeuer

@lagerfeuer Can you switch from darwin_amd64 to darwin_arm64? The emulation is likely causing too long of a delay.
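A quick way to confirm which architecture a given tofu binary targets (assuming tofu is on PATH):

```shell
# "x86_64" in the output on an Apple Silicon machine means Rosetta emulation
file "$(command -v tofu)"
# tofu also reports its platform directly, e.g. "on darwin_arm64"
tofu version
```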

cam72cam avatar Oct 29 '24 16:10 cam72cam

In case someone else finds this issue: I am using tofuenv to manage OpenTofu versions. It appears that the first time I installed v1.6.2, it was using darwin_amd64. I've updated tofuenv in the meantime, and reinstalling this version correctly installed darwin_arm64 now. I haven't done a lot of testing, but it seems to work now.

Thanks, and sorry for hijacking this issue @cam72cam.

lagerfeuer avatar Nov 08 '24 19:11 lagerfeuer

Seems like the Darwin e2e tests are now passing, e.g., https://github.com/opentofu/opentofu/actions/runs/15639168433/job/44062069048?

I'm going to close this, but feel free to reopen if there's a need to.

wyardley avatar Jun 15 '25 18:06 wyardley