
CI: vfkit broken on macos tahoe runners

Open Luap99 opened this issue 3 months ago • 15 comments

I think vfkit v0.6.2 might be broken at least in our CI env.

  → Enter [It] no settings should change if no flags - /Users/MacM1-5-worker/ci/task-6413995885723648/pkg/machine/e2e/set_test.go:96 @ 01/07/26 09:22:16.64
  /Users/MacM1-5-worker/ci/task-6413995885723648/bin/darwin/podman machine init --disk-size 11 --image /private/tmp/ci/podman-machine.aarch64.applehv.raw a509bf48caa4
  Machine init complete
  To start your machine run:

  	podman machine start a509bf48caa4

  /Users/MacM1-5-worker/ci/task-6413995885723648/bin/darwin/podman machine set a509bf48caa4
  /Users/MacM1-5-worker/ci/task-6413995885723648/bin/darwin/podman machine start a509bf48caa4
  Starting machine "a509bf48caa4"
  Error: vfkit exited unexpectedly with exit code 1

  [FAILED] Expected
      <int>: 125
  to match exit code:
      <int>: 0
  In [It] at: /Users/MacM1-5-worker/ci/task-6413995885723648/pkg/machine/e2e/set_test.go:112 @ 01/07/26 09:22:24.151

  Full Stack Trace
    github.com/containers/podman/v6/pkg/machine/e2e_test.init.func18.3()
    	/Users/MacM1-5-worker/ci/task-6413995885723648/pkg/machine/e2e/set_test.go:112 +0x3fc

https://api.cirrus-ci.com/v1/artifact/task/6413995885723648/html/machine-applehv-podman-darwin-rootless-host.log.html#t--podman-machine-set-set-machine-cpus--disk--memory--1

I guess someone needs to run that with log level debug to get the error message?

cc @baude @ashley-cui @cfergeau

Luap99 avatar Jan 07 '26 12:01 Luap99

Thanks for the heads up! I created a v0.6.2 tag yesterday but haven’t fully finished the release yet, so it’s good to have early feedback! I’ve just tested podman 5.7.1 with the unsigned binary from https://github.com/crc-org/vfkit/actions/runs/20755542863 and podman machine start worked fine, so it’s not totally broken. I’ll take a closer look at the output from your tests.

cfergeau avatar Jan 07 '26 16:01 cfergeau

I cannot reproduce these failures locally on my M1. I’m not sure what fails in CI; I’ll try again tomorrow.

cfergeau avatar Jan 07 '26 16:01 cfergeau

To be clear, I am not confident that vfkit is the issue. I just noticed the new bump, so I thought it was related. Without a reproducer it might still be related to something else. I know I bumped our macOS runners to Tahoe, but that was on Monday; maybe it is related to that, since we didn't have other runs since then.

Luap99 avatar Jan 07 '26 17:01 Luap99

OK, I finally checked the failing macOS worker: it runs vfkit 0.6.1, unless I am missing something about the CI setup.

So the version bump looks like a red herring to me, but it does seem to fail on our worker with Tahoe, so I guess we need to try to reproduce on that. Maybe it is something related to the AWS image we pull in there.

ProductName:		macOS
ProductVersion:		26.2
BuildVersion:		25C56

Luap99 avatar Jan 07 '26 18:01 Luap99

I cannot reproduce this on my Mac with:

➜  ~ podman -v
podman version 5.7.1
➜  ~ vfkit -v
vfkit version: v0.6.1
➜  ~ sw_vers
ProductName:		macOS
ProductVersion:		26.2
BuildVersion:		25C56

Could this be an AMI image issue? @Luap99

timcoding1988 avatar Jan 07 '26 19:01 timcoding1988

@cfergeau did you guys hit anything like this in your testing with Tahoe?

baude avatar Jan 07 '26 20:01 baude

@cfergeau did you guys hit anything like this in your testing with tahoe?

Testing on Tahoe is not as extensive as we’d like as the only macos runners with virtualization support are limited to macos 15. I should take a look at what you are using for your macOS e2e tests, maybe vfkit could use something similar.

There are no regressions that I know of when moving from 15 to 26, or from vfkit 0.6.x to 0.6.2, but I’m trying to reproduce this issue to get a better idea of what’s going on. I’m running the e2e tests with podman 5.7.1 for now, with no failures so far. This report was against the main branch though, so I’ll be testing that next.

cfergeau avatar Jan 08 '26 08:01 cfergeau

With a local run on Tahoe/M1/vfkit 0.6.2, I can’t reproduce the failures with either podman 5.7.1 or podman main:

Summarizing 1 Failure:
  [PANICKED!] podman machine compose [It] compose test environment variable setup
  /opt/homebrew/Cellar/go/1.25.5/libexec/src/runtime/panic.go:115

The failure looks like a test issue, not something vfkit-related:

  [PANICKED] Test Panicked
  In [It] at: /opt/homebrew/Cellar/go/1.25.5/libexec/src/runtime/panic.go:115 @ 01/08/26 10:04:41.637

  runtime error: index out of range [1] with length 1

  Full Stack Trace
    github.com/containers/podman/v6/pkg/machine/e2e_test.init.func7.1()
    	/Users/teuf/dev/podman/pkg/machine/e2e/compose_test.go:44 +0x6a8

I’m using this to run the test locally:

TMPDIR=/private/tmp make ginkgo-run GINKGO_PARALLEL=n TAGS="remote exclude_graphdriver_btrfs containers_image_openpgp" GINKGO_FLAKE_ATTEMPTS=0 FOCUS_FILE= GINKGOWHAT=pkg/machine/e2e/.

cfergeau avatar Jan 08 '26 09:01 cfergeau

Just realized that most of the local tests with the main branch were running with krunkit so I don’t know if I tested the right thing.

cfergeau avatar Jan 08 '26 09:01 cfergeau

Sorry, I just saw that https://github.com/containers/podman/pull/27875 passes while https://github.com/containers/podman/pull/27872 (the vfkit code update) fails. I had assumed that since the binary execution failed (and we don't update the binary on the runner based on the PR), it must be related to the environment, not the code change itself. But maybe the new vfkit code passes an argument that the older vfkit binary cannot understand?
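
One way to check a mismatch like this could be to compare the version reported by the installed binary with the version of the vendored Go code. The sketch below is only an illustration: it parses output of the form `vfkit version: v0.6.1` (the format shown by `vfkit -v` earlier in this thread) and compares it to an assumed vendored version of 0.6.2. The function names `parseVfkitVersion` and `olderThan` are hypothetical and are not part of podman or vfkit.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVfkitVersion extracts major, minor and patch from a line such as
// "vfkit version: v0.6.1". Hypothetical helper; it rejects anything that is
// not a plain three-part numeric version.
func parseVfkitVersion(out string) ([3]int, error) {
	var v [3]int
	_, rest, found := strings.Cut(strings.TrimSpace(out), "version:")
	if !found {
		return v, fmt.Errorf("unexpected version output: %q", out)
	}
	rest = strings.TrimPrefix(strings.TrimSpace(rest), "v")
	parts := strings.SplitN(rest, ".", 3)
	if len(parts) != 3 {
		return v, fmt.Errorf("unexpected version format: %q", rest)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return v, fmt.Errorf("non-numeric version component %q: %w", p, err)
		}
		v[i] = n
	}
	return v, nil
}

// olderThan reports whether version a sorts strictly before version b.
func olderThan(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

func main() {
	// Assumption: the vendored Go code corresponds to vfkit 0.6.2.
	vendored := [3]int{0, 6, 2}

	installed, err := parseVfkitVersion("vfkit version: v0.6.1")
	if err != nil {
		fmt.Println("could not parse version:", err)
		return
	}
	if olderThan(installed, vendored) {
		fmt.Println("installed vfkit binary is older than the vendored Go code")
	}
}
```

A check like this only helps with diagnosis. It does not by itself make newer Go code work with an older binary.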

Luap99 avatar Jan 08 '26 10:01 Luap99

Yeah, the issue is that the new vfkit code forces a new option AFAICT, so this upgrade is not backwards compatible.

Since we have little control over how podman is packaged, we cannot enforce that the vfkit Go code in podman matches the binary version on the host, so we must fix this in a way that preserves backwards compatibility.

Error: unknown option for virtio-net devices: type
Usage:
  vfkit [flags]
...
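
A minimal sketch of one backwards-compatible approach, and not necessarily what the actual vfkit fix does: only emit the newer option when it carries a non-default value, so that an older binary never sees an option it does not know. The `netDevice` type, its fields, the `toCmdLine` method, the value names `legacy` and `newkind`, and the simplified argument format are all hypothetical and only stand in for the real vfkit configuration code.

```go
package main

import (
	"fmt"
	"strings"
)

// netDevice is a hypothetical, simplified stand-in for a virtio-net device
// configuration. It is not the real vfkit type.
type netDevice struct {
	MAC  string
	Type string // hypothetical newer option; "" means "use the default"
}

// defaultNetType is an assumed name for the behaviour older binaries already
// implement without being told.
const defaultNetType = "legacy"

// toCmdLine renders the device as a comma-separated argument. The "type"
// option is omitted when it is unset or equal to the default, so binaries
// that predate the option still accept the argument.
func (d netDevice) toCmdLine() string {
	opts := []string{"virtio-net"}
	if d.MAC != "" {
		opts = append(opts, "mac="+d.MAC)
	}
	if d.Type != "" && d.Type != defaultNetType {
		opts = append(opts, "type="+d.Type)
	}
	return strings.Join(opts, ",")
}

func main() {
	// Example MAC address, chosen only for illustration.
	old := netDevice{MAC: "5a:94:ef:e4:0c:ee", Type: defaultNetType}
	newer := netDevice{MAC: "5a:94:ef:e4:0c:ee", Type: "newkind"}

	fmt.Println(old.toCmdLine())   // no "type=" option is emitted
	fmt.Println(newer.toCmdLine()) // "type=newkind" is emitted
}
```

With this shape, only configurations that actually need the newer behaviour require a newer binary; default configurations keep producing the argument that older binaries already understand.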

Luap99 avatar Jan 08 '26 10:01 Luap99

Thanks for the investigation. Hopefully https://github.com/cfergeau/vfkit/commit/f504b6ac1b74c114279b652c7a0de8f65bcc22b8 will fix this, I need to test it.

cfergeau avatar Jan 08 '26 17:01 cfergeau

I’m using this to run the test locally:

TMPDIR=/private/tmp make ginkgo-run GINKGO_PARALLEL=n TAGS="remote exclude_graphdriver_btrfs containers_image_openpgp" GINKGO_FLAKE_ATTEMPTS=0 FOCUS_FILE= GINKGOWHAT=pkg/machine/e2e/.

CONTAINERS_MACHINE_PROVIDER=applehv also needs to be set on the main branch in order to use vfkit in e2e tests. With this + go get github.com/crc-org/vfkit && go mod tidy && go mod vendor, and with vfkit 0.6.1 installed, I was able to reproduce. https://github.com/cfergeau/vfkit/commit/f504b6ac1b74c114279b652c7a0de8f65bcc22b8 solves the issue.

cfergeau avatar Jan 08 '26 17:01 cfergeau

This should be fixed if podman uses the go code from github.com/crc-org/[email protected]

cfergeau avatar Jan 09 '26 16:01 cfergeau

Thank you @cfergeau!

Luap99 avatar Jan 09 '26 17:01 Luap99