finch
finch copied to clipboard
e2e tests failing intermittently
Describe the bug During the process of reviewing and merging https://github.com/runfinch/finch/pull/133, I noticed intermittent and unpredictable failures in the e2e tests. I addressed some of these issues in the PR itself, but some of the failures persist. In the repository's current state, I've seen two types of failures:
-
finch vm stop
timeouts Occasionally,finch vm stop
takes longer than 60 seconds to execute. I see such a timeout happen in about every other run of the e2e tests. The tests run this command ~20 times, and we run the e2e tests on 4 different hosts each test run. So it seems to happen on about 1% of stop commands. Because the tests assume that the vm is in a particular state before the test starts, this can also cause carry-over effects to other tests, which results in multiple tests failing. -
issues with BuildKit starting up
I've only seen this one time so far, but it didn't seem caused by a
finch vm stop
timeout. Below are the relevant logs:
time="2023-01-07T01:17:37Z" level=error msg="`buildctl` needs to be installed and `buildkitd` needs to be running, see https://github.com/moby/buildkit , and `containerd-rootless-setuptool.sh install-buildkit` for OCI worker or `containerd-rootless-setuptool.sh install-buildkit-containerd` for containerd worker" error="3 errors occurred:\n\t* failed to ping to host unix:///run/user/501/buildkit-finch/buildkitd.sock: exit status 1\n\t* failed to ping to host unix:///run/user/501/buildkit-default/buildkitd.sock: exit status 1\n\t* failed to ping to host unix:///run/user/501/buildkit/buildkitd.sock: exit status 1\n\n"
time="2023-01-07T01:17:37Z" level=fatal msg="no buildkit host is available, tried 3 candidates: 3 errors occurred:\n\t* failed to ping to host unix:///run/user/501/buildkit-finch/buildkitd.sock: exit status 1\n\t* failed to ping to host unix:///run/user/501/buildkit-default/buildkitd.sock: exit status 1\n\t* failed to ping to host unix:///run/user/501/buildkit/buildkitd.sock: exit status 1\n\n"
The test failures seem to happen more often on ARM hosts than x86 hosts. The ARM hosts running OSX 11.7 in particular seem to have issues with finch vm stop
timeouts.
Steps to reproduce Run e2e tests
Expected behavior Tests should succeed consistently
Screenshots or logs If applicable, add screenshots or logs to help explain your problem.
Additional context Add any other context about the problem here.
Failure captured in my PR:
Finch CLI e2e Tests Run a container image when running a container that echos dummy output should not echo dummy output if running with -d flag
/Users/ec2-user/go/pkg/mod/github.com/runfinch/[email protected]/tests/run.go:59
d64e113339f9d2b97d87dea6eeedc776d1bc5679bbccc52f75f876584ec586f9
No containers to be removed
ce14a6258f37
No images to be removed
9bb07d084c6f3455a2e739b9d4c62e7171fe4312db73846dab17f46da769c919
bridge
host
none
No networks to be removed
time="2023-01-11T03:20:37Z" level=error msg="`buildctl` needs to be installed and `buildkitd` needs to be running, see https://github.com/moby/buildkit , and `containerd-rootless-setuptool.sh install-buildkit` for OCI worker or `containerd-rootless-setuptool.sh install-buildkit-containerd` for containerd worker" error="3 errors occurred:\n\t* failed to ping to host unix:///run/user/501/buildkit-finch/buildkitd.sock: exit status 1\n\t* failed to ping to host unix:///run/user/501/buildkit-default/buildkitd.sock: exit status 1\n\t* failed to ping to host unix:///run/user/501/buildkit/buildkitd.sock: exit status 1\n\n"
time="2023-01-11T03:20:37Z" level=fatal msg="no buildkit host is available, tried 3 candidates: 3 errors occurred:\n\t* failed to ping to host unix:///run/user/501/buildkit-finch/buildkitd.sock: exit status 1\n\t* failed to ping to host unix:///run/user/501/buildkit-default/buildkitd.sock: exit status 1\n\t* failed to ping to host unix:///run/user/501/buildkit/buildkitd.sock: exit status 1\n\n"
time="2023-01-11T03:20:37Z" level=fatal msg="exit status 1"
[FAILED] in [BeforeEach] - /Users/ec2-user/go/pkg/mod/github.com/runfinch/[email protected]/command/command.go:122 @ 01/11/23 03:20:37.807
d64e113339f9d2b97d87dea6eeedc776d1bc5679bbccc52f75f876584ec586f9
No containers to be removed
ce14a6258f37
No images to be removed
9bb07d084c6f3455a2e739b9d4c62e7171fe4312db73846dab17f46da769c919
bridge
host
none
No networks to be removed
• [FAILED] [1.103 seconds]
Finch CLI e2e Tests Run a container image when running a container that echos dummy output [BeforeEach] should not echo dummy output if running with -d flag
[BeforeEach] /Users/ec2-user/go/pkg/mod/github.com/runfinch/[email protected]/tests/run.go:45
[It] /Users/ec2-user/go/pkg/mod/github.com/runfinch/[email protected]/tests/run.go:59
[FAILED] Expected
<int>: 1
to match exit code:
<int>: 0
In [BeforeEach] at: /Users/ec2-user/go/pkg/mod/github.com/runfinch/[email protected]/command/command.go:122 @ 01/11/23 03:20:37.807
This set of 4 failures also often appear together (e.g., also in some dependabot PRs). One of them is caused by vm stop
timeout, and the remaining 3 are caused by buildkit.
[FAIL] Finch CLI e2e Tests Run a container image when running a container that echos dummy output [BeforeEach] should echo dummy output
/Users/ec2-user/go/pkg/mod/github.com/runfinch/[email protected]/command/command.go:122
[FAIL] Finch CLI e2e Tests Run a container image when running a container that echos dummy output [BeforeEach] should not echo dummy output if running with -d flag
/Users/ec2-user/go/pkg/mod/github.com/runfinch/[email protected]/command/command.go:122
[FAIL] Finch CLI e2e Tests Run a container image when running a container with environment related flags [It] with --entrypoint flag, ENTRYPOINT in dockerfile should not be executed
/Users/ec2-user/go/pkg/mod/github.com/runfinch/[email protected]/command/command.go:122
[FAIL] Finch CLI e2e Tests Config [It] updates config values when a config file is present
/Users/ec2-user/go/pkg/mod/github.com/runfinch/[email protected]/command/command.go:115
Interesting, from https://github.com/runfinch/finch/pull/154, it seems like a host encountered a vm stop timeout command and then proceeded to fail more with a buildkit related error in a "refreshed" run.
A couple of strange failures I've run into on #154, both on the ARM 11.7 runners:
- Stop-start-stop: looks like the stop command registers failure, start command starts up, and somehow the stop command notices the VM has changed state and stops it immediately?
[FAILED] in [It] - /Users/ec2-user/go/pkg/mod/github.com/runfinch/[email protected]/command/command.go:115 @ 01/13/23 00:09:08.278
time="2023-01-13T00:09:08Z" level=fatal msg="the instance \"finch\" is already stopped"
time="2023-01-13T00:09:08Z" level=info msg="Starting existing Finch virtual machine..."
time="2023-01-13T00:09:09Z" level=info msg="Finch virtual machine stopped successfully"
time="2023-01-13T00:09:10Z" level=error msg="Finch virtual machine failed to start, debug logs: time=\"2023-01-13T00:09:08Z\" level=info msg=\"Using the existing instance \\\"finch\\\"\"\ntime=\"2023-01-13T00:09:08Z\" level=info msg=\"Attempting to download the nerdctl archive from \\\"https://github.com/containerd/nerdctl/releases/download/v1.1.0/nerdctl-full-1.1.0-linux-arm64.tar.gz\\\"\" digest=\"sha256:3b613a1be5a24460c44bb93a3609b790ada94e06efd1a86467d45bec7da8b449\"\ntime=\"2023-01-13T00:09:08Z\" level=info msg=\"Using cache \\\"/Users/ec2-user/Library/Caches/lima/download/by-url-sha256/c69be86b3e48430e3d687d54361cf15b90e2c067fae6a294ca6292d41f42bf0e/data\\\"\"\ntime=\"2023-01-13T00:09:10Z\" level=warning msg=\"[hostagent] Reducing the guest memory from 4GiB to 3GiB, to avoid host kernel panic on macOS <= 12.3 with QEMU >= 7.0; Please update macOS to 12.4 or later, or downgrade QEMU to 6.2; See https://github.com/lima-vm/lima/issues/795 for the further background.\"\ntime=\"2023-01-13T00:09:10Z\" level=info msg=\"[hostagent] Mounting disk \\\"finch\\\" on \\\"/mnt/lima-finch\\\"\"\ntime=\"2023-01-13T00:09:10Z\" level=fatal msg=\"exiting, status={Running:false Degraded:false Exiting:true Errors:[] SSHLocalPort:0} (hint: see \\\"/Users/ec2-user/ar/_work/finch/finch/_output/lima/data/finch/ha.stderr.log\\\")\"\n"
time="2023-01-13T00:09:10Z" level=fatal msg="exit status 1"
- Finch compose down failure:
[FAILED] [2.657 seconds]
Finch Container Development E2E Tests Compose down command [It] should stop compose services and delete volumes by specifying --volumes flag
/Users/ec2-user/go/pkg/mod/github.com/runfinch/[email protected]/tests/compose_down.go:45
[FAILED] Expected
<[]uint8 | len:13, cap:13>: [100, 53, 98, 54, 98, 102, 52, 52, 100, 52, 49, 97, 10]
to be empty
In [It] at: /Users/ec2-user/go/pkg/mod/github.com/runfinch/[email protected]/tests/tests.go:134