Concurrent instance launch test is flaky
Please confirm
- [x] I have searched existing issues to check if an issue already exists for the bug I encountered.
Distribution
Ubuntu 22.04
Distribution version
GitHub runner
Output of "lxc info" or system info if it fails
N/A
Issue description
Since https://github.com/canonical/lxd/pull/15737, the test_concurrent tests have been failing intermittently for various reasons. We are also seeing similar issues in the lxd_benchmark_basic test.
To keep track of the various errors and help identify the issue:
test_concurrent failures on ZFS:
Error: Failed creating instance from image: Failed to run: zfs set refreservation=none lxdtest-JL1/images/024b940c940c055d69aee9645c005483c212752075fc16281252a41ccfa9b91a: fork/exec /usr/sbin/zfs: bad file descriptor
lxd_benchmark_basic failures:
Failed to start container 'benchmark-1': Failed to start device "eth0": Failed to create the veth interfaces "vethb8f2c314" and "vethfb87ff61": Failed adding link: Failed to run: ip link add name vethb8f2c314 up type veth peer name vethfb87ff61 address 00:16:3e:59:f4:9f: read |0: bad file descriptor
Also now starting to see errors on migration on ZFS with the same bad file descriptor error:
2025-07-07T10:47:01.3861089Z l1:c2: error: Failed to run: /home/runner/go/bin/lxd forkstart c2 /tmp/lxd-test.tmp.68fx/gH6/containers /tmp/lxd-test.tmp.68fx/gH6/logs/c2/lxc.conf: read |0: bad file descriptor
Steps to reproduce
N/A
Information to attach
- [ ] Any relevant kernel output (`dmesg`)
- [ ] Instance log (`lxc info NAME --show-log`)
- [ ] Instance configuration (`lxc config show NAME --expanded`)
- [ ] Main daemon log (at `/var/log/lxd/lxd.log` or `/var/snap/lxd/common/lxd/logs/lxd.log`)
- [ ] Output of the client with `--debug`
- [ ] Output of the daemon with `--debug` (or use `lxc monitor` while reproducing the issue)
@simondeziel @mihalicyn could we be running out of file descriptors in the GH test runners?
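If it is fd exhaustion, one quick check would be to compare the number of open descriptors against the soft `RLIMIT_NOFILE` limit inside the affected process on the runner. A minimal diagnostic sketch (standalone Go, not LXD code; the output format is just for illustration):

```go
// fdcheck prints the process's open file descriptor count against its
// RLIMIT_NOFILE limits, as a quick way to spot fd exhaustion.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		fmt.Fprintln(os.Stderr, "getrlimit:", err)
		os.Exit(1)
	}

	// Each entry in /proc/self/fd corresponds to one open descriptor.
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, "readdir:", err)
		os.Exit(1)
	}

	fmt.Printf("open fds: %d, soft limit: %d, hard limit: %d\n", len(entries), rl.Cur, rl.Max)
}
```

The same numbers for a running daemon can be read from `/proc/<pid>/fd` and `/proc/<pid>/limits`.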
Hit this with the zfs driver here: https://github.com/canonical/lxd/actions/runs/16113976891/job/45464219802?pr=15827
Seems to have been better today.
2025-07-09T16:40:39.0201361Z Error: Failed creating instance from image: Failed to run: zfs clone lxdtest-zAo/images/07402ec034c08201e74a530d4e2178e61ba0651090eb76e4d948b9a4b901922f@readonly lxdtest-zAo/containers/concurrent-23: read |0: bad file descriptor
https://productionresultssa14.blob.core.windows.net/actions-results/4c16486a-e999-45ac-b896-001c7c50b2d1/workflow-job-run-3f90f198-1f03-58f5-bdc6-5bbc49cef449/logs/job/job-logs.txt?rsct=text%2Fplain&se=2025-07-09T18%3A46%3A50Z&sig=TxjazoyNo%2B1djoS5n4s0ZuDqVGbhJpLp4LvHxapczMM%3D&ske=2025-07-10T05%3A50%3A55Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2025-07-09T17%3A50%3A55Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2025-05-05&sp=r&spr=https&sr=b&st=2025-07-09T18%3A36%3A45Z&sv=2025-05-05
Seems to be ZFS related.
Just seen it here on a btrfs run: https://github.com/canonical/lxd/actions/runs/16175768884/job/45660861642#step:15:93486
Good to know, thanks!
Another btrfs one: https://github.com/canonical/lxd/actions/runs/16175156579/job/45658813510?pr=15990
Another btrfs one: https://github.com/canonical/lxd/actions/runs/16261989043/job/45909663658#step:15:93851
I just ran into a bad file descriptor problem in the incremental_copy test, so this doesn't match the initial report affecting the concurrent test, but every previous report of "affected this run too" points to the incremental_copy test as well...
So either we have two separate bugs or they share a common root cause.
@simondeziel how about we disable the concurrent tests for now and see if that appeases the incremental_copy ones too? The incremental_copy test runs after the concurrent tests, so potentially there's some state being messed up somewhere between them.
That's an idea, but instead of skipping them, I think it'd be more useful to relocate them (#16117) and see if the problem moves to another test.
Failed here in basic_usage: https://productionresultssa12.blob.core.windows.net/actions-results/166cacff-aa8b-4cb8-8d0d-cbb34f67f388/workflow-job-run-6662515c-d410-53e0-94cf-4257f2c81d25/logs/job/job-logs.txt?rsct=text%2Fplain&se=2025-07-30T08%3A31%3A09Z&sig=eAxWL999vW9AxVeA8sAU9EIPAVTuauk38PR3yWwUCsg%3D&ske=2025-07-30T18%3A04%3A33Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2025-07-30T06%3A04%3A33Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2025-05-05&sp=r&spr=https&sr=b&st=2025-07-30T08%3A21%3A04Z&sv=2025-05-05
2025-07-29T20:03:35.7467942Z - Instance: c2: Failed to start device "eth0": Failed to create the veth interfaces "vethb2588155" and "veth40c334e1": Failed adding link: Failed to run: ip link add name vethb2588155 up type veth peer name veth40c334e1 address 00:16:3e:65:da:98: read |0: bad file descriptor
I wonder if this is a go runtime bug.
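One way to narrow that down would be a standalone reproducer that hammers fork/exec with pipes attached on the same runner image, outside of LXD. A rough sketch (the worker/iteration counts and the "true" command are arbitrary choices for illustration):

```go
// Stress-test concurrent fork/exec to see whether "bad file descriptor"
// errors can be reproduced outside of LXD.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"sync"
)

func main() {
	const workers = 64
	const iterations = 200

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for j := 0; j < iterations; j++ {
				// CombinedOutput wires up pipes for stdout/stderr, which is
				// the path where the "read |0: bad file descriptor" errors
				// have been surfacing.
				out, err := exec.Command("true").CombinedOutput()
				if err != nil && strings.Contains(err.Error(), "bad file descriptor") {
					fmt.Printf("worker %d iteration %d: %v (output: %q)\n", id, j, err, out)
				}
			}
		}(i)
	}
	wg.Wait()
}
```

If that never fails on the runner, the problem is more likely something LXD-specific (an fd being closed or reused while a command is in flight) than a Go runtime issue.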
Failed here in the lxd_benchmark test:
[Sep 11 19:43:55.567] Failed to start container 'benchmark-4': Failed to start device "eth0": Failed to create the veth interfaces "veth43f11667" and "vethedf37d76": Failed adding link: Failed to run: ip link add name veth43f11667 up type veth peer name vethedf37d76 address 00:16:3e:23:40:60: fork/exec /usr/sbin/ip: bad file descriptor
I wonder if replacing invocations of the ip command with direct netlink calls would help avoid the bad file descriptor issue.
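That would sidestep the fork/exec path entirely for veth creation. As a rough illustration of the approach, a standalone sketch using the vishvananda/netlink package (function name, interface names and MAC are placeholders, not LXD's actual network code):

```go
// Sketch: create a veth pair via netlink instead of shelling out to
// "ip link add name <host> up type veth peer name <peer> address <mac>".
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

func createVethPair(hostName, peerName, peerMAC string) error {
	hwAddr, err := net.ParseMAC(peerMAC)
	if err != nil {
		return err
	}

	veth := &netlink.Veth{
		LinkAttrs:        netlink.LinkAttrs{Name: hostName},
		PeerName:         peerName,
		PeerHardwareAddr: hwAddr, // MAC applied to the peer end, as with "peer name ... address ...".
	}

	// Equivalent of "ip link add ... type veth peer name ...".
	if err := netlink.LinkAdd(veth); err != nil {
		return err
	}

	// Equivalent of the "up" flag on the host-side interface.
	return netlink.LinkSetUp(veth)
}

func main() {
	// Requires CAP_NET_ADMIN; names below are placeholders.
	if err := createVethPair("vethhost0", "vethpeer0", "00:16:3e:00:00:01"); err != nil {
		log.Fatal(err)
	}
}
```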