Concurrent instance launch test is flaky
Please confirm
- [x] I have searched existing issues to check if an issue already exists for the bug I encountered.
Distribution
Ubuntu 22.04
Distribution version
GitHub runner
Output of "lxc info" or system info if it fails
N/A
Issue description
Since https://github.com/canonical/lxd/pull/15737, the test_concurrent tests have been failing intermittently for various reasons. We are also seeing similar issues in the lxd_benchmark_basic test.
To keep track of the various errors and help identify the issue:
test_concurrent failures on ZFS:
Error: Failed creating instance from image: Failed to run: zfs set refreservation=none lxdtest-JL1/images/024b940c940c055d69aee9645c005483c212752075fc16281252a41ccfa9b91a: fork/exec /usr/sbin/zfs: bad file descriptor
lxd_benchmark_basic failures:
Failed to start container 'benchmark-1': Failed to start device "eth0": Failed to create the veth interfaces "vethb8f2c314" and "vethfb87ff61": Failed adding link: Failed to run: ip link add name vethb8f2c314 up type veth peer name vethfb87ff61 address 00:16:3e:59:f4:9f: read |0: bad file descriptor
Also now starting to see errors on migration on ZFS with the same bad file descriptor error:
2025-07-07T10:47:01.3861089Z l1:c2: error: Failed to run: /home/runner/go/bin/lxd forkstart c2 /tmp/lxd-test.tmp.68fx/gH6/containers /tmp/lxd-test.tmp.68fx/gH6/logs/c2/lxc.conf: read |0: bad file descriptor
Steps to reproduce
N/A
Information to attach
- [ ] Any relevant kernel output (`dmesg`)
- [ ] Instance log (`lxc info NAME --show-log`)
- [ ] Instance configuration (`lxc config show NAME --expanded`)
- [ ] Main daemon log (at `/var/log/lxd/lxd.log` or `/var/snap/lxd/common/lxd/logs/lxd.log`)
- [ ] Output of the client with `--debug`
- [ ] Output of the daemon with `--debug` (or use `lxc monitor` while reproducing the issue)
@simondeziel @mihalicyn could we be running out of file descriptors in the GH test runners?
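If it is fd exhaustion, one quick check would be to compare the number of open descriptors against the soft `RLIMIT_NOFILE` limit inside the affected process on the runner. A minimal diagnostic sketch (standalone Go, not LXD code; the output format is just for illustration):

```go
// fdcheck prints the process's open file descriptor count against its
// RLIMIT_NOFILE limits, as a quick way to spot fd exhaustion.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		fmt.Fprintln(os.Stderr, "getrlimit:", err)
		os.Exit(1)
	}

	// Each entry in /proc/self/fd corresponds to one open descriptor.
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, "readdir:", err)
		os.Exit(1)
	}

	fmt.Printf("open fds: %d, soft limit: %d, hard limit: %d\n", len(entries), rl.Cur, rl.Max)
}
```

The same numbers for a running daemon can be read from `/proc/<pid>/fd` and `/proc/<pid>/limits`.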
Hit this with the zfs driver here: https://github.com/canonical/lxd/actions/runs/16113976891/job/45464219802?pr=15827
Seems to have been better today.
2025-07-09T16:40:39.0201361Z Error: Failed creating instance from image: Failed to run: zfs clone lxdtest-zAo/images/07402ec034c08201e74a530d4e2178e61ba0651090eb76e4d948b9a4b901922f@readonly lxdtest-zAo/containers/concurrent-23: read |0: bad file descriptor
https://productionresultssa14.blob.core.windows.net/actions-results/4c16486a-e999-45ac-b896-001c7c50b2d1/workflow-job-run-3f90f198-1f03-58f5-bdc6-5bbc49cef449/logs/job/job-logs.txt?rsct=text%2Fplain&se=2025-07-09T18%3A46%3A50Z&sig=TxjazoyNo%2B1djoS5n4s0ZuDqVGbhJpLp4LvHxapczMM%3D&ske=2025-07-10T05%3A50%3A55Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2025-07-09T17%3A50%3A55Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2025-05-05&sp=r&spr=https&sr=b&st=2025-07-09T18%3A36%3A45Z&sv=2025-05-05
Seems to be ZFS related.
Just seen it here on a btrfs run: https://github.com/canonical/lxd/actions/runs/16175768884/job/45660861642#step:15:93486
Good to know, thanks!
Another btrfs one: https://github.com/canonical/lxd/actions/runs/16175156579/job/45658813510?pr=15990
Another btrfs one: https://github.com/canonical/lxd/actions/runs/16261989043/job/45909663658#step:15:93851
I just ran into a bad file descriptor problem in the incremental_copy test, so this doesn't match the initial report affecting the concurrent test, but every previous report of "affected this run too" points to the incremental_copy test as well...
So either we have two separate bugs or they share a common root cause.
@simondeziel how about we disable the concurrent tests for now and see if that appeases the incremental_copy ones too? The incremental_copy test runs after the concurrent tests, so potentially there's some state being messed up somewhere between them.
That's an idea, but instead of skipping them, I think it'd be more useful to relocate them (#16117) and see if the problem moves to another test.
Failed here in basic_usage: https://productionresultssa12.blob.core.windows.net/actions-results/166cacff-aa8b-4cb8-8d0d-cbb34f67f388/workflow-job-run-6662515c-d410-53e0-94cf-4257f2c81d25/logs/job/job-logs.txt?rsct=text%2Fplain&se=2025-07-30T08%3A31%3A09Z&sig=eAxWL999vW9AxVeA8sAU9EIPAVTuauk38PR3yWwUCsg%3D&ske=2025-07-30T18%3A04%3A33Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2025-07-30T06%3A04%3A33Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2025-05-05&sp=r&spr=https&sr=b&st=2025-07-30T08%3A21%3A04Z&sv=2025-05-05
2025-07-29T20:03:35.7467942Z - Instance: c2: Failed to start device "eth0": Failed to create the veth interfaces "vethb2588155" and "veth40c334e1": Failed adding link: Failed to run: ip link add name vethb2588155 up type veth peer name veth40c334e1 address 00:16:3e:65:da:98: read |0: bad file descriptor
I wonder if this is a go runtime bug.
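One way to narrow that down would be a standalone reproducer that hammers fork/exec with pipes attached on the same runner image, outside of LXD. A rough sketch (the worker/iteration counts and the "true" command are arbitrary choices for illustration):

```go
// Stress-test concurrent fork/exec to see whether "bad file descriptor"
// errors can be reproduced outside of LXD.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"sync"
)

func main() {
	const workers = 64
	const iterations = 200

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for j := 0; j < iterations; j++ {
				// CombinedOutput wires up pipes for stdout/stderr, which is
				// the path where the "read |0: bad file descriptor" errors
				// have been surfacing.
				out, err := exec.Command("true").CombinedOutput()
				if err != nil && strings.Contains(err.Error(), "bad file descriptor") {
					fmt.Printf("worker %d iteration %d: %v (output: %q)\n", id, j, err, out)
				}
			}
		}(i)
	}
	wg.Wait()
}
```

If that never fails on the runner, the problem is more likely something LXD-specific (an fd being closed or reused while a command is in flight) than a Go runtime issue.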
Failed here in the lxd_benchmark test:
[Sep 11 19:43:55.567] Failed to start container 'benchmark-4': Failed to start device "eth0": Failed to create the veth interfaces "veth43f11667" and "vethedf37d76": Failed adding link: Failed to run: ip link add name veth43f11667 up type veth peer name vethedf37d76 address 00:16:3e:23:40:60: fork/exec /usr/sbin/ip: bad file descriptor
I wonder if replacing invocations of the ip command with direct netlink calls would help avoid the bad file descriptor issue.
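That would sidestep the fork/exec path entirely for veth creation. As a rough illustration of the approach, a standalone sketch using the vishvananda/netlink package (function name, interface names and MAC are placeholders, not LXD's actual network code):

```go
// Sketch: create a veth pair via netlink instead of shelling out to
// "ip link add name <host> up type veth peer name <peer> address <mac>".
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

func createVethPair(hostName, peerName, peerMAC string) error {
	hwAddr, err := net.ParseMAC(peerMAC)
	if err != nil {
		return err
	}

	veth := &netlink.Veth{
		LinkAttrs:        netlink.LinkAttrs{Name: hostName},
		PeerName:         peerName,
		PeerHardwareAddr: hwAddr, // MAC applied to the peer end, as with "peer name ... address ...".
	}

	// Equivalent of "ip link add ... type veth peer name ...".
	if err := netlink.LinkAdd(veth); err != nil {
		return err
	}

	// Equivalent of the "up" flag on the host-side interface.
	return netlink.LinkSetUp(veth)
}

func main() {
	// Requires CAP_NET_ADMIN; names below are placeholders.
	if err := createVethPair("vethhost0", "vethpeer0", "00:16:3e:00:00:01"); err != nil {
		log.Fatal(err)
	}
}
```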