
ci: several changes related to new qemu test, ansible-lint, python versions, ubuntu versions


There is a new QEMU-based test which uses the qemu/kvm capability of GitHub Actions runners. This is the basis for new bootc/image-mode tests which we will be rolling out in the near future.
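
For context, the kvm capability usually has to be enabled explicitly on GitHub-hosted Ubuntu runners; the snippet below is a sketch of the commonly documented technique, not necessarily what the linked PRs do.

```yaml
# Sketch only: one commonly documented way to expose /dev/kvm to the
# runner user on GitHub-hosted Ubuntu runners; the actual workflow steps
# in the linked PRs may differ.
- name: Enable KVM group perms
  run: |
    echo 'KERNEL=="kvm", GROUP="kvm", MODE="0666", OPTIONS+="static_node=kvm"' \
      | sudo tee /etc/udev/rules.d/99-kvm4all.rules
    sudo udevadm control --reload-rules
    sudo udevadm trigger --name-match=kvm
```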

ansible-lint requires the collection path to be set so that the requirements it installs end up in the correct place.
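
For illustration, a minimal sketch of what that can look like in a workflow step, assuming ansible-lint honors ANSIBLE_COLLECTIONS_PATH when installing its requirements; the step name and path are made up, not the contents of the linked PRs.

```yaml
# Hypothetical step; the path is illustrative only. With the collection
# path set, the requirements ansible-lint installs land under this
# directory instead of some default location.
- name: Run ansible-lint with an explicit collection path
  env:
    ANSIBLE_COLLECTIONS_PATH: ${{ github.workspace }}/.collections
  run: ansible-lint -v
```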

GitHub Actions has deprecated some Python and Ubuntu versions, which we have had to address.

Remove CONTRIBUTOR from the list of users who can trigger citest.

For more information, see

  • https://github.com/linux-system-roles/.github/pull/98
  • https://github.com/linux-system-roles/.github/pull/94
  • https://github.com/linux-system-roles/.github/pull/93
  • https://github.com/linux-system-roles/.github/pull/92
  • https://github.com/linux-system-roles/.github/pull/91

richm · Apr 08 '25 18:04

Uh, qemu tests run into ENOSPC on the host. I suppose we can tone down the space requirements in the test and make smaller images etc., but I haven't looked yet. I can if you want me to.

martinpitt · Apr 09 '25 10:04

On F42, the runner did not hit ENOSPC; it just failed the tests. The first failing test in CI is tests/tests_luks.yml. Running tox -e qemu-ansible-core-2.17 -- --image-name fedora-42 --log-level=debug --debug tests/tests_luks.yml by itself passes, so this is some complication in batch mode.

On F42, I see a large "/dev/dm-1 4.0G 110M 3.9G 3% /opt/test1" being created during the test in the VM (PV on /dev/sda), but it does get cleaned up again. I'm running watch -n1 du -hsc /tmp/tmp* on the host to check the usage of the sparse backing files, but curiously they always stay at 0 bytes, i.e. they aren't even being used; they are also cleaned up properly after the test ends.

I cannot reproduce the "Unable to find enough unused disks. Exiting playbook." either.

For the ENOSPC on C9, I ran tox -e qemu-ansible-core-2.16 -- --image-name centos-9 --log-level=debug --debug tests/tests_filesystem_one_disk.yml locally, and that also passed and didn't leave cruft behind.

Moving this to draft; most likely we'll just drop these tests on GH for storage, as they are just too big. But I want to experiment with this a bit on my fork, getting an interactive shell on a GH runner. I'll see whether running a subset of tests works, or what is actually hitting ENOSPC.

My current idea is to generalize --skip-tags tests::infiniband to --skip-tags tests::no-github, and mark tests accordingly. An initial run with most tests skipped looks quite promising (the c10s failure is a VM download glitch, unrelated).

Next attempt: a run skipping tests::nvme,tests::scsi (see the tagging sketch below).
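
To illustrate the tagging idea, here is a hypothetical sketch of how a test play could be marked so that the runner's --skip-tags tests::no-github (or tests::nvme,tests::scsi) excludes it; the real storage tests may structure this differently.

```yaml
# Hypothetical sketch; the tag name follows the convention above, the rest
# is made up. A play tagged like this is skipped by
# --skip-tags tests::no-github on GitHub-hosted runners.
- name: Test something too heavy for GitHub-hosted runners
  hosts: all
  tags:
    - tests::no-github
  tasks:
    - name: Placeholder for the actual test tasks
      ansible.builtin.debug:
        msg: would run only where tests::no-github is not skipped
```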

martinpitt · Apr 11 '25 08:04

@richm the run on my fork is pretty okay: some "honest" failures on f42 which I haven't looked into yet. But C9/C10 pass now, and it's a sane set of tests. This includes the change from https://github.com/linux-system-roles/.github/pull/99, but if you prefer, I'm happy to remove that here. Or feel free to remove it yourself if you'd like to land this soon otherwise.

martinpitt · Apr 14 '25 17:04

f42 failures:

✅ Unused disk

Many, many tests fail with

Unable to find enough unused disks. Exiting playbook

This must be some leakage in batch mode. Running these individually works fine. I already investigated this a bit above, but no result yet. This will be my task today.

Fixed in the "tests: Fix stratis block device clean up" commit.
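
For the record, a rough sketch of the kind of teardown such a fix implies; this is an assumption about the approach, not the contents of that commit, and the variable names (_stratis_pool_info, unused_disks) are just borrowed from the test output in this thread.

```yaml
# Hypothetical teardown sketch, not the actual commit: destroy any
# leftover Stratis pool and wipe its member disks so later tests in the
# batch still find enough unused disks.
- name: Destroy leftover Stratis pools
  ansible.builtin.command: stratis pool destroy {{ item.name }}
  loop: "{{ _stratis_pool_info.pools | default([]) }}"
  changed_when: true

- name: Wipe signatures from the freed disks
  ansible.builtin.command: wipefs -a /dev/{{ item }}
  loop: "{{ unused_disks | default([]) }}"
  changed_when: true
```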

tests_stratis

This has an "honest" failure:

The conditional check 'storage_test_pool.name in _stratis_pool_info.pools[0]['blockdevs']['datadevs'][0]['key_description']' failed. The error was: error while evaluating conditional (storage_test_pool.name in _stratis_pool_info.pools[0]['blockdevs']['datadevs'][0]['key_description']): 'dict object' has no attribute 'key_description'

This could be related to https://github.com/stratis-storage/stratisd/issues/3805 and be due to an API change. This reproduces locally.

tests_resize

failure:

TASK [linux-system-roles.storage : Manage the pools and volumes to match the specified state] *** volume 'foo-test1' cannot be resized from 9 GiB to 5 GiB: new size same as old size

This does not reproduce in isolation with tox -e qemu-ansible-core-2.17 -- --image-name fedora-42 --log-level=debug tests/tests_resize.yml

tests_lvm_pool_pv_grow.yml

failure:

/dev/foo: already exists in filesystem

This smells like another cleanup failure. Now, just about every test uses "foo" as the pool name, but I figure it's caused by one of the other failures. I'll go through the "Clean up" actions once again with a fine-tooth comb and see which ones need fixing.

When I run it locally in isolation, it fails differently:

TASK [Verify each PV size] ***************************************************************************************************************************
task path: /var/home/martin/upstream/lsr/storage/tests/verify-pool-member-pvsize.yml:18
Tuesday 15 April 2025  10:45:41 +0200 (0:00:00.413)       0:00:36.427 ********* 
fatal: [/var/home/martin/.cache/linux-system-roles/fedora-42.qcow2]: FAILED! => {
    "assertion": "(dev_size.bytes - actual_pv_size.stdout | int) <= (4194304 + pv_pe_start.stdout | int)",
    "changed": false,
    "evaluated_to": false
}

MSG:

Unexpected difference between PV size and block device size: (device size: 1099511627776) (actual PV size:   8585740288)

The device reports 1099511627776 bytes (1 TiB) while the PV is only about 8 GiB, far outside the allowed slack of 4 MiB plus the PE start. #519 should fix the CI run to fail "correctly".

✅ tests_remove_mount.yml

failure:

cannot remove members 'sda' from pool 'foo' in safe mode

This also smells like a cleanup failure; it doesn't fail in isolation. Fixed by #519.
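
For context, the "safe mode" error means the role is being asked to remove pool members it did not create in this run, which is exactly what a pool leaked by a previous batch test looks like. Below is a hedged sketch of the usual cleanup invocation using the role's documented variables; the disk list variable and task wording are assumptions, not the actual test code.

```yaml
# Hypothetical cleanup sketch using the role's documented interface; the
# real test cleanup may differ. storage_safe_mode: false lets the role
# remove members (e.g. sda) that the current play did not create.
- name: Remove the leftover pool from a previous test
  ansible.builtin.include_role:
    name: linux-system-roles.storage
  vars:
    storage_safe_mode: false
    storage_pools:
      - name: foo
        disks: "{{ unused_disks }}"
        state: absent
        volumes: []
```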

✅ tests_existing_lvm_pool.yml

failure:

cannot remove members 'sda' from pool 'foo' in safe mode

Exactly the same failure as tests_remove_mount.yml, so a cleanup failure. It also passes locally in isolation.

Fixing the tests_resize cleanup didn't fix the follow-ups above. Testing a run without stratis and a run without the resize tests, to see which one breaks things. Update: neither change helped at all; the failures are independent.

Fixed by #519.

martinpitt · Apr 15 '25 05:04

OK, this run is down to 5 failures from the previous 26; i.e. the "Unable to find enough unused disks" failures are gone with the stratis cleanup fix. This may be good enough to land now, also to avoid regressing the c9/c10 tests?

martinpitt · Apr 15 '25 08:04

> tests_stratis
>
> This has an "honest" failure:
>
> The conditional check 'storage_test_pool.name in _stratis_pool_info.pools[0]['blockdevs']['datadevs'][0]['key_description']' failed. The error was: error while evaluating conditional (storage_test_pool.name in _stratis_pool_info.pools[0]['blockdevs']['datadevs'][0]['key_description']): 'dict object' has no attribute 'key_description'
>
> This could be related to stratis-storage/stratisd#3805 and be due to an API change. This reproduces locally.

This is caused by the changes to Stratis metadata and encryption in Stratis 3.8.0, but the problem should only be in the tests in verify-pool-stratis, not in the role itself (or in blivet). I'll try to find a way to verify the test results that works with both old and new Stratis (roughly along the lines sketched below).
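
One possibility is to only assert on key_description when the field is still present, so both pre- and post-3.8 metadata pass; this is a hypothetical sketch based on the variable names in the failing assertion above, not the actual change to verify-pool-stratis.yml.

```yaml
# Hypothetical version-tolerant check; variable names are taken from the
# failing assertion above, the structure of the real fix may differ.
- name: Check the key description only when the old metadata layout exposes it
  ansible.builtin.assert:
    that:
      - storage_test_pool.name in
        _stratis_pool_info.pools[0].blockdevs.datadevs[0].key_description
  when: "'key_description' in _stratis_pool_info.pools[0].blockdevs.datadevs[0]"
```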

vojtechtrefny · Apr 15 '25 10:04

> OK, this run is down to 5 failures from the previous 26; i.e. the "Unable to find enough unused disks" failures are gone with the stratis cleanup fix. This may be good enough to land now, also to avoid regressing the c9/c10 tests?

Yes. Once you push the ansible-lint fix, we can merge this.

richm · Apr 15 '25 13:04

Pushed the ansible-lint fix. I was trying to make some more progress on the other failures, but they are fighting back well. So yeah, let's address these in follow-ups.

martinpitt · Apr 15 '25 13:04