osbuild-composer
Enable 8.8 and 9.2 test runners
This pull request includes:
- [ ] adequate testing for the new functionality or fixed issue
- [ ] adequate documentation informing people about the change, such as:
  - [ ] submit a PR for the guides repository if this PR changed any behavior described there: https://www.osbuild.org/guides/
FTR 9.2 nightly pipeline: https://gitlab.com/redhat/services/products/image-builder/ci/osbuild-composer/-/pipelines/698831011
Upgrade test failed
For the regular pipeline: ostree failures should be resolved in #3114.
I will look into the failing upgrade test.
> For the regular pipeline: ostree failures should be resolved in #3114.
I'm afraid #3114 only fixes the RAM issue; the RHEL 9 issue is:
ERROR loader attribute 'readonly' cannot be specified when firmware autoselection is enabled
which is probably related to a new virt-install version. cc @henrywang
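For context, a minimal sketch of the kind of conflict that error points at, assuming it comes from mixing libvirt firmware autoselection with explicit loader sub-options; the VM names, disk, OVMF path and other options below are illustrative, not the exact CI invocation:

```bash
# Older explicit-loader style: the caller spells out the OVMF loader and
# marks it read-only. With firmware autoselection enabled, libvirt rejects
# the explicit readonly/type attributes (the error quoted above).
virt-install \
    --name edge-test-explicit \
    --memory 2048 --vcpus 2 \
    --disk path=disk.qcow2 --import --noautoconsole \
    --boot loader=/usr/share/edk2/ovmf/OVMF_CODE.fd,loader.readonly=yes,loader.type=pflash

# Autoselection style: let libvirt pick the firmware itself and do not pass
# loader.readonly/loader.type alongside it. Use one style or the other.
virt-install \
    --name edge-test-autoselect \
    --memory 2048 --vcpus 2 \
    --disk path=disk.qcow2 --import --noautoconsole \
    --boot uefi
```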
is this ready? we need it for #3130
> is this ready? we need it for #3130
No. There is a problem with the 9.1 and 8.7 GA runners, see https://coreos.slack.com/archives/C0235DZB0DT/p1669643319777379
In `ostree-simplified-installer.sh` line 895 there is a condition for rhel-9.1. I guess this should be updated to 9.2?
@henrywang can you advise?
> In `ostree-simplified-installer.sh` line 895 there is a condition for rhel-9.1. I guess this should be updated to 9.2?
> @henrywang can you advise?
It's about the ignition test that I did for #3161; it should be updated to rhel-9.2.
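For illustration only, a hedged sketch of the kind of version gate being discussed; the actual condition around line 895 of `ostree-simplified-installer.sh` may be written differently, and the `/etc/os-release` variables below are an assumption, not taken from the script:

```bash
# Hypothetical sketch - not the real line 895. Assumes the script has
# sourced /etc/os-release so that ID and VERSION_ID are available.
source /etc/os-release

if [[ "${ID}-${VERSION_ID}" == "rhel-9.2" ]]; then
    # Run the ignition-specific checks introduced for #3161 on RHEL 9.2
    # (previously gated on rhel-9.1).
    echo "running ignition checks for ${ID}-${VERSION_ID}"
fi
```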
Warning: This PR introduces changes in at least one manifest (when comparing PR HEAD dff5103 with the main merge-base 13fdf04). Please review the changes. The changes can be found in the artifacts of the `Manifest-diff` job [0] as `manifests.diff`.
[0] https://gitlab.com/redhat/services/products/image-builder/ci/osbuild-composer/-/jobs/3496180624/artifacts/browse
@thozza, @lavocatt, @ondrejbudai I see some removed packages but IDK if that's expected or an issue. Can you take a look?
> Warning: This PR introduces changes in at least one manifest (when comparing PR HEAD dff5103 with the main merge-base 13fdf04). Please review the changes. The changes can be found in the artifacts of the `Manifest-diff` job [0] as `manifests.diff`.
> [0] https://gitlab.com/redhat/services/products/image-builder/ci/osbuild-composer/-/jobs/3496180624/artifacts/browse
>
> @thozza, @lavocatt, @ondrejbudai I see some removed packages but IDK if that's expected or an issue. Can you take a look?
After I reviewed the changes in this PR, there should be no differences in image manifests, so the diff looks suspicious. I tried to run the `gen-manifests` tool on `main` and at first I got an almost identical diff, but for the `vmdk` image on `x86_64` and `rhel-8.7`. Rerunning the tool again produced no diff. So it seems that there is some issue with depsolving in the tool itself (maybe a race condition when running too many workers). I'm not sure what's happening there and would defer to @achilleas-k.
My suspicion is that re-running the `manifest-diff` job may actually produce no diff 🤔 So this seems like a general issue, not specific to this PR.
I also noticed that some CI jobs are failing on 9.2 images due to `image-info` not being able to inspect images. @lavocatt is working on it for the `manifest-db` repo, so we may need to make the same changes as part of this PR to fix the issue. Otherwise we can't merge it, because it would make CI always fail (on 9.2 and c9s).
Seeing a lot of errors in the log:
ERROR: Parser error at line:471 col:26
not well-formed (invalid token)
https://gitlab.com/redhat/services/products/image-builder/ci/osbuild-composer/-/jobs/3496180624
I haven't seen this one before, but I've seen dnf print errors that don't raise exceptions and that we don't catch through `dnf-json`.
Maybe we could catch stderr from dnf and look for errors to fail the depsolve job in such cases. We've talked to the dnf team a couple of times about similar things but never made any concrete decisions on it. Might be a good idea to catch these somehow, otherwise we could theoretically get this in prod, build an image off an incomplete manifest, and not realise it.
> Maybe we could catch stderr from dnf and look for errors to fail the depsolve job in such cases. We've talked to the dnf team a couple of times about similar things but never made any concrete decisions on it. Might be a good idea to catch these somehow, otherwise we could theoretically get this in prod, build an image off an incomplete manifest, and not realise it.
Sounds reasonable to me... Let's discuss that early next year. As you wrote, if such a thing happened in production, we would produce incomplete images without knowing about it 🤔
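To make the idea concrete, here is a minimal, hypothetical sketch of what "look at stderr and fail the depsolve" could mean; the `dnf-json` path, the request file name, and the error patterns are placeholders, not the actual worker implementation:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around dnf-json, illustrating the idea above.
# Paths, the request file name, and the grep patterns are made up.
set -euo pipefail

stderr_file="$(mktemp)"
trap 'rm -f "$stderr_file"' EXIT

# Run the depsolver while keeping its stderr for inspection.
if ! /usr/libexec/osbuild-composer/dnf-json < depsolve-request.json 2> "$stderr_file"; then
    cat "$stderr_file" >&2
    exit 1
fi

# Even when dnf-json exits 0, treat errors printed on stderr as a failure,
# so an incomplete depsolve cannot silently turn into an incomplete manifest.
if grep -Eq 'ERROR|Parser error' "$stderr_file"; then
    echo "dnf reported errors on stderr; failing the depsolve" >&2
    cat "$stderr_file" >&2
    exit 1
fi
```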
8.8 pipeline is PASS: https://gitlab.com/redhat/services/products/image-builder/ci/osbuild-composer/-/pipelines/737840282
9.2 pipeline is FAIL for OSTree raw image test: https://gitlab.com/redhat/services/products/image-builder/ci/osbuild-composer/-/pipelines/737840432
Script '01_update_platforms_check.sh' FAILURE (exit code '2')
I am also seeing failures with various OStree tests outside of the nightly pipeline (see statuses on this PR): https://gitlab.com/redhat/services/products/image-builder/ci/osbuild-composer/-/pipelines/737827291:
/usr/libexec/tests/osbuild-composer/ostree.sh: line 289: UPGRADE_PATH: unbound variable
ERROR internal error: qemu unexpectedly closed the monitor: 2023-01-04T12:16:28.063152Z qemu-kvm: cannot set up guest memory 'pc.ram': Cannot allocate memory
CC @henrywang ^^^
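As a side note on the first error, a hedged sketch of the usual way to handle an unbound variable under `set -u` in a bash test script; the variable name comes from the log above, but the surrounding code is illustrative and not the actual ostree.sh:

```bash
# Illustrative only - not the real ostree.sh code around line 289.
set -euo pipefail

# Option 1: give the variable an explicit (possibly empty) default so that
# `set -u` does not abort the script with "unbound variable".
UPGRADE_PATH="${UPGRADE_PATH:-}"

# Option 2: fail fast with a readable message when no upgrade path is
# defined for this distro/arch combination.
if [[ -z "${UPGRADE_PATH}" ]]; then
    echo "UPGRADE_PATH is not set for this distro/arch combination" >&2
    exit 1
fi
```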
- The connection issue is caused by issue https://github.com/fedora-iot/fido-device-onboard-rs/issues/374. It's already fixed; just re-running will work now. We have a discussion about this issue in the Slack thread https://coreos.slack.com/archives/C022TDCV3FH/p1672752884390489
- Can we run `ostree.sh` for the ostree nightly test? That'll use less memory. The downstream Edge nightly tests do not have this issue:
  - RHEL + CentOS Stream: https://github.com/virt-s1/rhel-edge/projects/1
  - Fedora 37/rawhide: https://github.com/virt-s1/rhel-edge/projects/2
Thanks!
> The connection issue is caused by issue *aio error after updating serde_yaml to 0.9 not caught by CI* (fedora-iot/fido-device-onboard-rs#374). It's fixed already. Just re-running will work now. We have a discussion about this issue in the Slack thread https://coreos.slack.com/archives/C022TDCV3FH/p1672752884390489
I rebased this PR to the latest `main` branch and retested today. The results are:
9.2 nightly pipeline - FAIL
OStree raw test, https://gitlab.com/redhat/services/products/image-builder/ci/osbuild-composer/-/jobs/3571891736, fails with:
🗳 Upgrade ostree image/commit
ssh: connect to host 192.168.100.51 port 22: No route to host
Pipeline started from this PR - FAIL, see https://gitlab.com/redhat/services/products/image-builder/ci/osbuild-composer/-/pipelines/741497704
- Rebase OStree BIOS and Rebase OStree UEFI fail on 8.8
- OStree simplified installer fails on 8.8 and 9.2
- New OStree failed on 8.8; the 9.2 job is still in progress
- OStree failed on 8.8; the 9.2 job is still in progress
> 2. Can we run `ostree.sh` for ostree nightly test?
@henrywang do you mean to execute `ostree.sh` instead of `ostree-raw-image.sh` for the nightly CI pipelines, in other words the ones qualifying internal RHEL builds?
In any case it failed above with
ERROR internal error: process exited while connecting to monitor: 2023-01-09T10:44:32.025550Z qemu-kvm: cannot set up guest memory 'pc.ram': Cannot allocate memory
so it doesn't look very reliable even if we swap which test script is being executed.
@atodorov The `ostree.sh` script covers the `edge-commit` image type, `ostree-ng.sh` covers the `edge-container` and `edge-installer` image types, `ostree-raw-image.sh` covers the `edge-raw-image` image type, and `ostree-simplified-installer.sh` covers the `edge-simplified-installer` image type. The reason I suggest `ostree.sh` is that it builds a tarball and uses less CPU and memory compared with the container image, ISO, and RAW image.
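Put another way, the mapping described above could be sketched like this; the dispatch function is made up for illustration, only the image type names and the `ostree.sh` path come from the thread, and the other script locations are assumed to sit next to it:

```bash
# Hypothetical dispatcher - the test scripts are real, the function is not.
run_edge_test() {
    case "$1" in
        edge-commit)                   /usr/libexec/tests/osbuild-composer/ostree.sh ;;
        edge-container|edge-installer) /usr/libexec/tests/osbuild-composer/ostree-ng.sh ;;
        edge-raw-image)                /usr/libexec/tests/osbuild-composer/ostree-raw-image.sh ;;
        edge-simplified-installer)     /usr/libexec/tests/osbuild-composer/ostree-simplified-installer.sh ;;
        *) echo "unknown image type: $1" >&2; return 1 ;;
    esac
}

# edge-commit (ostree.sh) is the lightest of the four, since it only builds
# a tarball rather than a container image, ISO, or raw disk image.
run_edge_test edge-commit
```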
The `ssh: connect to host 192.168.100.51 port 22: No route to host` issue should be related to the resource issue as well. The VM at that point is probably not in a running state; it might be paused or stopped.
@atodorov @jrusz Do you know the reason why the CI VMs cannot have more resources on PSI OpenStack? If there's a limitation per OpenStack project, can we request two or more projects? Thanks.
> @atodorov @jrusz Do you know the reason why the CI VMs cannot have more resources on PSI OpenStack? If there's a limitation per OpenStack project, can we request two or more projects? Thanks.
The OpenStack cluster is terribly over-subscribed and is maxed out from what I know. IIRC we had our resource quota increased early on but afterwards were denied a second quota increase.
The issue is also partly related to how current resource usage is calculated. For example, you can request more RAM, but that automatically increases the number of vCPUs used, which in turn decreases the number of active VMs you can have at any given time. And that low number in itself causes test jobs to queue for a long time and then be killed due to inactivity, resulting in a snowball effect.
You can try requesting a second project (or more), but I'm skeptical that it would be approved. Bear in mind, however, that our CI provisioning doesn't know how to work with two accounts in the same environment. We could probably work around that, though.
@henrywang is it possible for Virt QE's CI environment to download RPMs built in osbuild-composer PRs, run tests against them (the same suite that you use for RHEL nightly testing is fine), and report statuses back to the PR?
Your environment appears to be better suited for testing which requires nested virtualization, and if we can send and consume notifications and statuses between this GitHub repository and Virt QE's CI environment, we can give it a try.
Another option might be to onboard these tests to Testing Farm. We already have an MVP for Testing Farm planned for this quarter (see https://issues.redhat.com/browse/COMPOSER-1874, RH internal only, sorry). If it proves to be working well, we may want to start thinking of moving parts of the CI pipeline there.
GitHub Actions somehow got stuck on this, I had to force-push (same commit, just a new SHA).
There is just one downside: the ostree tests now take much longer, the `simplified-installer` one even 2 hours... I'll try to get back to having GCP runners as an option after I'm done with my current tasks, and then we could offload the testing there and use bigger machines.