
kola/switch-kernel: rpm-ostree fails to switch from Default to RT Kernel

Open zonggen opened this issue 5 years ago • 8 comments

Bug Report

Environment

What operating system is being used to run coreos-assembler?

Fedora 30

What operating system is being assembled?

RHCOS

Is coreos-assembler running in Podman or Docker?

Podman

If Podman, is coreos-assembler running privileged or unprivileged?

Privileged

Expected Behavior

The rpm-ostree command should successfully switch the kernel from the default kernel to the RT kernel: rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install kernel-rt-core-4.18.0-147.5.1.rt24.98.el8_1.x86_64.rpm --install kernel-rt-modules-4.18.0-147.5.1.rt24.98.el8_1.x86_64.rpm --install kernel-rt-modules-extra-4.18.0-147.5.1.rt24.98.el8_1.x86_64.rpm
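For anyone scripting the reproduction, the command above can be assembled from a list of RPM paths. This is only a sketch: the function name and the constant are made up for illustration, and the only rpm-ostree syntax used is what appears in the command in this report.

```python
import shlex

# Packages removed by the command quoted in this report.
PACKAGES_TO_REMOVE = ["kernel", "kernel-core", "kernel-modules", "kernel-modules-extra"]


def build_switch_to_rt_command(rt_rpm_paths):
    """Return the argv for the override command, one --install per RPM path.

    Hypothetical helper; it only builds the argument list and runs nothing.
    """
    argv = ["rpm-ostree", "override", "remove", *PACKAGES_TO_REMOVE]
    for path in rt_rpm_paths:
        argv += ["--install", path]
    return argv


if __name__ == "__main__":
    rpms = [
        "kernel-rt-core-4.18.0-147.5.1.rt24.98.el8_1.x86_64.rpm",
        "kernel-rt-modules-4.18.0-147.5.1.rt24.98.el8_1.x86_64.rpm",
        "kernel-rt-modules-extra-4.18.0-147.5.1.rt24.98.el8_1.x86_64.rpm",
    ]
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(build_switch_to_rt_command(rpms)))
```

Building the argv as a list keeps paths with spaces intact if it is later passed to subprocess.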

Actual Behavior

+ rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install ./kernel-rt/kernel-rt-core-4.18.0-147.5.1.rt24.98.el8_1.x86_64.rpm --install ./kernel-rt/kernel-rt-modules-4.18.0-147.5.1.rt24.98.el8_1.x86_64.rpm --install ./kernel-rt/kernel-rt-modules-m
Checking out tree ffd5b3c... done
Enabled rpm-md repositories: rhel8-baseos rhel8-appstream rhel8-rt
rpm-md repo 'rhel8-baseos' (cached); generated: 2020-02-27T15:31:54Z
rpm-md repo 'rhel8-appstream' (cached); generated: 2020-03-13T13:31:29Z
rpm-md repo 'rhel8-rt' (cached); generated: 2020-02-25T05:36:45Z
Importing rpm-md... done
Resolving dependencies... done
Applying 4 overrides and 4 overlays
Processing packages... done
Running pre scripts... done
Running post scripts... done
Running posttrans scripts... done
Writing rpmdb... done
error: Multiple subdirectories found in: usr/lib/modules

Reproduction Steps

  1. cosa kola switch-kernel -b rhcos --ignition-version v2 --kernel-rt ./kernel-rt
  2. ...

Other Information

Investigated a bit and found https://bugzilla.redhat.com/show_bug.cgi?id=1767215, which seems related. I tried manually running the above rpm-ostree command inside RHCOS and got the same behavior. The error message originates at https://github.com/coreos/rpm-ostree/blob/2ee48c51fede72f1f0394c070c0f35946f3e1839/src/libpriv/rpmostree-kernel.c#L141, which only triggers when the directory /usr/lib/modules contains more than one subdirectory. But again,

[core@master-2 ~]$ ll /usr/lib/modules
total 4
drwxr-xr-x. 7 root root 4096 Jan  1  1970 4.18.0-147.el8.x86_64

This error did not occur when https://github.com/coreos/coreos-assembler/pull/1218 got merged. Am I missing anything..?
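To make the condition concrete, here is a rough Python sketch of the check as described above: fail when the modules directory holds more than one subdirectory. It is not rpm-ostree's actual C code, and the function name is hypothetical. One assumption worth flagging: the relative path in the error (usr/lib/modules, with no leading slash) suggests the check may run against the tree being assembled rather than the booted system, so listing /usr/lib/modules on the host may not show what rpm-ostree sees.

```python
import os


def find_single_kernel_dir(modules_dir):
    """Return the name of the only subdirectory of modules_dir.

    Hypothetical helper mirroring the behavior described in this report:
    more than one subdirectory is treated as an error.
    """
    subdirs = sorted(
        entry.name
        for entry in os.scandir(modules_dir)
        if entry.is_dir(follow_symlinks=False)
    )
    if not subdirs:
        raise FileNotFoundError(f"No subdirectories found in: {modules_dir}")
    if len(subdirs) > 1:
        raise RuntimeError(f"Multiple subdirectories found in: {modules_dir}")
    return subdirs[0]


if __name__ == "__main__":
    # On a host with a single installed kernel this prints its version directory.
    print(find_single_kernel_dir("/usr/lib/modules"))
```

If the assumption holds, a leftover directory from the removed kernel sitting next to the kernel-rt one in the new tree would be enough to hit the error.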

zonggen avatar Mar 13 '20 21:03 zonggen

@jlebon @cgwalters This looks like an rpm-ostree issue at the core...the only thing that jumped out in a search over there was https://github.com/coreos/rpm-ostree/issues/1933

miabbott avatar Mar 16 '20 15:03 miabbott

Yup, agreed this is likely an rpm-ostree problem. Will look into this.

jlebon avatar Mar 16 '20 15:03 jlebon

Hmm actually I can't reproduce this locally on a fresh RHCOS build. Both running rpm-ostree override remove directly and via cosa kola switch-kernel.

What RHCOS are you testing this on?

jlebon avatar Mar 16 '20 20:03 jlebon

Did fresh builds on two different machines, and ran cosa kola switch-kernel inside the cosa container. Will try again tomorrow morning to see if it works.

zonggen avatar Mar 16 '20 21:03 zonggen

So I've updated src/config and the error message went away. Though the rpm-ostree commands are now running without issue, cosa kola switch-kernel will sometimes fail at the second stage (switching RT back to Default) with error message:

Error: failed switch kernel test: failed switching from RT to Default Kernel: failed to run uname -v | grep -qv 'PREEMPT RT': Process exited with status 1

This is the same failure observed in the Jenkins pipeline (https://jenkins-rhcos-art.cloud.privileged.psi.redhat.com/job/rhcos-art-rhcos-4.5/76/console).

Since the related error is now gone, should we close this issue?
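For readers puzzling over the exit status: grep -qv succeeds only if at least one input line does not match, so with the single line that uname -v prints, status 1 means the string 'PREEMPT RT' was still present, i.e. the machine was presumably still on the RT kernel when the check ran. A small Python sketch of that logic follows; the function names are made up, and substring matching stands in for grep's regex matching (equivalent here because the pattern has no special characters).

```python
import platform


def grep_qv_exit_status(text, pattern):
    """Emulate the exit status of `grep -qv pattern` for a plain substring.

    Returns 0 if at least one line does not contain the pattern, else 1.
    """
    lines = text.splitlines()
    return 0 if any(pattern not in line for line in lines) else 1


def looks_like_rt_kernel(version):
    """True if the kernel version string (as from `uname -v`) mentions PREEMPT RT."""
    return "PREEMPT RT" in version


if __name__ == "__main__":
    version = platform.uname().version  # same field that `uname -v` reports
    print("version:", version)
    print("RT kernel:", looks_like_rt_kernel(version))
    print("grep -qv status:", grep_qv_exit_status(version, "PREEMPT RT"))
```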

zonggen avatar Mar 17 '20 14:03 zonggen

Hmm yeah that's a different issue. No issues reusing this ticket if you'd prefer. Maybe try to run the same commands manually yourself until you hit the error? The kola SSH wrappers might be swallowing stderr.
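If it helps with the manual runs, a tiny wrapper that keeps stdout, stderr, and the exit status separate makes it easier to see what a test harness might be hiding. This is a generic sketch using Python's subprocess module, not anything kola provides; run_and_report is a hypothetical name.

```python
import subprocess
import sys


def run_and_report(argv):
    """Run argv and return (exit_status, stdout, stderr) without raising on failure."""
    proc = subprocess.run(argv, capture_output=True, text=True, check=False)
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Example: run whatever command is given on the command line.
    status, out, err = run_and_report(sys.argv[1:] or ["uname", "-v"])
    print(f"exit status: {status}")
    print(f"stdout: {out!r}")
    print(f"stderr: {err!r}")
```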

jlebon avatar Mar 17 '20 18:03 jlebon

This is a pretty old issue. Two things:

  1. We should delete kola switch-kernel and make this a regular kola test instead (ideally external).
  2. Another way to switch kernels now is via layering, though that test is currently also broken: https://github.com/openshift/os/issues/1383. Ideally, we need to fix that too since it's only going to be more relevant going forward. That said, it still makes sense to test the client-side rpm-ostree override replace flow since that's still what the MCO does today.

The challenge with (1) is that this requires some support on the kola side because we need access to the kernel-rt RPMs. Those RPMs are now shipped as part of the extensions container. We could have a kola test tag like extensions-container which will tell kola to copy the extensions container into the VM. One tricky bit there is that the extensions container is generated later in the pipeline, so it won't be available on the initial kola run we do. We'd have to add it near the kola testiso run instead, which happens after all artifacts are generated.

jlebon avatar Apr 29 '24 15:04 jlebon

Another way to switch kernels now is via layering, though that test is currently also broken: https://github.com/openshift/os/issues/1383. Ideally, we need to fix that too since it's only going to be more relevant going forward. That said, it still makes sense to test the client-side rpm-ostree override replace flow since that's still what the MCO does today.

Sorry, this is incorrect. https://github.com/openshift/os/issues/1383 doesn't use the layering flow, but also does it client-side.

The layering test lives in FCOS: https://github.com/coreos/fedora-coreos-config/blob/832c42ba3f406f88647621300aeecde30e9d14ef/tests/kola/rpm-ostree/kernel-replace. So then ideally, we generalize that test so it can work on both FCOS and SCOS/RHCOS.

jlebon avatar May 09 '24 16:05 jlebon

Let's close this one. The command was removed in https://github.com/coreos/coreos-assembler/pull/3825 in favour of external tests.

Relatedly, @c4rt0 is working on generalizing the existing layering test that we have in f-c-c: https://github.com/coreos/fedora-coreos-config/pull/3048

jlebon avatar Jul 15 '24 15:07 jlebon