tmt Support restart of test when it crashes

As discussed today, there's a use case for restarting a test when it crashes:

09:26:12                 out: :: [ 14:26:12 ] :: [   PASS   ] :: Command 'make all' (Expected 0, got 0)
09:26:12                 out: :: [ 14:26:12 ] :: [  BEGIN   ] :: Running 'echo 1 > /sys/kernel/vkm/write_um_crash'
09:26:12                 out: ./tmt-test-wrapper.sh.default-0: line 1:  6543 Segmentation fault      bash ./write_um.sh
09:26:12                 out: Shared connection to 10.26.28.203 closed.
09:26:12         Command returned '139'.

In this case, the user would like to see the test restarted - the test was killed by a kernel oops, and when restarted, it would take care of follow-up steps, like decoding the kernel dump.
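A restarted test needs some way to tell its first run from a later one. The following is a minimal sketch of that branching, assuming tmt exposes a restart counter to the test. The variable name `TMT_TEST_RESTART_COUNT` and the helper `select_phase` are assumptions made for illustration, not something confirmed in this thread.

```shell
#!/bin/bash
# Sketch only: TMT_TEST_RESTART_COUNT is an assumed name for the restart
# counter; adjust it to whatever tmt actually exports.

# Print which phase the test should run for a given restart count.
select_phase() {
    restarts="${1:-0}"
    if [ "$restarts" -eq 0 ]; then
        echo "trigger"      # first run: provoke the crash
    else
        echo "follow-up"    # restarted run: e.g. decode the kernel dump
    fi
}

phase="$(select_phase "${TMT_TEST_RESTART_COUNT:-0}")"
echo "phase: $phase"
```

Keeping the decision in a small function makes the two paths easy to exercise outside tmt by passing the count by hand.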

After some discussion, the proposal would be:

a test key to indicate the test shall be restarted when it crashes. Might be a list of exit codes, or tmt might define the list of crash-like exit codes, and this key would be a simple flag.
- https://github.com/teemtee/tmt/pull/2870
a test key to indicate how many times the test should be restarted. We need to avoid endless loops, and tmt should give up at some point. The default might be a zero, or a reasonably low value - the value would not be used unless the first key is enabled anyway.
- https://github.com/teemtee/tmt/pull/2870
a test key to indicate whether to reboot the guest before restarting the test. In this particular case, there should be no guest reboot, the test needs to re-enter the environment as it is.
- https://github.com/teemtee/tmt/pull/2870
a new environment variable, similar to TMT_REBOOT_COUNT, but counting test restarts. With reboot disabled, the test might run multiple times while TMT_REBOOT_COUNT remains zero.
- https://github.com/teemtee/tmt/pull/2787
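The first two keys together describe a small policy: restart only on a listed exit code, and only while a restart budget remains. A rough sketch of that policy follows. The function `should_restart` and its argument order are hypothetical and do not show how tmt implements it.

```shell
#!/bin/bash
# Sketch of the proposed policy, not tmt's implementation.
# Usage: should_restart EXIT_CODE RESTARTS_SO_FAR MAX_RESTARTS CODE...
# Returns 0 (restart) or 1 (give up).
should_restart() {
    exit_code="$1"
    restarts="$2"
    max="$3"
    shift 3
    # Give up once the restart budget is used, which prevents endless loops.
    [ "$restarts" -lt "$max" ] || return 1
    # Restart only when the exit code is one of the listed crash-like codes.
    for code in "$@"; do
        if [ "$exit_code" -eq "$code" ]; then
            return 0
        fi
    done
    return 1
}

# 139 is the exit code reported in the log at the top of this issue.
if should_restart 139 0 1 139 255; then
    echo "restart"
else
    echo "give up"
fi
```

With a maximum of zero the function never asks for a restart, which matches the idea that the count is irrelevant unless restarting is enabled.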

Feb 21 '24 15:02 happz

Hi @happz and @lukaszachy, I found a workaround for my case. With nohup, the crash no longer aborts the test, and the test continues past the error.

        # Read only crash test
        rlRun "nohup echo 1 > /sys/kernel/vkm/write_ro_crash" "0-255"
        while (! ping -q -c 1 ${SOC///*}); do
            sleep 5
        done
        rlRun "dmesg > dmesg-crash.log"
        rlAssertGrep "Unable to handle kernel write to read-only memory" dmesg-crash.log

result:

15:00:13                 out: :: [ 20:00:13 ] :: [  BEGIN   ] :: Running 'nohup echo 1 > /sys/kernel/vkm/write_ro_crash'
15:00:13                 out: /usr/share/beakerlib/testing.sh: line 896:  1467 Segmentation fault      nohup echo 1 > /sys/kernel/vkm/write_ro_crash
15:00:13                 out: :: [ 20:00:13 ] :: [   PASS   ] :: Command 'nohup echo 1 > /sys/kernel/vkm/write_ro_crash' (Expected 0-255, got 139)
15:00:13                 out: PING 10.26.28.203 (10.26.28.203) 56(84) bytes of data.
15:00:13                 out: 
15:00:13                 out: --- 10.26.28.203 ping statistics ---
15:00:13                 out: 1 packets transmitted, 1 received, 0% packet loss, time 0ms
15:00:13                 out: rtt min/avg/max/mdev = 0.046/0.046/0.046/0.000 ms
15:00:13                 out: :: [ 20:00:13 ] :: [  BEGIN   ] :: Running 'dmesg > dmesg-crash.log'
15:00:13                 out: :: [ 20:00:13 ] :: [   PASS   ] :: Command 'dmesg > dmesg-crash.log' (Expected 0, got 0)
15:00:13                 out: :: [ 20:00:13 ] :: [   PASS   ] :: File 'dmesg-crash.log' should contain 'Unable to handle kernel write to read-only memory'
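The ping loop in the workaround waits forever if the host never comes back. A bounded variant could look like the sketch below. The helper `wait_for` is hypothetical, and the address in the usage comment is a placeholder.

```shell
#!/bin/bash
# Sketch: retry a command until it succeeds or a timeout expires.
# Usage: wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARG...]
# Returns 0 as soon as the command succeeds, 1 when the timeout is reached.
wait_for() {
    timeout="$1"
    interval="$2"
    shift 2
    deadline=$(( $(date +%s) + timeout ))
    until "$@"; do
        # Stop retrying once the deadline has passed.
        [ "$(date +%s)" -lt "$deadline" ] || return 1
        sleep "$interval"
    done
    return 0
}

# Example with a placeholder address; replace it with the guest under test:
# wait_for 300 5 ping -q -c 1 192.0.2.10
```

Inside a beakerlib test the call could be wrapped in rlRun so that a timeout is reported as a failure instead of a hang.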

Feb 27 '24 20:02 sbertramrh

Hello, @happz and @lukaszachy

I wrote a test that forcibly performs a stack underflow within a kernel module. This causes a BUG and, with kernel.panic set to 5 seconds via sysctl, a subsequent restart:

[ 1748.996748] BUG: unable to handle page fault for address: ffffaa90401e8000
[ 1748.996751] #PF: supervisor read access in kernel mode
[ 1748.996752] #PF: error_code(0x0000) - not-present page
[ 1748.996753] PGD 1800067 P4D 1800067 PUD 1a0e067 PMD 1a18067 PTE 0
[ 1748.996759] Oops: 0000 [#1] PREEMPT_RT SMP NOPTI
[ 1748.996762] CPU: 3 PID: 50 Comm: ksoftirqd/3 Tainted: G OE X ------- --- 5.14.0-427.380.el9iv.x86_64 #1
[ 1748.996765] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc38 05/24/2023
[ 1748.996766] RIP: 0010:tasklet_fn+0x66/0x78 [stackman]
[ 1748.996770] Code: 75 02 eb fe 58 ff c8 75 fb eb 1a 48 c7 44 24 10 79 56 34 12 e8 a7 fe ff ff 48 c7 c7 f8 10 86 c0 e8 8e fe 75 f2 b8 00 00 01 00 <58> ff c8 75 fb 48 c7 c7 b6 10 86 c0 5b e9 77 fe 75 f2 90 90 90 90

I tried with 'rstrnt-prepare-reboot' before loading the module that causes the crash, but tmt disconnects, tries to rsync and times out.

I think this test and two other tests of memory violation handling within the kernel are cases in favor of implementing this feature.

Mar 11 '24 13:03 pablmart

@pablmart hello, could you share the test? I'd like to use it as a reproducer when working on the feature.

Mar 25 '24 14:03 happz

> @pablmart hello, could you share the test? I'd like to use it as a reproducer when working on the feature.

Yes, the test is in the same repo linked in the above comment that mentions 'rstrnt-prepare-reboot':

kernel-stack-overflow-udnerflow-scribbling

Mar 25 '24 16:03 pablmart

I encountered a similar problem when testing ftrace= kernel parameter with tmt run.

Tested manually with auto-osbuild-qemu-rhivos9-qa-ostree-aarch64-7874633.e1769674.qcow2.xz.

The available tracers are:

$ cat /sys/kernel/debug/tracing/available_tracers
timerlat osnoise hwlat blk function_graph wakeup_dl wakeup_rt wakeup function nop

Install a VM with the above image, then run:

export CMDLINEARGS="ftrace=timerlat"
rpm-ostree kargs --append-if-missing="${CMDLINEARGS##-}" --import-proc-cmdline
systemctl reboot

After that, the host can no longer be reached over SSH. Only "timerlat" and "osnoise" make the host panic.

Apr 10 '24 03:04 weiwang-linda

Kicking off the implementation of the actual test restart in https://github.com/teemtee/tmt/pull/2870. It does have some rough edges, although there is a test that passes.

I plan to run it with the kernel-stack-overflow-udnerflow-scribbling test provided by @pablmart, feel free to experiment too.

One piece we need to address ASAP - naming. I picked some names for new keys, but they are ugly and I don't like them. I can change them easily, but I'm out of ideas - feel free to propose changes here as well, besides the actual bugs and issues :)

Apr 17 '24 15:04 happz

A similar case: what if the test does not crash, but triggers a reboot on its own, e.g. through an Ansible role that cannot use tmt-reboot? This would manifest as a broken SSH session:

                out: TASK [sap_general_preconfigure : Flush handlers] *******************************
                out: 
                out: RUNNING HANDLER [sap_general_preconfigure : Reboot the managed node] ***********
                out: Shared connection to restqe01 closed.
            cmd: rsync --version
            err: ssh: connect to host restqe01 port 22: Connection refused
            cmd: dnf --version
            err: ssh: connect to host restqe01 port 22: Connection refused
            cmd: rpm-ostree --version
            err: ssh: connect to host restqe01 port 22: Connection refused
            cmd: yum install -y rsync
            err: ssh: connect to host restqe01 port 22: Connection refused

Apr 29 '24 12:04 happz

The MR 2870 solves the issue with the kernel-stack-overflow-underflow-scribbling test. Many thanks!

May 03 '24 16:05 pablmart

Hi @pablmart,

This is Coiby from the kernel debug sst. I'm considering adopting tmt for https://github.com/rhkdump/kdump-utils tests. In our tests, we need to trigger a kernel crash intentionally and then check whether the crash dump can be collected. I want to study your kernel-stack-overflow-underflow-scribbling test to learn how to make use of #2870, but it's gone now. Can you re-share it with me? Thanks!

Jul 18 '24 09:07 coiby

Hi @happz,

I wrote a multihost test that dumps a kernel crash to a remote NFS server, but unfortunately it failed with an error:

                        kdump: Starting kdump: [OK]
                        :: [ 10:08:20 ] :: [   PASS   ] :: Command 'kdumpctl restart' (Expected 0, got 0)
                        :: [ 10:08:20 ] :: [  BEGIN   ] :: Running 'echo 1 > /proc/sys/kernel/sysrq'
                        :: [ 10:08:20 ] :: [   PASS   ] :: Command 'echo 1 > /proc/sys/kernel/sysrq' (Expected 0, got 0)
                        client_loop: send disconnect: Broken pipe
                    00:00:28 errr /client-test/tests/client (on client) (beakerlib: State 'started') [1/1]
                    journal.txt: /var/tmp/tmt/run-035/plans/kdump/execute/data/guest/client/client-test/tests/client-4/journal.txt
            summary: 4 tests passed and 1 error

With restart-on-exit-code provided by #2870, I expected the test to be restarted after a kernel panic:

diff --git a/tests/client/main.fmf b/tests/client/main.fmf
index d74446f..261b8bb 100644
--- a/tests/client/main.fmf
+++ b/tests/client/main.fmf
@@ -1,3 +1,5 @@
 summary: Dump kernel crash to an NFS server
 test: ./test.sh
 framework: beakerlib
+restart-on-exit-code: 79
+restart-max-count: 5

But unfortunately it doesn't work. Am I missing something? Can you provide a clue? Thanks!

Jul 18 '24 12:07 coiby

Oh, I noticed the error comes from beakerlib:

00:00:28 errr /client-test/tests/client (on client) (beakerlib: State 'started') [1/1]
# or
00:00:28 errr /client-test/tests/client (on client) (beakerlib: State 'incomplete') [1/1]

After digging into the documentation and some trial and error, I eventually found that using tmt-reboot -c "echo c > /proc/sysrq-trigger" (instead of rlRun "echo c > /proc/sysrq-trigger") to trigger the kernel panic leads to green test results:

# tmt run tests discover provision -h virtual -c system prepare execute report finish 
/var/tmp/tmt/run-017

/plans/kdump
    discover
        how: fmf
        name: client-setup
        directory: /root/kdump-tests
        tests: /setup/kdump
        how: fmf
        name: server-setup
...
        execute task #4: server-test on server
        how: tmt

    
        summary: 4 tests executed
    report
        how: display
        summary: 4 tests passed
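For reference, the tmt-reboot approach described above can be sketched roughly as follows. This assumes TMT_REBOOT_COUNT is zero on the first pass and non-zero once the guest is back. The function name `crash_or_verify` and the `REBOOT_CMD` override are hypothetical and exist only so the flow can be dry-run outside tmt.

```shell
#!/bin/bash
# Sketch only: crash on the first pass, verify after the guest comes back.
# REBOOT_CMD is a hypothetical override for dry runs; inside a tmt test
# it defaults to tmt-reboot.
crash_or_verify() {
    if [ "${TMT_REBOOT_COUNT:-0}" -eq 0 ]; then
        # Hand the crashing command to tmt-reboot instead of rlRun,
        # so the lost connection is treated as a reboot.
        "${REBOOT_CMD:-tmt-reboot}" -c "echo c > /proc/sysrq-trigger"
    else
        # Placeholder for the post-crash checks, e.g. looking for the dump.
        echo "verify"
    fi
}
```

In a real test the placeholder branch would hold the assertions that the crash dump was collected.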

Jul 19 '24 06:07 coiby

@happz, isn't this one covered by the following two pull requests?

https://github.com/teemtee/tmt/pull/2787
https://github.com/teemtee/tmt/pull/2870

Mar 26 '25 23:03 psss

Indeed, one less issue to worry about \o/

Mar 27 '25 07:03 happz