Support restart of test when it crashes
As discussed today, there's a use case for restarting a test when it crashes:
09:26:12 out: :: [ 14:26:12 ] :: [ PASS ] :: Command 'make all' (Expected 0, got 0)
09:26:12 out: :: [ 14:26:12 ] :: [ BEGIN ] :: Running 'echo 1 > /sys/kernel/vkm/write_um_crash'
09:26:12 out: ./tmt-test-wrapper.sh.default-0: line 1: 6543 Segmentation fault bash ./write_um.sh
09:26:12 out: Shared connection to 10.26.28.203 closed.
09:26:12 Command returned '139'.
In this case, the user would like to see the test restarted - the test was killed by a kernel oops, and when restarted, it would take care of follow-up steps, like decoding the kernel dump.
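For context on the log above: a shell reports a process killed by a signal as 128 plus the signal number, so 139 is 128 + 11, a segmentation fault. A minimal sketch of decoding such a status follows; the helper name `describe_exit` is made up for illustration and is not part of tmt or beakerlib.

```shell
#!/bin/bash
# Hypothetical helper (not part of tmt or beakerlib): describe an exit status.
# A shell reports "killed by signal N" as status 128 + N.
describe_exit() {
    local status="$1"
    if [ "$status" -gt 128 ] && [ "$status" -le 192 ]; then
        # `kill -l N` prints the name of signal number N, without the SIG prefix.
        echo "signal $(kill -l "$((status - 128))")"
    else
        echo "exit $status"
    fi
}

describe_exit 139
describe_exit 0
```

A list of "crash-like" exit codes, as floated in the proposal, would presumably be built from statuses in this 128 + N range.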
After some discussion, the proposal would be:
- a test key to indicate the test shall be restarted when it crashes. Might be a list of exit codes, or tmt might define the list of crash-like exit codes, and this key would be a simple flag.
- https://github.com/teemtee/tmt/pull/2870
- a test key to indicate how many times the test should be restarted. We need to avoid endless loops, so tmt should give up at some point. The default might be zero, or a reasonably low value - the value would not be used unless the first key is enabled anyway.
- https://github.com/teemtee/tmt/pull/2870
- a test key to indicate whether to reboot the guest before restarting the test. In this particular case, there should be no guest reboot, the test needs to re-enter the environment as it is.
- https://github.com/teemtee/tmt/pull/2870
- a new environment variable, similar to TMT_REBOOT_COUNT, but counting test restarts. With reboot disabled, the test might run multiple times while TMT_REBOOT_COUNT remains zero.
- https://github.com/teemtee/tmt/pull/2787
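To make the proposal concrete, here is a rough sketch of the decision the first two keys imply: restart only when the exit code is on the crash list and the restart budget is not yet spent. This is not tmt's implementation; the function name and argument order are assumptions chosen for illustration.

```shell
#!/bin/bash
# Hypothetical helper, not tmt code: decide whether a finished test should be restarted.
# Arguments: exit code, restarts so far, maximum restarts, then the exit codes
# that count as crashes.
should_restart() {
    local exit_code="$1" restarts="$2" max_restarts="$3"
    shift 3
    # Give up once the restart budget is spent; this is what prevents endless loops.
    [ "$restarts" -lt "$max_restarts" ] || return 1
    local code
    for code in "$@"; do
        [ "$exit_code" -eq "$code" ] && return 0
    done
    # The exit code is not on the crash list.
    return 1
}

# Example: first crash with status 139, one restart allowed.
if should_restart 139 0 1 139 255; then
    echo "restart"
fi
```

With a maximum of zero the function never approves a restart, which matches the idea that the count has no effect unless restarting is enabled.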
Hi @happz and @lukaszachy, I found a workaround for my case. With nohup, the crash no longer aborts the test, and the test continues past the error.
# Read only crash test
rlRun "nohup echo 1 > /sys/kernel/vkm/write_ro_crash" "0-255"
while (! ping -q -c 1 ${SOC///*}); do
sleep 5
done
rlRun "dmesg > dmesg-crash.log"
rlAssertGrep "Unable to handle kernel write to read-only memory" dmesg-crash.log
result:
15:00:13 out: :: [ 20:00:13 ] :: [ BEGIN ] :: Running 'nohup echo 1 > /sys/kernel/vkm/write_ro_crash'
15:00:13 out: /usr/share/beakerlib/testing.sh: line 896: 1467 Segmentation fault nohup echo 1 > /sys/kernel/vkm/write_ro_crash
15:00:13 out: :: [ 20:00:13 ] :: [ PASS ] :: Command 'nohup echo 1 > /sys/kernel/vkm/write_ro_crash' (Expected 0-255, got 139)
15:00:13 out: PING 10.26.28.203 (10.26.28.203) 56(84) bytes of data.
15:00:13 out:
15:00:13 out: --- 10.26.28.203 ping statistics ---
15:00:13 out: 1 packets transmitted, 1 received, 0% packet loss, time 0ms
15:00:13 out: rtt min/avg/max/mdev = 0.046/0.046/0.046/0.000 ms
15:00:13 out: :: [ 20:00:13 ] :: [ BEGIN ] :: Running 'dmesg > dmesg-crash.log'
15:00:13 out: :: [ 20:00:13 ] :: [ PASS ] :: Command 'dmesg > dmesg-crash.log' (Expected 0, got 0)
15:00:13 out: :: [ 20:00:13 ] :: [ PASS ] :: File 'dmesg-crash.log' should contain 'Unable to handle kernel write to read-only memory'
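One caveat with the workaround: its `while` loop waits forever if the target never answers ping. A bounded variant could look like the sketch below; `wait_until` is a made-up helper name, and the attempt count and delay in the usage comment are arbitrary.

```shell
#!/bin/bash
# Hypothetical helper: retry a probe command until it succeeds or the attempt
# budget runs out.
# Arguments: maximum attempts, delay in seconds between attempts, then the probe command.
wait_until() {
    local attempts="$1" delay="$2"
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        # Stop as soon as the probe succeeds.
        "$@" >/dev/null 2>&1 && return 0
        # Do not sleep after the final failed attempt.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Possible use in the test above (60 attempts, 5 seconds apart):
#   wait_until 60 5 ping -q -c 1 "${SOC///*}" || rlFail "target did not answer ping"
```

The probe's output is discarded, so the ping statistics would no longer show up in the test log; drop the redirection if they are wanted.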
Hello, @happz and @lukaszachy
I wrote a test that forcibly performs a stack underflow within a kernel module. This causes a BUG and a subsequent restart, after setting kernel.panic to 5 seconds with sysctl:
[ 1748.996748] BUG: unable to handle page fault for address: ffffaa90401e8000
[ 1748.996751] #PF: supervisor read access in kernel mode
[ 1748.996752] #PF: error_code(0x0000) - not-present page
[ 1748.996753] PGD 1800067 P4D 1800067 PUD 1a0e067 PMD 1a18067 PTE 0
[ 1748.996759] Oops: 0000 [#1] PREEMPT_RT SMP NOPTI
[ 1748.996762] CPU: 3 PID: 50 Comm: ksoftirqd/3 Tainted: G OE X ------- --- 5.14.0-427.380.el9iv.x86_64 #1
[ 1748.996765] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc38 05/24/2023
[ 1748.996766] RIP: 0010:tasklet_fn+0x66/0x78 [stackman]
[ 1748.996770] Code: 75 02 eb fe 58 ff c8 75 fb eb 1a 48 c7 44 24 10 79 56 34 12 e8 a7 fe ff ff 48 c7 c7 f8 10 86 c0 e8 8e fe 75 f2 b8 00 00 01 00 <58> ff c8 75 fb 48 c7 c7 b6 10 86 c0 5b e9 77 fe 75 f2 90 90 90 90
I tried with 'rstrnt-prepare-reboot' before loading the module that causes the crash, but tmt disconnects, tries to rsync and times out.
I think this one and other two tests for testing memory violation handling within the kernel are cases in favor of implementing this feature.
@pablmart hello, could you share the test? I'd like to use it as a reproducer when working on the feature.
Yes, the test is in the same repo linked in the above comment mentioning 'rstrnt-prepare-reboot':
I encountered a similar problem when testing ftrace=
Tested manually with auto-osbuild-qemu-rhivos9-qa-ostree-aarch64-7874633.e1769674.qcow2.xz.
The available tracers are:
$ cat /sys/kernel/debug/tracing/available_tracers
timerlat osnoise hwlat blk function_graph wakeup_dl wakeup_rt wakeup function nop
- Install a VM with the above image
- export CMDLINEARGS="ftrace=timerlat"
- rpm-ostree kargs --append-if-missing="${CMDLINEARGS##-}" --import-proc-cmdline
- systemctl reboot
Then the host can no longer be reached over SSH. Only "timerlat" and "osnoise" make the host panic.
Kicking off the implementation of the actual test restart in https://github.com/teemtee/tmt/pull/2870. It does have some rough edges, although there is a test that passes.
I plan to run it with the kernel-stack-overflow-underflow-scribbling test provided by @pablmart, feel free to experiment too.
One piece we need to address ASAP - naming. I picked some names for new keys, but they are ugly and I don't like them. I can change them easily, but I'm out of ideas - feel free to propose changes here as well, besides the actual bugs and issues :)
A similar case: what if the test does not crash, but triggers a reboot by other means, e.g. through an Ansible role that cannot use tmt-reboot? This would manifest as a broken SSH session:
out: TASK [sap_general_preconfigure : Flush handlers] *******************************
out:
out: RUNNING HANDLER [sap_general_preconfigure : Reboot the managed node] ***********
out: Shared connection to restqe01 closed.
cmd: rsync --version
err: ssh: connect to host restqe01 port 22: Connection refused
cmd: dnf --version
err: ssh: connect to host restqe01 port 22: Connection refused
cmd: rpm-ostree --version
err: ssh: connect to host restqe01 port 22: Connection refused
cmd: yum install -y rsync
err: ssh: connect to host restqe01 port 22: Connection refused
The MR 2870 solves the issue with the kernel-stack-overflow-underflow-scribbling test. Many thanks!
Hi @pablmart,
This is Coiby from the kernel debug sst. I'm considering adopting tmt for the https://github.com/rhkdump/kdump-utils tests. In our tests, we need to trigger a kernel crash intentionally and then check whether the crash dump can be collected. I wanted to study your kernel-stack-overflow-underflow-scribbling test to learn how to make use of #2870, but it's gone now. Could you re-share it with me? Thanks!
Hi @happz,
I wrote a multihost test that dumps a kernel crash to a remote NFS server, but unfortunately it failed with an error:
kdump: Starting kdump: [OK]
:: [ 10:08:20 ] :: [ PASS ] :: Command 'kdumpctl restart' (Expected 0, got 0)
:: [ 10:08:20 ] :: [ BEGIN ] :: Running 'echo 1 > /proc/sys/kernel/sysrq'
:: [ 10:08:20 ] :: [ PASS ] :: Command 'echo 1 > /proc/sys/kernel/sysrq' (Expected 0, got 0)
client_loop: send disconnect: Broken pipe
00:00:28 errr /client-test/tests/client (on client) (beakerlib: State 'started') [1/1]
journal.txt: /var/tmp/tmt/run-035/plans/kdump/execute/data/guest/client/client-test/tests/client-4/journal.txt
summary: 4 tests passed and 1 error
With restart-on-exit-code provided by #2870, I expected the test to be restarted after a kernel panic:
diff --git a/tests/client/main.fmf b/tests/client/main.fmf
index d74446f..261b8bb 100644
--- a/tests/client/main.fmf
+++ b/tests/client/main.fmf
@@ -1,3 +1,5 @@
summary: Dump kernel crash to an NFS server
test: ./test.sh
framework: beakerlib
+restart-on-exit-code: 79
+restart-max-count: 5
But unfortunately it doesn't work. Am I missing something? Can you provide a clue? Thanks!
Oh, I noticed the error comes from beakerlib:
00:00:28 errr /client-test/tests/client (on client) (beakerlib: State 'started') [1/1]
# or
00:00:28 errr /client-test/tests/client (on client) (beakerlib: State 'incomplete') [1/1]
After digging into the documentation and some trial and error, I eventually found that using tmt-reboot -c "echo c > /proc/sysrq-trigger" (instead of rlRun "echo c > /proc/sysrq-trigger") to trigger the kernel panic leads to green test results:
# tmt run tests discover provision -h virtual -c system prepare execute report finish
/var/tmp/tmt/run-017
/plans/kdump
discover
how: fmf
name: client-setup
directory: /root/kdump-tests
tests: /setup/kdump
how: fmf
name: server-setup
...
execute task #4: server-test on server
how: tmt
summary: 4 tests executed
report
how: display
summary: 4 tests passed
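For readers adapting this approach, a test that panics the kernel through tmt-reboot typically needs to behave differently on its second pass. The sketch below branches on TMT_REBOOT_COUNT; the function names, the phase labels, and the /var/crash dump location are assumptions for illustration, not taken from the test discussed here.

```shell
#!/bin/bash
# Hypothetical phase selection: pick what to do based on how many times the
# guest has already rebooted during this test.
select_phase() {
    local count="${1:-0}"
    if [ "$count" -eq 0 ]; then
        echo "trigger-crash"
    else
        echo "verify-dump"
    fi
}

# Run the selected phase. Nothing here runs until a test calls run_phase.
run_phase() {
    # Assumption: dumps land under /var/crash; adjust to the kdump target in use.
    local crash_dir="${CRASH_DIR:-/var/crash}"
    case "$(select_phase "${TMT_REBOOT_COUNT:-0}")" in
        trigger-crash)
            # Let tmt know a reboot is coming, then panic the kernel.
            tmt-reboot -c "echo c > /proc/sysrq-trigger"
            ;;
        verify-dump)
            # Succeeds only if the crash directory exists and is not empty.
            [ -n "$(ls -A "$crash_dir" 2>/dev/null)" ]
            ;;
    esac
}
```

Keeping the decision in `select_phase` separate from the side effects makes the branching easy to check without actually crashing a machine.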
@happz, isn't this one covered by the following two pull requests?
- https://github.com/teemtee/tmt/pull/2787
- https://github.com/teemtee/tmt/pull/2870
Indeed, one less issue to worry about \o/