Driver tests fail on SLES
Which jobs are failing:
drivers_test regardless of the operation mode (-k, -m, -b)
Which test(s) are failing:
GenericTracepoints.sched_proc_exit_reaper_in_the_same_group -> drivers_test exits without printing the summary
Other tests using clone3 also print errors/warnings, but they do not cause the test program to bail out:
SyscallExit.clone3X_create_child_with_2_threads SyscallExit.clone3X_child_clone_parent_flag SyscallExit.clone3X_child_new_namespace_from_child SyscallExit.clone3X_child_new_namespace_from_caller SyscallExit.clone3X_child_new_namespace_create_thread
GenericTracepoints.sched_proc_exit_prctl_subreaper GenericTracepoints.sched_proc_exit_child_namespace_reaper GenericTracepoints.sched_proc_exit_child_namespace_reaper_die
Since when has it been failing:
n/a
Test link:
locally on SLES 15 SP4
Reason for failure:
./test/drivers/test_suites/generic_tracepoints_suite/sched_process_exit.cpp has the following call to clone3 in child_func():
pid_t p2_t1_pid = syscall(__NR_clone3, &cl_args_child, sizeof(cl_args_child));
which fails with EINVAL on SLES 15 SP4:
$ sudo strace -ff ./test/drivers/drivers_test -m
...
clone(child_stack=0x1c6a3a0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|SIGCHLDstrace: Process 19773 attached
) = 19773
[pid 19738] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2, tv_nsec=0}, <unfinished ...>
[pid 19773] clone3({flags=0, exit_signal=0, stack=NULL, stack_size=0, set_tid=[57006], set_tid_size=1}, 88) = -1 EINVAL (Invalid argument)
[pid 19773] exit_group(1) = ?
[pid 19738] <... clock_nanosleep resumed> <unfinished ...>) = ?
[pid 19773] +++ exited with 1 +++
...
For comparison, on e.g. Ubuntu 24.04.2 LTS (and 22.04, 20.04) there are no problems:
...
clone(child_stack=0x5b75991be370, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|SIGCHLDstrace: Process 248053 attached
) = 248053
[pid 248018] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2, tv_nsec=0}, <unfinished ...>
[pid 248053] clone3({flags=0, exit_signal=0, stack=NULL, stack_size=0, set_tid=[57006], set_tid_size=1}, 88strace: Process 57006 attached
) = 57006
[pid 248053] exit(0)
[pid 57006] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, <unfinished ...>
[pid 248053] +++ exited with 0 +++
[pid 57006] <... clock_nanosleep resumed>0x5b75991be2c0) = 0
[pid 57006] exit(0) = ?
[pid 57006] +++ exited with 0 +++
<... clock_nanosleep resumed>0x7fff03803dc0) = 0
kill(248053, SIGTERM) = -1 ESRCH (No such process)
...
Anything else we need to know:
/kind failing-test
I wrote a small test program that basically replicates the functionality of the referenced test, and I get the same result.
but if I eliminate:
pid_t p2_t1 = 57006;
cl_args_child.set_tid = (uint64_t)&p2_t1;
cl_args_child.set_tid_size = 1;
i.e. without setting the tid, the program works. I guess you have a special need to set the tid value, right?
Apparently SLES indeed does not like the value used, and a quick hack makes the test pass:
diff --git a/test/drivers/test_suites/generic_tracepoints_suite/sched_process_exit.cpp b/test/drivers/test_suites/generic_tracepoints_suite/sched_process_exit.cpp
index 6d86442f2..beaee8224 100644
--- a/test/drivers/test_suites/generic_tracepoints_suite/sched_process_exit.cpp
+++ b/test/drivers/test_suites/generic_tracepoints_suite/sched_process_exit.cpp
@@ -384,7 +384,7 @@ TEST(GenericTracepoints, sched_proc_exit_child_namespace_reaper_die) {
#ifdef __NR_kill
static int child_func(void* arg) {
- pid_t p2_t1 = 57006;
+ pid_t p2_t1 = getpid() + 2;
clone_args cl_args_child = {};
cl_args_child.set_tid = (uint64_t)&p2_t1;
cl_args_child.set_tid_size = 1;
However, I'm not sure this is the correct way, and a similar kind of "fix" would need to be applied to all places where syscall(__NR_clone3...) is used in the test suite.
I guess the problem here is related to SLES 15 SP4 limiting the tid range. I notice that the tids are set to values greater than 32768... According to this article (which is about SP3, but it should be the same for SP4), and on Linux in general, there are multiple ways of configuring these limits. Could you please check those limits on your machine and/or quickly try replacing the values with something under 32768?
I can confirm that it explains the above behavior on my vanilla SLES, thanks!
$ cat /proc/sys/kernel/pid_max
32768
and if I change the value to something under 32768 in sched_proc_exit_reaper_in_the_same_group, the test passes.
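The check above can also be done from code. The following is only a sketch: the function names parse_pid_max, read_pid_max and tid_fits are made up for illustration and are not part of the test suite. It assumes, consistent with the strace output above, that a tid requested through set_tid has to be at least 1 and strictly below pid_max; other conditions (privileges, the tid already being in use) are not covered.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <sys/types.h>

// Parse the single integer stored in /proc/sys/kernel/pid_max.
// Returns -1 if the stream does not start with a positive number.
long parse_pid_max(std::istream& in) {
    long value = 0;
    if (!(in >> value) || value <= 0) {
        return -1;
    }
    return value;
}

// Read pid_max from procfs; returns -1 if the file cannot be read.
long read_pid_max(const char* path = "/proc/sys/kernel/pid_max") {
    std::ifstream f(path);
    if (!f) {
        return -1;
    }
    return parse_pid_max(f);
}

// Assumption: a tid passed via set_tid must lie in [1, pid_max).
bool tid_fits(pid_t tid, long pid_max) {
    return tid >= 1 && tid < pid_max;
}
```

With pid_max at 32768, as on the SLES machine, tid_fits(57006, read_pid_max()) would be false, which matches the EINVAL seen in the trace.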
I could open a PR if you think it is okay to just use lower values (which I think it should be).
So far I have seen values being used in test/drivers/test_suites/syscall_exit_suite/clone3_x.cpp, test/drivers/test_suites/generic_tracepoints_suite/sched_process_exit.cpp and ./test/drivers/test_suites/generic_tracepoints_suite/sched_process_fork.cpp. However, I don't see errors from sched_process_fork but I suspect that it is not even invoked during the tests: if I skim through the test summary, I don't see any sched_process_fork tests.
I don't see any problem with lowering them, but maybe there is some reason why they are set to those specific values. An alternative would be to programmatically lower those values after fetching the PID_MAX value... @leogr @FedeDP WDYT?
I agree, we could write a small get_test_pid helper that does exactly that, and we should be good to go.
Or, we could increase the PID_MAX value before running the test and set it back afterwards. But I'd prefer the former solution, which does not touch any limit.
Yes, I think we can proceed with this.
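For reference, a rough sketch of what such a get_test_pid helper could look like. Only the helper's name comes from this thread; the signature, the kFloor constant and the wrap-around mapping are assumptions. The pid_max argument is expected to come from /proc/sys/kernel/pid_max. The mapping does not guarantee that the returned tid is free, and two different preferred values can collide after wrapping.

```cpp
#include <cassert>
#include <sys/types.h>

// Hypothetical helper: keep the preferred tid if the kernel limit allows it,
// otherwise fold it into the range [kFloor, pid_max).
pid_t get_test_pid(pid_t preferred, long pid_max) {
    // Assumption: stay away from low pids, which are likely to be in use.
    const long kFloor = 1024;
    assert(preferred > 0);
    assert(pid_max > kFloor);

    if (preferred >= kFloor && preferred < pid_max) {
        return preferred;  // already valid on this machine
    }
    const long span = pid_max - kFloor;
    return static_cast<pid_t>(kFloor + (preferred % span));
}
```

On a machine with a large pid_max the tests would keep their current values (57006 stays 57006), while with pid_max = 32768 the same call would yield a value below the limit.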
A quick update: just wanted to confirm (as expected) that a quick replace of "6XXXX" and "5XXXX" values with "2XXXX" worked fine in all tests mentioned above.
Some other issue (or issues) is still causing SyscallExit.recvmmsgX_ipv4_tcp_multiple_messages, SyscallExit.sendmmsg_multiple_messages_ipv4, SyscallExit.sendmmsg_multiple_messages_ipv6 and GenericTracepoints.page_fault_kernel to fail on SLES, though. I need to investigate more.
I'll open two new issues for those failing tests, because the cause is different from the one in this issue. I group tests based on the reason for failure.
https://github.com/falcosecurity/libs/issues/2412 https://github.com/falcosecurity/libs/issues/2413
Closing this issue because both linked issues are now solved.