
Driver's tests fail on SLES

Open terror96 opened this issue 6 months ago • 11 comments

Which jobs are failing:

drivers_test regardless of the operation mode (-k, -m, -b)

Which test(s) are failing:

GenericTracepoints.sched_proc_exit_reaper_in_the_same_group -> drivers_test exits without printing the summary

Other tests using clone3 also print errors/warnings, but they do not make the test program bail out:

SyscallExit.clone3X_create_child_with_2_threads
SyscallExit.clone3X_child_clone_parent_flag
SyscallExit.clone3X_child_new_namespace_from_child
SyscallExit.clone3X_child_new_namespace_from_caller
SyscallExit.clone3X_child_new_namespace_create_thread

GenericTracepoints.sched_proc_exit_prctl_subreaper
GenericTracepoints.sched_proc_exit_child_namespace_reaper
GenericTracepoints.sched_proc_exit_child_namespace_reaper_die

Since when has it been failing:

n/a

Test link:

locally on SLES 15 SP4

Reason for failure:

./test/drivers/test_suites/generic_tracepoints_suite/sched_process_exit.cpp has the following call to clone3 in child_func():

pid_t p2_t1_pid = syscall(__NR_clone3, &cl_args_child, sizeof(cl_args_child));

which fails with EINVAL on SLES 15 SP4:

    $ sudo strace -ff ./test/drivers/drivers_test -m
    ...
    clone(child_stack=0x1c6a3a0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|SIGCHLDstrace: Process 19773 attached
    ) = 19773
    [pid 19738] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2, tv_nsec=0}, <unfinished ...>
    [pid 19773] clone3({flags=0, exit_signal=0, stack=NULL, stack_size=0, set_tid=[57006], set_tid_size=1}, 88) = -1 EINVAL (Invalid argument)
    [pid 19773] exit_group(1) = ?
    [pid 19738] <... clock_nanosleep resumed> <unfinished ...>) = ?
    [pid 19773] +++ exited with 1 +++
    ...

For comparison, there are no problems on e.g. Ubuntu 24.04.2 LTS (or 22.04, 20.04):

    ...
    clone(child_stack=0x5b75991be370, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|SIGCHLDstrace: Process 248053 attached
    ) = 248053
    [pid 248018] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=2, tv_nsec=0}, <unfinished ...>
    [pid 248053] clone3({flags=0, exit_signal=0, stack=NULL, stack_size=0, set_tid=[57006], set_tid_size=1}, 88strace: Process 57006 attached
    ) = 57006
    [pid 248053] exit(0)
    [pid 57006] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, <unfinished ...>
    [pid 248053] +++ exited with 0 +++
    [pid 57006] <... clock_nanosleep resumed>0x5b75991be2c0) = 0
    [pid 57006] exit(0) = ?
    [pid 57006] +++ exited with 0 +++
    <... clock_nanosleep resumed>0x7fff03803dc0) = 0
    kill(248053, SIGTERM) = -1 ESRCH (No such process)
    ...

Anything else we need to know:

terror96 avatar May 16 '25 12:05 terror96

/kind failing-test

terror96 avatar May 16 '25 12:05 terror96

I wrote a small test program that reproduces the same functionality as the referenced test, and I get the same result.

But if I remove:

    pid_t p2_t1 = 57006;
    cl_args_child.set_tid = (uint64_t)&p2_t1;
    cl_args_child.set_tid_size = 1;

i.e. without setting the TID, the program works. I guess you have a specific need to set the TID value, right?
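For readers who want to try this themselves, a minimal sketch of such a repro could look like the following. It is not the program referred to above: the function name `try_clone3_with_tid` is hypothetical, and it assumes kernel headers recent enough for `struct clone_args` to carry `set_tid` and `set_tid_size`. Requesting a specific TID also needs privileges according to clone3(2), so it would be run as root, like the strace session above.

```cpp
#include <linux/sched.h>  // struct clone_args
#include <sys/syscall.h>  // __NR_clone3
#include <sys/wait.h>     // waitpid
#include <unistd.h>       // syscall, _exit

#include <cerrno>
#include <csignal>
#include <cstdint>

// Hypothetical helper: ask clone3 for a child whose TID is `wanted_tid`.
// Returns 0 if the child was created, otherwise the errno set by the kernel.
int try_clone3_with_tid(pid_t wanted_tid) {
    clone_args args = {};
    args.exit_signal = SIGCHLD;  // lets the parent reap the child with waitpid
    args.set_tid = reinterpret_cast<uint64_t>(&wanted_tid);
    args.set_tid_size = 1;

    long ret = syscall(__NR_clone3, &args, sizeof(args));
    if (ret == -1) {
        return errno;  // e.g. EINVAL, as in the SLES strace output
    }
    if (ret == 0) {
        _exit(0);  // child: nothing to do
    }
    waitpid(static_cast<pid_t>(ret), nullptr, 0);  // parent: reap the child
    return 0;
}
```

Comparing the result of `try_clone3_with_tid(57006)` on the two distributions should show whether the requested TID itself is what the kernel rejects.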

terror96 avatar May 16 '25 12:05 terror96

Apparently SLES indeed does not like the value used, and a quick hack makes the test pass:

diff --git a/test/drivers/test_suites/generic_tracepoints_suite/sched_process_exit.cpp b/test/drivers/test_suites/generic_tracepoints_suite/sched_process_exit.cpp
index 6d86442f2..beaee8224 100644
--- a/test/drivers/test_suites/generic_tracepoints_suite/sched_process_exit.cpp
+++ b/test/drivers/test_suites/generic_tracepoints_suite/sched_process_exit.cpp
@@ -384,7 +384,7 @@ TEST(GenericTracepoints, sched_proc_exit_child_namespace_reaper_die) {
 
 #ifdef __NR_kill
 static int child_func(void* arg) {
-       pid_t p2_t1 = 57006;
+       pid_t p2_t1 = getpid() + 2;
        clone_args cl_args_child = {};
        cl_args_child.set_tid = (uint64_t)&p2_t1;
        cl_args_child.set_tid_size = 1;

However, I'm not sure this is the correct approach, and a similar "fix" would need to be applied everywhere syscall(__NR_clone3...) is used in the test suite.

terror96 avatar May 16 '25 13:05 terror96

I guess the problem here is related to SLES 15 SP4 limiting the TID range. I'm noticing that the TIDs are set to values greater than 32768... According to this article (which is about SP3, but it should be the same for SP4) and, in general, on Linux, there are multiple ways of configuring these limits. Could you please check those limits on your machine and/or quickly try replacing the values with something under 32768?
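As a rough illustration of the check being asked for, the sketch below reads a limit file such as /proc/sys/kernel/pid_max and tests whether a hard-coded TID fits under it. The function names `read_limit` and `tid_fits` are made up for this example, and the rule it encodes (a requested TID must be at least 1 and strictly below pid_max) is my understanding of the kernel's behavior rather than something taken from the test suite.

```cpp
#include <fstream>
#include <string>

// Read a single numeric value from a file such as /proc/sys/kernel/pid_max.
// Returns -1 if the file cannot be opened or parsed.
long read_limit(const std::string& path) {
    std::ifstream in(path);
    long value = -1;
    if (!(in >> value)) {
        return -1;
    }
    return value;
}

// Assumed rule: a requested TID is usable only if it lies in [1, pid_max).
bool tid_fits(long tid, long pid_max) {
    return tid >= 1 && tid < pid_max;
}
```

With these, `tid_fits(57006, read_limit("/proc/sys/kernel/pid_max"))` would be false on a host whose pid_max is 32768, while a value such as 20000 would pass.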

ekoops avatar May 19 '25 07:05 ekoops

I can confirm that it explains the above behavior on my vanilla SLES, thanks!

    $ cat /proc/sys/kernel/pid_max
    32768

and if I change the value to something under 32768 in sched_proc_exit_reaper_in_the_same_group, the test passes.

I could do a PR if you think it is okay to just use lower values (which I think it should be).

So far I have seen such hard-coded values used in test/drivers/test_suites/syscall_exit_suite/clone3_x.cpp, test/drivers/test_suites/generic_tracepoints_suite/sched_process_exit.cpp and ./test/drivers/test_suites/generic_tracepoints_suite/sched_process_fork.cpp. However, I don't see errors from sched_process_fork, and I suspect that it is not even invoked during the tests: when I skim through the test summary, I don't see any sched_process_fork tests.

terror96 avatar May 19 '25 08:05 terror96

I don't see any problem with lowering them, but maybe there is some reason why they are set to those specific values. An alternative would be to programmatically lower those values after fetching the PID_MAX value... @leogr @FedeDP WDYT?

ekoops avatar May 19 '25 08:05 ekoops

I don't see any problem with lowering them, but maybe there is some reason why they are set to those specific values. An alternative would be to programmatically lower those values after fetching the PID_MAX value...

I agree, we could write a small get_test_pid helper that does exactly that, and we should be good to go. Or, we could increase the PID_MAX value before running the test, and set it back after. But I'd prefer the former solution, which does not touch any limit.
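To make the idea concrete, here is one possible shape for such a helper. Only the name get_test_pid comes from this discussion; the signature, the `slot` parameter, the `read_pid_max` and `test_pid_below` functions, and the fallback value are assumptions made for this sketch, not the project's implementation.

```cpp
#include <sys/types.h>  // pid_t

#include <cassert>
#include <fstream>

// Read kernel.pid_max from procfs.
// Assumption: fall back to 32768 if the file is unreadable.
long read_pid_max() {
    std::ifstream in("/proc/sys/kernel/pid_max");
    long value = 0;
    if (in >> value && value > 1) {
        return value;
    }
    return 32768;
}

// Pure part of the helper: slot 0 maps to pid_max - 1, slot 1 to
// pid_max - 2, and so on, so distinct slots give distinct TIDs.
pid_t test_pid_below(long pid_max, int slot) {
    assert(slot >= 0 && slot < pid_max - 1);  // keeps the result >= 1
    return static_cast<pid_t>(pid_max - 1 - slot);
}

// Hypothetical get_test_pid: pick a TID that fits under the current limit.
pid_t get_test_pid(int slot = 0) {
    return test_pid_below(read_pid_max(), slot);
}
```

A test could then write `pid_t p2_t1 = get_test_pid(0);` instead of a literal such as 57006. Like any fixed choice, the returned value can still collide with a TID that is already in use, so this only addresses the pid_max limit.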

FedeDP avatar May 19 '25 09:05 FedeDP

I agree, we could write a small get_test_pid helper that does exactly that, and we should be good to go.

Yes, I think we can proceed with this.

ekoops avatar May 19 '25 09:05 ekoops

A quick update: just wanted to confirm (as expected) that a quick replace of "6XXXX" and "5XXXX" values with "2XXXX" worked fine in all tests mentioned above.

Some other issue(s) still cause SyscallExit.recvmmsgX_ipv4_tcp_multiple_messages, SyscallExit.sendmmsg_multiple_messages_ipv4, SyscallExit.sendmmsg_multiple_messages_ipv6 and GenericTracepoints.page_fault_kernel to fail on SLES, though. I need to investigate more.

terror96 avatar May 19 '25 09:05 terror96

I'll open new issues (2 issues) for those failing tests, because the reason is different than in this issue. I'm grouping tests based on the reason for failure.

terror96 avatar May 19 '25 11:05 terror96

I'll open new issues (2 issues) for those failing tests, because the reason is different than in this issue. I'm grouping tests based on the reason for failure.

https://github.com/falcosecurity/libs/issues/2412 https://github.com/falcosecurity/libs/issues/2413

terror96 avatar May 20 '25 07:05 terror96

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana avatar Aug 18 '25 10:08 poiana

/remove-lifecycle stale

leogr avatar Aug 18 '25 13:08 leogr

Closing this issue because both linked issues are now solved.

terror96 avatar Oct 13 '25 12:10 terror96