tracee
tracee copied to clipboard
some integration tests time out in minimal NixOS test env
Description
Please treat this as low urgency. I am trying to package the latest version of tracee for NixOS and have issue making all of the integration tests pass:
=== RUN Test_EventsDependencies/non_existing_probe_function_with_sanity
logger.go:19: --- setting test logger ---
dependencies_test.go:146: --- started tracee ---
event_filters_test.go:2483: >>> running: cp /run/current-system/sw/bin/ls /tmp/ls
tracee.go:219: . waiting for at least 0 event(s) for 5s
tracee.go:226: . got 0 event(s) so far
tracee.go:228: . done waiting for 0 event(s)
event_filters_test.go:2483: >>> running: /tmp/ls
tracee.go:208: . waiting events collection for 1s
tracee.go:219: . waiting for at least 1 event(s) for 5s
tracee.go:226: . got 0 event(s) so far
tracee.go:226: . got 0 event(s) so far
tracee.go:226: . got 0 event(s) so far
tracee.go:226: . got 0 event(s) so far
tracee.go:226: . got 0 event(s) so far
tracee.go:232: . done waiting for 1 event(s)
dependencies_test.go:184: Test Test_EventsDependencies/non_existing_probe_function_with_sanity failed: timed out on waiting for 1 event(s)
logger.go:31: --- restoring default logger ---
dependencies_test.go:208: --- stopped tracee ---
=== NAME Test_EventsDependencies
dependencies_test.go:216: found unexpected goroutines:
[Goroutine 282 in state chan send, with github.com/aquasecurity/tracee/tests/testutils.TestLogs.func1 on top of the stack:
github.com/aquasecurity/tracee/tests/testutils.TestLogs.func1()
github.com/aquasecurity/tracee/tests/testutils/logger.go:118 +0x265
created by github.com/aquasecurity/tracee/tests/testutils.TestLogs in goroutine 281
github.com/aquasecurity/tracee/tests/testutils/logger.go:92 +0x157
Goroutine 349 in state chan send, with github.com/aquasecurity/tracee/tests/testutils.TestLogs.func1 on top of the stack:
github.com/aquasecurity/tracee/tests/testutils.TestLogs.func1()
github.com/aquasecurity/tracee/tests/testutils/logger.go:118 +0x265
created by github.com/aquasecurity/tracee/tests/testutils.TestLogs in goroutine 348
github.com/aquasecurity/tracee/tests/testutils/logger.go:92 +0x157
Goroutine 416 in state chan send, with github.com/aquasecurity/tracee/tests/testutils.TestLogs.func1 on top of the stack:
github.com/aquasecurity/tracee/tests/testutils.TestLogs.func1()
github.com/aquasecurity/tracee/tests/testutils/logger.go:118 +0x265
created by github.com/aquasecurity/tracee/tests/testutils.TestLogs in goroutine 415
github.com/aquasecurity/tracee/tests/testutils/logger.go:92 +0x157
]
do you know what could cause these tests to timeout? I am using the libbpf already packaged in nixpkgs instead of the one in the repo and NixOS doesn't have the same FHS as Ubuntu but other than that there should be no difference.
The following tests are failing, mostly due to timeouts:
"Test_EventFilters/comm:_event:_data:_trace_event_magic_write_set_in_multiple_policies_using_multiple_filter_types"
"Test_EventFilters/comm:_event:_data:_trace_event_security_file_open_and_magic_write_using_multiple_filter_types"
"Test_EventFilters/comm:_event:_data:_trace_event_security_file_open_and_magic_write_using_multiple_filter_types_combined"
"Test_EventFilters/comm:_event:_data:_trace_event_security_file_open_set_in_multiple_policies_\\(with_and_without_in-kernel_filter\\)"
"Test_EventFilters/comm:_event:_data:_trace_event_security_file_open_set_in_multiple_policies_using_multiple_filter_types"
"Test_EventFilters/comm:_event:_data:_trace_event_set_in_a_specific_policy_with_data_from_ls_command"
"Test_EventFilters/comm:_event:_trace_events_set_in_two_specific_policies_from_ls_and_uname_commands"
"Test_EventFilters/pid:_event:_data:_trace_event_sched_switch_with_data_from_pid_0"
"Test_EventsDependencies/non_existing_ksymbol_dependency_with_sanity"
"Test_EventsDependencies/non_existing_probe_function_with_sanity"
"Test_EventsDependencies/sanity_of_exec_test_event"
"Test_TraceeCapture/capture_packet_context"
Output of tracee version:
[root@machine:~]# /nix/store/2pvmmgvbmqlncv3ar02wr74810lvkvbk-tracee-0.23.1/bin/tracee version
Tracee version: v0.23.1
Output of uname -a:
[root@machine:~]# uname -a
Linux machine 6.12.28 #1-NixOS SMP PREEMPT_DYNAMIC Fri May 9 07:50:53 UTC 2025 x86_64 GNU/Linux
Additional details
@timhae thanks for reporting us. @NDStrahilevitz @rscampos does this ring some bell?
@timhae I've tried to replicate it without success so far. It would be possible to get a failing -test.shuffle seed from your env? You'll see it at the start of the tests, like:
-test.shuffle 1748898113057198113
=== RUN Test_EventsDependencies
...
Thanks for looking into this, I have uploaded the complete logs here (line 1 to ~1000 are integration test logs, you can skip those). Unfortunately I don't see any shuffle entry in there..
A couple of notes regarding the packaging process:
- besides using the nixpkgs libbpf we patch a couple of source files here,those are replacements in the corresponding source files to make the tests compatible with the filesystem layout in NixOS
- test binary is built like this
- and here is the test setup that copies some binaries to the expected locations
none of those modifications change anything in a substantial way, at least to my understanding.
I noticed that the integration tests in your package are not currently running with test shuffling enabled: https://github.com/aquasecurity/tracee/blob/16f89483756c5b082621ff4179ae94200aabb651/Makefile#L895
If possible, consider enabling it. Shuffling test execution can help surface hidden test causality by running them in a non-deterministic order. When a failure occurs, the seed can be used to reproduce the issue more consistently.
Thanks for the feedback, I have changed the integration test package to be more like the Makefile target:
CGO_LDFLAGS="$(pkg-config --libs libbpf)" go test -tags core,ebpf,integration -race -c -o $GOPATH/tracee-integration/ ./tests/integration/...
and adapted the call accordingly: mkdir /tmp/integration && cd /tmp/integration && integration.test -test.v -test.shuffle=on -test.count=1 -test.parallel=1, that produces this log.
@timhae even using the seed I wasn't able to reproduce the goroutine leak. Anyway, I've pushed a change that might solve this. Would you mind testing it?
I was seeing different integration tests timeout randomly. Found that the VM had only 3.5GB of memory and swapping was slowing everything down.
sorry for not getting back earlier, I have tried the patch that you linked @geyslan logs are here I also increased the resources of the test vm (1 core to 4 and 1gb of memory to 4gb, there is no swap so no swapping) now only these tests fail:
Test_EventFilters/comm:_event:_data:_trace_event_security_file_open_and_magic_write_using_multiple_filter_types
Test_EventFilters/comm:_event:_data:_trace_event_security_file_open_set_in_multiple_policies_(with_and_without_in-kernel_filter)
Test_EventFilters/comm:_event:_data:_trace_event_security_mmap_file_using_multiple_filter_types
Test_EventFilters/comm:_event:_data:_trace_event_set_in_a_specific_policy_with_data_from_ls_command
Test_EventFilters/comm:_event:_trace_events_set_in_two_specific_policies_from_ls_and_uname_commands
Test_EventsDependencies/non_existing_ksymbol_dependency_with_sanity
Test_EventsDependencies/non_existing_probe_function_with_sanity
Test_EventsDependencies/sanity_of_exec_test_event
I can try increasing the memory even more and check if that resolves the remaining timeouts since the github runners use 16gb of ram
I've made a bunch of changes/fixes in this branch https://github.com/geyslan/tracee/tree/4799-fix-integration, would you mind testing it again?
Sure, logs are here. Seems like increasing the memory to 16 GB did not help with the timeouts
Ok, increasing the memory is probably just misaligning the stars. The timeout ok, it should happend depending on the resources available, however the goroutine leak is a problem. Anyhow, please check again the latest changes and let me know the results on which env setup.
So, are you still facing timeouts even with 4 cores and 4GB? If so, would you mind checking the events that are failing individually by running the tracee binary instead of the integration tests. If we still get failures, please reopen this.