tracee icon indicating copy to clipboard operation
tracee copied to clipboard

some integration tests time out in minimal NixOS test env

Open timhae opened this issue 6 months ago • 8 comments

Description

Please treat this as low urgency. I am trying to package the latest version of tracee for NixOS and have issue making all of the integration tests pass:

=== RUN   Test_EventsDependencies/non_existing_probe_function_with_sanity
    logger.go:19:   --- setting test logger ---
    dependencies_test.go:146:   --- started tracee ---
    event_filters_test.go:2483:   >>> running: cp /run/current-system/sw/bin/ls /tmp/ls
    tracee.go:219:   . waiting for at least 0 event(s) for 5s
    tracee.go:226:   . got 0 event(s) so far
    tracee.go:228:   . done waiting for 0 event(s)
    event_filters_test.go:2483:   >>> running: /tmp/ls
    tracee.go:208:   . waiting events collection for 1s
    tracee.go:219:   . waiting for at least 1 event(s) for 5s
    tracee.go:226:   . got 0 event(s) so far
    tracee.go:226:   . got 0 event(s) so far
    tracee.go:226:   . got 0 event(s) so far
    tracee.go:226:   . got 0 event(s) so far
    tracee.go:226:   . got 0 event(s) so far
    tracee.go:232:   . done waiting for 1 event(s)
    dependencies_test.go:184: Test Test_EventsDependencies/non_existing_probe_function_with_sanity failed: timed out on waiting for 1 event(s)
    logger.go:31:   --- restoring default logger ---
    dependencies_test.go:208:   --- stopped tracee ---
=== NAME  Test_EventsDependencies
    dependencies_test.go:216: found unexpected goroutines:
        [Goroutine 282 in state chan send, with github.com/aquasecurity/tracee/tests/testutils.TestLogs.func1 on top of the stack:
        github.com/aquasecurity/tracee/tests/testutils.TestLogs.func1()
                github.com/aquasecurity/tracee/tests/testutils/logger.go:118 +0x265
        created by github.com/aquasecurity/tracee/tests/testutils.TestLogs in goroutine 281
                github.com/aquasecurity/tracee/tests/testutils/logger.go:92 +0x157
         Goroutine 349 in state chan send, with github.com/aquasecurity/tracee/tests/testutils.TestLogs.func1 on top of the stack:
        github.com/aquasecurity/tracee/tests/testutils.TestLogs.func1()
                github.com/aquasecurity/tracee/tests/testutils/logger.go:118 +0x265
        created by github.com/aquasecurity/tracee/tests/testutils.TestLogs in goroutine 348
                github.com/aquasecurity/tracee/tests/testutils/logger.go:92 +0x157
         Goroutine 416 in state chan send, with github.com/aquasecurity/tracee/tests/testutils.TestLogs.func1 on top of the stack:
        github.com/aquasecurity/tracee/tests/testutils.TestLogs.func1()
                github.com/aquasecurity/tracee/tests/testutils/logger.go:118 +0x265
        created by github.com/aquasecurity/tracee/tests/testutils.TestLogs in goroutine 415
                github.com/aquasecurity/tracee/tests/testutils/logger.go:92 +0x157
        ]

do you know what could cause these tests to timeout? I am using the libbpf already packaged in nixpkgs instead of the one in the repo and NixOS doesn't have the same FHS as Ubuntu but other than that there should be no difference.

The following tests are failing, mostly due to timeouts:

          "Test_EventFilters/comm:_event:_data:_trace_event_magic_write_set_in_multiple_policies_using_multiple_filter_types"
          "Test_EventFilters/comm:_event:_data:_trace_event_security_file_open_and_magic_write_using_multiple_filter_types"
          "Test_EventFilters/comm:_event:_data:_trace_event_security_file_open_and_magic_write_using_multiple_filter_types_combined"
          "Test_EventFilters/comm:_event:_data:_trace_event_security_file_open_set_in_multiple_policies_\\(with_and_without_in-kernel_filter\\)"
          "Test_EventFilters/comm:_event:_data:_trace_event_security_file_open_set_in_multiple_policies_using_multiple_filter_types"
          "Test_EventFilters/comm:_event:_data:_trace_event_set_in_a_specific_policy_with_data_from_ls_command"
          "Test_EventFilters/comm:_event:_trace_events_set_in_two_specific_policies_from_ls_and_uname_commands"
          "Test_EventFilters/pid:_event:_data:_trace_event_sched_switch_with_data_from_pid_0"
          "Test_EventsDependencies/non_existing_ksymbol_dependency_with_sanity"
          "Test_EventsDependencies/non_existing_probe_function_with_sanity"
          "Test_EventsDependencies/sanity_of_exec_test_event"
          "Test_TraceeCapture/capture_packet_context"

Output of tracee version:

[root@machine:~]# /nix/store/2pvmmgvbmqlncv3ar02wr74810lvkvbk-tracee-0.23.1/bin/tracee version
Tracee version: v0.23.1

Output of uname -a:

[root@machine:~]# uname -a
Linux machine 6.12.28 #1-NixOS SMP PREEMPT_DYNAMIC Fri May  9 07:50:53 UTC 2025 x86_64 GNU/Linux

Additional details

timhae avatar May 27 '25 15:05 timhae

@timhae thanks for reporting us. @NDStrahilevitz @rscampos does this ring some bell?

geyslan avatar May 27 '25 20:05 geyslan

@timhae I've tried to replicate it without success so far. It would be possible to get a failing -test.shuffle seed from your env? You'll see it at the start of the tests, like:

-test.shuffle 1748898113057198113
=== RUN   Test_EventsDependencies
...

geyslan avatar Jun 02 '25 21:06 geyslan

Thanks for looking into this, I have uploaded the complete logs here (line 1 to ~1000 are integration test logs, you can skip those). Unfortunately I don't see any shuffle entry in there..

A couple of notes regarding the packaging process:

  • besides using the nixpkgs libbpf we patch a couple of source files here,those are replacements in the corresponding source files to make the tests compatible with the filesystem layout in NixOS
  • test binary is built like this
  • and here is the test setup that copies some binaries to the expected locations

none of those modifications change anything in a substantial way, at least to my understanding.

timhae avatar Jun 04 '25 15:06 timhae

I noticed that the integration tests in your package are not currently running with test shuffling enabled: https://github.com/aquasecurity/tracee/blob/16f89483756c5b082621ff4179ae94200aabb651/Makefile#L895

If possible, consider enabling it. Shuffling test execution can help surface hidden test causality by running them in a non-deterministic order. When a failure occurs, the seed can be used to reproduce the issue more consistently.

geyslan avatar Jun 04 '25 17:06 geyslan

Thanks for the feedback, I have changed the integration test package to be more like the Makefile target:

    CGO_LDFLAGS="$(pkg-config --libs libbpf)" go test -tags core,ebpf,integration -race -c -o $GOPATH/tracee-integration/ ./tests/integration/...

and adapted the call accordingly: mkdir /tmp/integration && cd /tmp/integration && integration.test -test.v -test.shuffle=on -test.count=1 -test.parallel=1, that produces this log.

timhae avatar Jun 04 '25 17:06 timhae

that produces this log.

Nice, we'll work on that soon. Tks.

geyslan avatar Jun 04 '25 19:06 geyslan

@timhae even using the seed I wasn't able to reproduce the goroutine leak. Anyway, I've pushed a change that might solve this. Would you mind testing it?

geyslan avatar Jun 13 '25 21:06 geyslan

I was seeing different integration tests timeout randomly. Found that the VM had only 3.5GB of memory and swapping was slowing everything down.

djoreilly avatar Jun 20 '25 09:06 djoreilly

sorry for not getting back earlier, I have tried the patch that you linked @geyslan logs are here I also increased the resources of the test vm (1 core to 4 and 1gb of memory to 4gb, there is no swap so no swapping) now only these tests fail:

           Test_EventFilters/comm:_event:_data:_trace_event_security_file_open_and_magic_write_using_multiple_filter_types
           Test_EventFilters/comm:_event:_data:_trace_event_security_file_open_set_in_multiple_policies_(with_and_without_in-kernel_filter)
           Test_EventFilters/comm:_event:_data:_trace_event_security_mmap_file_using_multiple_filter_types
           Test_EventFilters/comm:_event:_data:_trace_event_set_in_a_specific_policy_with_data_from_ls_command
           Test_EventFilters/comm:_event:_trace_events_set_in_two_specific_policies_from_ls_and_uname_commands
           Test_EventsDependencies/non_existing_ksymbol_dependency_with_sanity
           Test_EventsDependencies/non_existing_probe_function_with_sanity
           Test_EventsDependencies/sanity_of_exec_test_event

I can try increasing the memory even more and check if that resolves the remaining timeouts since the github runners use 16gb of ram

timhae avatar Jun 24 '25 16:06 timhae

I've made a bunch of changes/fixes in this branch https://github.com/geyslan/tracee/tree/4799-fix-integration, would you mind testing it again?

geyslan avatar Jun 27 '25 22:06 geyslan

Sure, logs are here. Seems like increasing the memory to 16 GB did not help with the timeouts

timhae avatar Jun 29 '25 17:06 timhae

Ok, increasing the memory is probably just misaligning the stars. The timeout ok, it should happend depending on the resources available, however the goroutine leak is a problem. Anyhow, please check again the latest changes and let me know the results on which env setup.

geyslan avatar Jun 30 '25 13:06 geyslan

thanks for investigating this issue. I have captured the output of the latest main branch here

timhae avatar Jul 05 '25 18:07 timhae

So, are you still facing timeouts even with 4 cores and 4GB? If so, would you mind checking the events that are failing individually by running the tracee binary instead of the integration tests. If we still get failures, please reopen this.

geyslan avatar Jul 07 '25 12:07 geyslan