OCPBUGS-62269 - Fix race condition causing missing audit log entries for rapid commands
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a race condition in the JSON enricher that causes audit log entries to be missing or incomplete when commands are executed in rapid succession (e.g., touch test1 && touch test2 && touch test3).
Problem: When short-lived processes execute quickly, audit events arrive at the JSON enricher before the corresponding BPF events have been processed from the ring buffer. This results in:
- Empty cmdLine fields
- Missing requestUID
- Null resource information (pod/namespace/container)
Solution:
Implements a retry mechanism in dispatchSeccompLine() that attempts to fetch missing information multiple times with progressive delays (0ms, 10ms, 50ms):
- BPF Cache Retry: Retries cmdLine and requestUID lookup from BPF process cache, giving the BPF ring buffer poller time to process events
- Container Info Retry: Retries container information lookup if initially null
Since LogBuckets remain in cache for 60 seconds before being written to the audit log, there's ample time for retries to succeed while BPF events are processed asynchronously.
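The retry described above can be sketched as follows. This is a minimal illustration, not the code in jsonenricher.go: the helper name lookupWithRetry and its signature are assumptions made for the example, and only the delay values (0ms, 10ms, 50ms) and the early break come from this PR description.

```go
package main

import "time"

// retryDelays mirrors the progressive delays named in this PR (0ms, 10ms, 50ms).
var retryDelays = []time.Duration{0, 10 * time.Millisecond, 50 * time.Millisecond}

// lookupWithRetry is a hypothetical helper, not a function from jsonenricher.go.
// It calls lookup once per configured delay and returns as soon as lookup
// reports that the value is available.
func lookupWithRetry(lookup func() (string, bool)) (string, bool) {
	for _, delay := range retryDelays {
		if delay > 0 {
			// Give the BPF ring buffer poller time to process pending events.
			time.Sleep(delay)
		}
		if value, ok := lookup(); ok {
			return value, true // early break on success
		}
	}
	// Still missing after all attempts; the caller keeps whatever it had.
	return "", false
}
```

Because the first delay is zero, a lookup that succeeds immediately pays no extra latency; only entries that are still missing wait, for at most 60ms in total.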
Testing Results:
- Before fix: ~33% success rate for rapid consecutive commands
- After fix: 100% success rate - all commands captured with complete information
Which issue(s) this PR fixes:
Fixes https://issues.redhat.com/browse/OCPBUGS-62269
Does this PR have test?
Yes - The fix was validated through:
- Manual testing with rapid consecutive commands (oc exec and oc rsh)
- Existing unit tests continue to pass
- Build verification completed successfully
No new automated tests added as this is a timing-dependent race condition that's difficult to reliably reproduce in unit tests.
Special notes for your reviewer:
- This fix is scoped exclusively to jsonenricher.go - no modifications to BPF components or ring buffer configuration
- The retry mechanism only activates when information is missing, avoiding unnecessary delays for normal cases
- Progressive delays (0ms → 10ms → 50ms) balance responsiveness with reliability
- Early break on success minimizes latency impact
Does this PR introduce a user-facing change?
Fixed race condition in JSON enricher causing audit log entries to be missing or incomplete when commands are executed in rapid succession. Audit logs now reliably capture cmdLine, requestUID, and resource information for all executed commands, even during bursts of activity.
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: BhargaviGudi
Once this PR has been reviewed and has the lgtm label, please assign ccojocar for approval. For more information see the Code Review Process.
Hi @BhargaviGudi. Thanks for your PR.
I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
Codecov Report
:x: Patch coverage is 28.88889% with 32 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 24.18%. Comparing base (11d77f4) to head (613d628).
:warning: Report is 1055 commits behind head on main.
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3052       +/-   ##
===========================================
- Coverage   45.50%   24.18%   -21.32%
===========================================
  Files          79      125       +46
  Lines        7782    17818    +10036
===========================================
+ Hits         3541     4309      +768
- Misses       4099    13225     +9126
- Partials      142      284      +142
/ok-to-test
/retest-required
@BhargaviGudi do you mind a rebase?