security-profiles-operator

OCPBUGS-62269 - Fix race condition causing missing audit log entries for rapid commands

Open BhargaviGudi opened this issue 3 weeks ago • 7 comments

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes a race condition in the JSON enricher that causes audit log entries to be missing or incomplete when commands are executed in rapid succession (e.g., touch test1 && touch test2 && touch test3).

Problem: When short-lived processes execute quickly, audit events arrive at the JSON enricher before the corresponding BPF events have been processed from the ring buffer. This results in:

  • Empty cmdLine fields
  • Missing requestUID
  • Null resource information (pod/namespace/container)

Solution: Implement a retry mechanism in dispatchSeccompLine() that attempts to fetch missing information up to three times, with progressive delays (0 ms, 10 ms, 50 ms):

  1. BPF Cache Retry: Retries the cmdLine and requestUID lookups against the BPF process cache, giving the BPF ring buffer poller time to process events
  2. Container Info Retry: Retries the container information lookup if it initially returns null

Since LogBuckets remain in cache for 60 seconds before being written to the audit log, there's ample time for retries to succeed while BPF events are processed asynchronously.

Testing Results:

  • Before fix: ~33% success rate for rapid consecutive commands
  • After fix: 100% success rate - all commands captured with complete information

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-62269

Does this PR have test?

Yes - The fix was validated through:

  • Manual testing with rapid consecutive commands (oc exec and oc rsh)
  • Existing unit tests continue to pass
  • Build verification completed successfully

No new automated tests were added, because this is a timing-dependent race condition that is difficult to reproduce reliably in unit tests.

Special notes for your reviewer:

  • This fix is scoped exclusively to jsonenricher.go - no modifications to BPF components or ring buffer configuration
  • The retry mechanism only activates when information is missing, avoiding unnecessary delays for normal cases
  • Progressive delays (0ms → 10ms → 50ms) balance responsiveness with reliability
  • Early break on success minimizes latency impact
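As a rough illustration of the "only activates when information is missing" point, a completeness check along these lines could gate the retries. The `resource` and `auditEntry` types and the `missingFields` helper are hypothetical and exist only for this sketch; the three checked fields are the ones listed in the problem statement.

```go
package main

import "fmt"

// resource and auditEntry are hypothetical types for illustration only.
type resource struct {
	Pod, Namespace, Container string
}

type auditEntry struct {
	CmdLine    string
	RequestUID string
	Resource   *resource
}

// missingFields lists the fields a retry would still need to fill.
// An empty result means no retry (and no delay) is needed.
func missingFields(e auditEntry) []string {
	var missing []string
	if e.CmdLine == "" {
		missing = append(missing, "cmdLine")
	}
	if e.RequestUID == "" {
		missing = append(missing, "requestUID")
	}
	if e.Resource == nil {
		missing = append(missing, "resource")
	}
	return missing
}

func main() {
	complete := auditEntry{
		CmdLine:    "touch test1",
		RequestUID: "example-uid",
		Resource:   &resource{Pod: "p", Namespace: "ns", Container: "c"},
	}
	partial := auditEntry{CmdLine: "touch test2"}

	fmt.Println(len(missingFields(complete)) == 0) // complete entry: skip retries
	fmt.Println(missingFields(partial))            // partial entry: retry these fields
}
```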

Does this PR introduce a user-facing change?

Fixed race condition in JSON enricher causing audit log entries to be missing or incomplete when commands are executed in rapid succession. Audit logs now reliably capture cmdLine, requestUID, and resource information for all executed commands, even during bursts of activity.

BhargaviGudi avatar Nov 27 '25 15:11 BhargaviGudi

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Nov 27 '25 15:11 k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: BhargaviGudi

Once this PR has been reviewed and has the lgtm label, please assign ccojocar for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Nov 27 '25 15:11 k8s-ci-robot

Hi @BhargaviGudi. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

k8s-ci-robot avatar Nov 27 '25 15:11 k8s-ci-robot

Codecov Report

:x: Patch coverage is 28.88889% with 32 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 24.18%. Comparing base (11d77f4) to head (613d628).
:warning: Report is 1055 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3052       +/-   ##
===========================================
- Coverage   45.50%   24.18%   -21.32%     
===========================================
  Files          79      125       +46     
  Lines        7782    17818    +10036     
===========================================
+ Hits         3541     4309      +768     
- Misses       4099    13225     +9126     
- Partials      142      284      +142     

codecov-commenter avatar Nov 27 '25 15:11 codecov-commenter

/ok-to-test

saschagrunert avatar Nov 27 '25 20:11 saschagrunert

/retest-required

BhargaviGudi avatar Dec 01 '25 06:12 BhargaviGudi

@BhargaviGudi do you mind a rebase?

saschagrunert avatar Dec 01 '25 14:12 saschagrunert