netobserv-ebpf-agent

NETOBSERV-2148: Switch PCA feature from using perf events to ringbuf

Open msherif1234 opened this issue 9 months ago • 27 comments

Description

Using perf events not only gives much lower performance than ringbuf, it also forces the application to run in privileged mode because of kernel restrictions.

This PR migrates PCA to use ringbuf.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • [ ] Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • [ ] Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • [ ] Does this PR require product documentation?
    • [ ] If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • [ ] Does this PR require a product release notes entry?
    • [ ] If so, fill in "Release Note Text" in the JIRA.
  • [ ] Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • [ ] If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • [ ] Standard QE validation, with pre-merge tests unless stated otherwise.
    • [ ] Regression tests only (e.g. refactoring with no user-facing change).
    • [ ] No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

msherif1234 avatar Mar 03 '25 13:03 msherif1234

/ok-to-test

msherif1234 avatar Mar 03 '25 13:03 msherif1234

New images: quay.io/netobserv/ebpf-bytecode:3dcf15b quay.io/netobserv/netobserv-ebpf-agent:3dcf15b

These will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=3dcf15b make set-agent-image

github-actions[bot] avatar Mar 03 '25 13:03 github-actions[bot]

Tested with the CLI:

USER=netobserv NETOBSERV_AGENT_IMAGE=quay.io/netobserv/netobserv-ebpf-agent:3dcf15b COMMAND_ARGS="--protocol=TCP --port=80" make packets

msherif1234 avatar Mar 03 '25 14:03 msherif1234

@msherif1234: This pull request references NETOBSERV-2148 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Mar 03 '25 14:03 openshift-ci-robot

  • Before:
15565: sched_cls  name tcx_egress_pca_parse  tag cf185091d59c5f15  gpl
	loaded_at 2025-03-03T11:54:45-0500  uid 0
	xlated 7384B  jited 4740B  memlock 12288B  map_ids 615,616,617,618,619
	btf_id 650
	pids netobserv-ebpf-(1147708)
15566: sched_cls  name tcx_ingress_pca_parse  tag 6e1c4d2436defe26  gpl
	loaded_at 2025-03-03T11:54:45-0500  uid 0
	xlated 7384B  jited 4737B  memlock 12288B  map_ids 615,616,617,618,619
	btf_id 651
	pids netobserv-ebpf-(1147708)

sudo perf stat -e cycles,instructions --bpf-prog 15565 --timeout 10000
 Performance counter stats for 'BPF program(s) 15565':

         2,798,598      cycles                                                                
           940,169      instructions                     #    0.34  insn per cycle            

      10.012628932 seconds time elapsed

sudo perf stat -e cycles,instructions --bpf-prog 15566 --timeout 10000
Performance counter stats for 'BPF program(s) 15566':

         2,661,480      cycles                                                                
           831,513      instructions                     #    0.31  insn per cycle            

      10.011295383 seconds time elapsed
  • After:
15634: sched_cls  name tcx_egress_pca_parse  tag 3b823b0d1e696fd4  gpl
	loaded_at 2025-03-03T12:02:29-0500  uid 0
	xlated 7984B  jited 5020B  memlock 12288B  map_ids 687,688,689,690,691
	btf_id 738
	pids netobserv-ebpf-(1152392)
15635: sched_cls  name tcx_ingress_pca_parse  tag 8784a6295ce1517f  gpl
	loaded_at 2025-03-03T12:02:29-0500  uid 0
	xlated 7984B  jited 5017B  memlock 12288B  map_ids 687,688,689,690,691
	btf_id 739
	pids netobserv-ebpf-(1152392)

sudo perf stat -e cycles,instructions --bpf-prog 15634 --timeout 10000

 Performance counter stats for 'BPF program(s) 15634':

         1,064,322      cycles                                                                
           388,642      instructions                     #    0.37  insn per cycle            

      10.012311676 seconds time elapsed

sudo perf stat -e cycles,instructions --bpf-prog 15635 --timeout 10000

 Performance counter stats for 'BPF program(s) 15635':

         2,018,524      cycles                                                                
           644,693      instructions                     #    0.32  insn per cycle            

      10.012645276 seconds time elapsed
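
To compare the counters above, here is a minimal Python sketch (not part of the PR) that computes the percentage drop in cycles and instructions for each program. The figures are copied from the `perf stat` output. The helper names `percent_reduction` and `summarize` are made up for illustration. These are totals over a roughly 10-second window, so they also depend on how much traffic each program saw during the run. The drop is therefore only a rough indication, not a per-packet cost.

```python
# Counter totals copied from the `perf stat` runs above (about 10 s each).
BEFORE = {
    "tcx_egress_pca_parse": {"cycles": 2_798_598, "instructions": 940_169},
    "tcx_ingress_pca_parse": {"cycles": 2_661_480, "instructions": 831_513},
}
AFTER = {
    "tcx_egress_pca_parse": {"cycles": 1_064_322, "instructions": 388_642},
    "tcx_ingress_pca_parse": {"cycles": 2_018_524, "instructions": 644_693},
}


def percent_reduction(before: int, after: int) -> float:
    """Return how much smaller `after` is than `before`, in percent."""
    return 100.0 * (before - after) / before


def summarize(before: dict, after: dict) -> dict:
    """Map each program name to the percent reduction of every counter."""
    return {
        prog: {
            name: percent_reduction(value, after[prog][name])
            for name, value in counters.items()
        }
        for prog, counters in before.items()
    }


if __name__ == "__main__":
    for prog, row in summarize(BEFORE, AFTER).items():
        print(f"{prog}: cycles -{row['cycles']:.1f}%, "
              f"instructions -{row['instructions']:.1f}%")
```

Instructions per cycle stay almost the same between the two runs (0.34 vs 0.37 and 0.31 vs 0.32), so most of the difference comes from fewer instructions executed during the window.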

msherif1234 avatar Mar 03 '25 17:03 msherif1234

/ok-to-test

msherif1234 avatar Mar 05 '25 11:03 msherif1234

New images: quay.io/netobserv/ebpf-bytecode:2d61633 quay.io/netobserv/netobserv-ebpf-agent:2d61633

These will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=2d61633 make set-agent-image

github-actions[bot] avatar Mar 05 '25 11:03 github-actions[bot]

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jotak. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Mar 07 '25 14:03 openshift-ci[bot]

/ok-to-test

msherif1234 avatar Mar 07 '25 14:03 msherif1234

New images: quay.io/netobserv/ebpf-bytecode:5781985 quay.io/netobserv/netobserv-ebpf-agent:5781985

These will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=5781985 make set-agent-image

github-actions[bot] avatar Mar 07 '25 14:03 github-actions[bot]

/lgtm

jotak avatar Mar 10 '25 15:03 jotak

@msherif1234 All QE backend e2e tests are failing with eBPF daemonset not getting ready. Could you PTAL?

Amoghrd avatar Mar 10 '25 16:03 Amoghrd

/test qe-e2e-tests

msherif1234 avatar Mar 10 '25 16:03 msherif1234

@msherif1234 All QE backend e2e tests are failing with eBPF daemonset not getting ready. Could you PTAL?

@Amoghrd my changes should have no impact on regular agent functionality. They are limited to the PCA feature, which the e2e tests don't exercise. I've rerun the tests to see whether this is consistent or a flake.

msherif1234 avatar Mar 10 '25 16:03 msherif1234

/retest

Amoghrd avatar Mar 10 '25 21:03 Amoghrd

/ok-to-test

memodi avatar Mar 11 '25 01:03 memodi

New images: quay.io/netobserv/ebpf-bytecode:1355cdd quay.io/netobserv/netobserv-ebpf-agent:1355cdd

These will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=1355cdd make set-agent-image

github-actions[bot] avatar Mar 11 '25 01:03 github-actions[bot]

/ok-to-test

msherif1234 avatar Mar 11 '25 12:03 msherif1234

New images: quay.io/netobserv/ebpf-bytecode:a8fc15b quay.io/netobserv/netobserv-ebpf-agent:a8fc15b

These will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=a8fc15b make set-agent-image

github-actions[bot] avatar Mar 11 '25 12:03 github-actions[bot]

/test images
/test qe-e2e-tests

msherif1234 avatar Mar 11 '25 13:03 msherif1234

/hold https://issues.redhat.com/browse/RHEL-83254

msherif1234 avatar Mar 11 '25 13:03 msherif1234

/test images
/test qe-e2e-tests

msherif1234 avatar Mar 11 '25 17:03 msherif1234

/test images
/test qe-e2e-tests

msherif1234 avatar Mar 12 '25 10:03 msherif1234

/ok-to-test

msherif1234 avatar Mar 18 '25 10:03 msherif1234

/hold

msherif1234 avatar Mar 26 '25 11:03 msherif1234

/hold https://issues.redhat.com/browse/RHEL-83254

@jotak can you please monitor the progress of the above fix? Once it lands, we need to bring this PR in. We might be able to drop the MONITOR capability after this change is in.

msherif1234 avatar May 15 '25 11:05 msherif1234

/hold https://issues.redhat.com/browse/RHEL-83254

@jotak can you please monitor the progress of the above fix? Once it lands, we need to bring this PR in. We might be able to drop the MONITOR capability after this change is in.

yep, I'm watching it 👍

jotak avatar May 16 '25 11:05 jotak

FYI, RHEL-83254 is closed / merged

jotak avatar Aug 11 '25 07:08 jotak

(rebased without conflict)

jotak avatar Aug 11 '25 07:08 jotak

/lgtm

@memodi / @Amoghrd, to summarize what IMO needs to be double-checked here:

  1. That this PR doesn't affect agents deployed from the operator for flow collection (i.e. no regression), even on older OpenShift (pre-RHEL 9.4, such as 4.12 or 4.14).
  2. Similarly, no regression when used with the CLI for flows, even on OCP 4.12 / 4.14.
  3. No regression with the CLI pcap on OCP based on RHEL 9.4 or above (e.g. OCP 4.16 and 4.19).
  4. We expect a regression with the CLI pcap on OCP 4.12 / 4.14 / 4.15. In that case, it's necessary to pass a different agent image in the CLI args, such as agent 1.9, as a workaround.

(5. Also, it's worth checking for regressions on s390 in particular, because there's a change related to endianness (probably post-merge?))
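
The first four points can be read as a small decision table. Here is a minimal Python sketch of it, meant only as a reading aid and not as code from the agent or the CLI. The function name `expected_outcome` and the feature labels `"flows"` and `"pcap"` are hypothetical. The 4.16 cutoff is an assumption inferred from the examples in the list (4.15 is named as regressing, 4.16 as not). The s390 point is not modeled.

```python
# Assumption: OCP 4.16 is the first minor release in the "RHEL 9.4 or above"
# group, inferred from the versions named in the list above.
FIRST_MINOR_WITHOUT_PCAP_REGRESSION = (4, 16)


def parse_version(ocp_version: str) -> tuple:
    """Turn a string such as '4.14' or '4.19.2' into (major, minor)."""
    major, minor = ocp_version.split(".")[:2]
    return int(major), int(minor)


def expected_outcome(feature: str, ocp_version: str) -> str:
    """Return what the checklist above expects for one test configuration.

    `feature` is 'flows' (operator or CLI flow collection) or 'pcap'
    (CLI packet capture); both labels are hypothetical names.
    """
    if feature == "flows":
        # Points 1 and 2: flow collection must not regress on any version.
        return "no regression"
    if feature == "pcap":
        if parse_version(ocp_version) >= FIRST_MINOR_WITHOUT_PCAP_REGRESSION:
            # Point 3: newer clusters should keep working.
            return "no regression"
        # Point 4: older clusters need the workaround.
        return "regression expected: pass an older agent image (e.g. 1.9)"
    raise ValueError(f"unknown feature: {feature!r}")


if __name__ == "__main__":
    for feature in ("flows", "pcap"):
        for version in ("4.12", "4.14", "4.15", "4.16", "4.19"):
            print(feature, version, "->", expected_outcome(feature, version))
```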

jotak avatar Aug 11 '25 07:08 jotak