netobserv-ebpf-agent
NETOBSERV-2148: Switch PCA feature from using perf events to ringbuf
Description
The PCA feature currently uses perf events, which not only perform much worse than ringbuf but also force the application to run in privileged mode because of kernel restrictions.
This PR migrates PCA to use ringbuf.
Dependencies
n/a
Checklist
If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.
- [ ] Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
- [ ] Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
- [ ] Does this PR require product documentation?
- [ ] If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
- [ ] Does this PR require a product release notes entry?
- [ ] If so, fill in "Release Note Text" in the JIRA.
- [ ] Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
- [ ] If so, make sure it is described in the JIRA ticket.
- QE requirements (check 1 from the list):
- [ ] Standard QE validation, with pre-merge tests unless stated otherwise.
- [ ] Regression tests only (e.g. refactoring with no user-facing change).
- [ ] No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).
/ok-to-test
New images: quay.io/netobserv/ebpf-bytecode:3dcf15b quay.io/netobserv/netobserv-ebpf-agent:3dcf15b
These will expire after two weeks.
To deploy this build, run from the operator repo, assuming the operator is running:
USER=netobserv VERSION=3dcf15b make set-agent-image
Tested with the CLI:
USER=netobserv NETOBSERV_AGENT_IMAGE=quay.io/netobserv/netobserv-ebpf-agent:3dcf15b COMMAND_ARGS="--protocol=TCP --port=80" make packets
@msherif1234: This pull request references NETOBSERV-2148 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.
- Before:
15565: sched_cls name tcx_egress_pca_parse tag cf185091d59c5f15 gpl
loaded_at 2025-03-03T11:54:45-0500 uid 0
xlated 7384B jited 4740B memlock 12288B map_ids 615,616,617,618,619
btf_id 650
pids netobserv-ebpf-(1147708)
15566: sched_cls name tcx_ingress_pca_parse tag 6e1c4d2436defe26 gpl
loaded_at 2025-03-03T11:54:45-0500 uid 0
xlated 7384B jited 4737B memlock 12288B map_ids 615,616,617,618,619
btf_id 651
pids netobserv-ebpf-(1147708)
sudo perf stat -e cycles,instructions --bpf-prog 15565 --timeout 10000
Performance counter stats for 'BPF program(s) 15565':
2,798,598 cycles
940,169 instructions # 0.34 insn per cycle
10.012628932 seconds time elapsed
sudo perf stat -e cycles,instructions --bpf-prog 15566 --timeout 10000
Performance counter stats for 'BPF program(s) 15566':
2,661,480 cycles
831,513 instructions # 0.31 insn per cycle
10.011295383 seconds time elapsed
- After:
15634: sched_cls name tcx_egress_pca_parse tag 3b823b0d1e696fd4 gpl
loaded_at 2025-03-03T12:02:29-0500 uid 0
xlated 7984B jited 5020B memlock 12288B map_ids 687,688,689,690,691
btf_id 738
pids netobserv-ebpf-(1152392)
15635: sched_cls name tcx_ingress_pca_parse tag 8784a6295ce1517f gpl
loaded_at 2025-03-03T12:02:29-0500 uid 0
xlated 7984B jited 5017B memlock 12288B map_ids 687,688,689,690,691
btf_id 739
pids netobserv-ebpf-(1152392)
sudo perf stat -e cycles,instructions --bpf-prog 15634 --timeout 10000
Performance counter stats for 'BPF program(s) 15634':
1,064,322 cycles
388,642 instructions # 0.37 insn per cycle
10.012311676 seconds time elapsed
sudo perf stat -e cycles,instructions --bpf-prog 15635 --timeout 10000
Performance counter stats for 'BPF program(s) 15635':
2,018,524 cycles
644,693 instructions # 0.32 insn per cycle
10.012645276 seconds time elapsed
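As a rough way to read the before/after numbers above, the relative drop in cycles and instructions can be computed directly from the `perf stat` output. The sketch below is only arithmetic on the figures quoted in this comment; the `sample` type and the `reductionPct` helper are illustrative names and not part of the agent. Each run is a separate 10-second window and the thread does not say the traffic was identical, so the percentages are indicative rather than a controlled benchmark.

```go
package main

import "fmt"

// sample holds one before/after pair copied from the perf stat output above.
// The type and field names are hypothetical, chosen for this illustration.
type sample struct {
	prog   string
	metric string
	before float64
	after  float64
}

// reductionPct returns how much smaller after is than before,
// expressed as a percentage of before.
func reductionPct(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	samples := []sample{
		{"tcx_egress_pca_parse", "cycles", 2798598, 1064322},
		{"tcx_egress_pca_parse", "instructions", 940169, 388642},
		{"tcx_ingress_pca_parse", "cycles", 2661480, 2018524},
		{"tcx_ingress_pca_parse", "instructions", 831513, 644693},
	}
	for _, s := range samples {
		fmt.Printf("%-22s %-12s %5.1f%% lower\n",
			s.prog, s.metric, reductionPct(s.before, s.after))
	}
}
```

On these figures the egress program shows roughly a 62% drop in cycles and the ingress program roughly 24%, with instructions per cycle staying about the same in both cases.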
/ok-to-test
New images: quay.io/netobserv/ebpf-bytecode:2d61633 quay.io/netobserv/netobserv-ebpf-agent:2d61633
These will expire after two weeks.
To deploy this build, run from the operator repo, assuming the operator is running:
USER=netobserv VERSION=2d61633 make set-agent-image
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jotak. For more information see the Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/ok-to-test
New images: quay.io/netobserv/ebpf-bytecode:5781985 quay.io/netobserv/netobserv-ebpf-agent:5781985
These will expire after two weeks.
To deploy this build, run from the operator repo, assuming the operator is running:
USER=netobserv VERSION=5781985 make set-agent-image
/lgtm
@msherif1234 All QE backend e2e tests are failing with eBPF daemonset not getting ready. Could you PTAL?
/test qe-e2e-tests
@Amoghrd my changes should have no impact on regular agent functionality; they are limited to the PCA feature, which isn't something the e2e tests run. I'm rerunning them to see whether this is consistent or a flake.
/retest
/ok-to-test
New images: quay.io/netobserv/ebpf-bytecode:1355cdd quay.io/netobserv/netobserv-ebpf-agent:1355cdd
These will expire after two weeks.
To deploy this build, run from the operator repo, assuming the operator is running:
USER=netobserv VERSION=1355cdd make set-agent-image
/ok-to-test
New images: quay.io/netobserv/ebpf-bytecode:a8fc15b quay.io/netobserv/netobserv-ebpf-agent:a8fc15b
These will expire after two weeks.
To deploy this build, run from the operator repo, assuming the operator is running:
USER=netobserv VERSION=a8fc15b make set-agent-image
/test images /test qe-e2e-tests
/hold https://issues.redhat.com/browse/RHEL-83254
/test images /test qe-e2e-tests
/test images /test qe-e2e-tests
/ok-to-test
/hold
/hold https://issues.redhat.com/browse/RHEL-83254
@jotak can you please monitor the progress of the above fix? Once it lands, we need to bring this PR in; we might be able to drop the MONITOR capability after this change is in.
yep, I'm watching it 👍
FYI, RHEL-83254 is closed / merged
(rebased without conflict)
/lgtm
@memodi / @Amoghrd, to summarize what IMO needs to be double-checked here:
1. That this PR doesn't affect agents deployed from the operator for flow collection (i.e. no regression), even on older OpenShift (pre-RHEL 9.4, such as OCP 4.12 or 4.14).
2. Similarly, no regression when used with the CLI for flows, even on OCP 4.12 / 4.14.
3. No regression with the CLI pcap on OCP based on RHEL 9.4 or above (e.g. OCP 4.16 and 4.19).
4. We expect a regression with the CLI pcap on OCP 4.12 / 4.14 / 4.15. In that case, it's necessary to pass a different agent image in the CLI args, such as agent 1.9, as a workaround.
5. (Also, it will be worth checking for regressions on s390 especially, because there's a change related to endianness; probably post-merge?)