
Fix BPF token permission issues with 6.10 and later kernels

Open ddelnano opened this issue 1 year ago • 2 comments

This is closely related to #2040. Our qemu builds fail the newer BPF token permission checks, so the kernel applies the reduced BPF instruction limit (4096) when loading our programs. We should update our qemu VM image build process so that the 1M instruction limit applies instead.
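As a rough illustration of why the load fails, the size check can be modeled as below. This is a simplified sketch, not the kernel's code: the function names are hypothetical, and the constants are the values I believe the kernel headers define for `BPF_MAXINSNS` and `BPF_COMPLEXITY_LIMIT_INSNS`, so verify them against your kernel tree.

```python
import errno

# Assumed values of the kernel constants (verify against your kernel headers).
BPF_MAXINSNS = 4096                     # limit applied when the loader lacks BPF capability
BPF_COMPLEXITY_LIMIT_INSNS = 1_000_000  # limit applied when the loader has BPF capability


def insn_limit(bpf_capable: bool) -> int:
    """Return the instruction limit for a loader (hypothetical helper)."""
    return BPF_COMPLEXITY_LIMIT_INSNS if bpf_capable else BPF_MAXINSNS


def check_program_size(insn_cnt: int, bpf_capable: bool) -> int:
    """Model the program size check: 0 if accepted, -E2BIG if rejected."""
    if insn_cnt == 0 or insn_cnt > insn_limit(bpf_capable):
        return -errno.E2BIG
    return 0
```

Under this model, a program of 18400 instructions is rejected with `E2BIG` ("Argument list too long") when the capability check fails, and accepted when it passes.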

Logs

$ bazel run -c dbg src/stirling/source_connectors/socket_tracer:dns_trace_bpf_test_qemu_interactive
bash-5.2# src/stirling/source_connectors/socket_tracer/dns_trace_bpf_test
I20241009 14:21:07.577044   135 socket_trace_connector.cc:468] Kernel version greater than V5.1 detected (6.11.1), raised loop limit to 882 and chunk limit to 84
I20241009 14:21:07.578644   135 kernel_version.cc:82] Obtained Linux version string from `uname`: 6.11.1
I20241009 14:21:07.578760   135 linux_headers.cc:381] Detected kernel release (uname -r): 6.11.1
I20241009 14:21:07.580492   135 linux_headers.cc:202] Using Linux headers from: /lib/modules/6.11.1/build and /lib/modules/6.11.1/source.
I20241009 14:21:07.585541   135 bcc_wrapper.cc:166] Initializing BPF program ...
I20241009 14:22:06.109444   135 scoped_timer.h:48] Timer(init_bpf_program) : 58.52 s
bpf: Argument list too long. Program  too large (18400 insns), at most 4096 insns

./src/stirling/source_connectors/socket_tracer/testing/socket_trace_bpf_test_fixture.h:54: Failure
Value of: IsOK(::px::StatusAdapter(source_->Init()))
  Actual: false (Internal : Failed to load syscall__probe_ret_writev: -1)
Expected: true
I20241009 14:22:11.929546   135 container_runner.cc:53] podman rm -f dns_server_52108331944 &>/dev/null
[  FAILED  ] DNSTraceTest.Capture (87555 ms)
[----------] 1 test from DNSTraceTest (87556 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (87557 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] DNSTraceTest.Capture

 1 FAILED TEST
I20241009 14:22:12.809012   135 env.cc:51] Shutting down

ddelnano avatar Oct 09 '24 14:10 ddelnano

This is now confirmed to happen outside of our qemu builds as well, which means Pixie is affected on kernels 6.10 and later. I've uploaded the logs from the reported instance: pixie_logs_20241117132749.zip

ddelnano avatar Nov 18 '24 05:11 ddelnano

With #2047 (available in Vizier v0.14.13), users on affected kernels can get Pixie running with the following PEM CLI flags:

  • --stirling_bpf_loop_limit=41
  • --stirling_enable_mux_tracing=0
  • --stirling_enable_mongodb_tracing=0

Note: disabling mux and mongodb specifically isn't required. Each enabled protocol increases the size of Pixie's BPF program, so disabling any two protocols suffices.
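For anyone scripting this across a fleet, here is a minimal sketch that decides whether a node needs the workaround based on its `uname -r` string. The helper names are hypothetical and not part of Pixie. It assumes that every kernel at 6.10 or later is affected, as described in this issue, and it does not check whether the PEM actually hits the 4096 limit on that node.

```python
import re

# First affected kernel (major, minor), per this issue.
AFFECTED_SINCE = (6, 10)

# PEM flags taken from the workaround above.
WORKAROUND_FLAGS = [
    "--stirling_bpf_loop_limit=41",
    "--stirling_enable_mux_tracing=0",
    "--stirling_enable_mongodb_tracing=0",
]


def parse_kernel_release(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a `uname -r` string such as '6.11.1'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def workaround_flags(release: str) -> list[str]:
    """Return the workaround flags if the kernel is in the affected range."""
    if parse_kernel_release(release) >= AFFECTED_SINCE:
        return list(WORKAROUND_FLAGS)
    return []
```

For example, `workaround_flags("6.11.1")` returns the three flags, while `workaround_flags("6.9.12-generic")` returns an empty list.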

ddelnano avatar Dec 06 '24 18:12 ddelnano