Perf Profiler's state checking frequently crashes in debug builds

Open ddelnano opened this issue 2 years ago • 1 comments

When deploying debug builds, I'm seeing the perf profiler's integrity checking firing often (source). This was a mechanism to ensure that we detect if the buffers allocated for the profiler are at risk of overrun. The condition is behind a DCHECK so it causes crashes for debug builds.

F20230821 02:38:38.536855 19627 perf_profile_connector.cc:370] Check failed: error_code == kPerfProfilerStatusOk (1 vs. 0)
*** Check failure stack trace: ***
E20230821 02:38:38.536998 19627 signal_action.cc:63] Caught Aborted, suspect faulting address 0x49a8. Trace:

The frequency of this seems to have increased, making it painful to run debug builds. I think we need to track down why this is occurring and adjust the thresholds to ensure the condition isn't met.

Reproduction Steps:

Deploy vizier via skaffold -- skaffold run -f skaffold/skaffold_vizier.yaml
Verify that PEMs crash with error above after some period of time

App information:

Pixie version: v0.14.5

Aug 24 '23 22:08 ddelnano

We reproduced the issue and observed kOverflowError. This means the profiler collected more stack traces than expected in a given time interval. The profiler assumes that it will transfer data from eBPF maps/tables into user space every 30 seconds. Thus, the data structures are provisioned for 30 seconds worth of data, plus some margin. This overflow error risks data loss (e.g. BPF stack traces table filling up or perf buffer wrapping around).

Running on a cluster with two nodes, both pems experienced the issue once about 1 minute after startup and then continued with 30 minutes of normal error free operation. Also, for the instance of the error condition, there was greater than one minute in between invocations of TransferDataImpl.

We conclude that there is some other work being done during pem startup that prevents the profiler from collecting data on according to its normal 30 second interval.

Sep 08 '23 22:09 etep