ddprof
overhead of perf?
I use `perf -e cpu -a -g` to run profiling continuously on a machine.
What is the overhead for the profiled processes?
As far as I know, `perf_event_open` will cause a context switch in the user program whenever the counter is triggered?
👋
`perf` and `ddprof` have similar overheads, so I think I can answer this one.
There are a few different sources of overhead, but generally speaking it works out like this:
- During initialization, `perf` (and `ddprof`) call `perf_event_open()` a few times per CPU core to set up instrumentation. There are a few ways to use `perf_event_open()` in order to receive messages from the kernel, but both tools opt for a shared-memory ring buffer with the kernel (see the sketch after this list).
- Depending on what event type (`-e`) you use, different instrumentation points are enabled in the kernel. The precise overhead depends on the sampling frequency (`-F`) as well, since that determines when the perf event subsystem will collect and transmit additional data. I find the collection mechanism to be extremely efficient and it's very difficult for me to detect it, except when I'm running benchmarks (like linpack). I don't have an estimate for the overhead of this instrumentation except that it is "very small": fractions of a percent of each CPU.
- Once `perf` and `ddprof` receive the events collected in the step above, they have to process these events into user-readable stack traces. I think in the default configuration, `-g` uses frame-pointer unwinding. This is pretty fast, since `perf`/`ddprof` only need to navigate a linked list. However, specifying `--call-graph dwarf` forces `perf` to use DWARF unwinding, which is much more accurate, but has much more overhead.
- After unwinding, both `perf` and `ddprof` need to symbolize addresses. In practice, this is pretty expensive (`ddprof` uses some pretty aggressive caching here to reduce overhead; I think `perf` may rely on libdw caches for this).
- Finally, both tools have to serialize the data somehow. `perf` writes to disk and `ddprof` writes to the network. In general, I think `perf`'s serialization is more efficient than ours, since we need to use a more generic container format.
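To make the first bullet a bit more concrete, here is a minimal C sketch of that setup path, assuming a software CPU-clock event sampled at 99 Hz and an 8-page ring buffer; the event choice, frequency, and the `open_one_cpu` helper are illustrative assumptions, not the exact configuration either tool uses.

```c
/* Minimal sketch of the setup described above: open one software CPU-clock
 * event per core and map the shared ring buffer that the kernel writes
 * samples into. Error handling and the read loop are omitted; the values
 * here are illustrative, not ddprof's or perf's defaults. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags) {
    /* glibc provides no wrapper for this syscall */
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int open_one_cpu(int cpu) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_CPU_CLOCK;   /* a "cpu-clock"-style event */
    attr.freq = 1;
    attr.sample_freq = 99;                   /* like "-F 99" */
    attr.sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_CALLCHAIN;

    /* pid = -1, cpu = <cpu>: monitor every process on this core */
    int fd = (int)perf_event_open(&attr, -1, cpu, -1, 0);
    if (fd < 0)
        return -1;

    /* 1 metadata page + 2^n data pages shared with the kernel */
    size_t map_size = (1 + 8) * (size_t)sysconf(_SC_PAGESIZE);
    void *ring = mmap(NULL, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED)
        return -1;

    /* The profiler then polls fd and consumes perf_event_header records
     * from the ring buffer. */
    return fd;
}
```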
Of these sources of overhead:

1. Is a one-time setup cost that mostly runs before your application even starts. It might delay application start by a few milliseconds, but not much. The overhead scales with the number of cores.
2. Is an overhead that mostly occurs within the Linux kernel itself. A lot of people think this is the scariest and most significant form of overhead from both `perf` and `ddprof`, but it's actually small compared to the next sources. I would estimate this at less than 0.1% for most workloads (although you can make this very large by changing the sampling interval to some extremely high value or by attaching to many event types).
3. Unwinding is pretty expensive. I don't know what it is for `perf`, but with all the caches and optimization in `ddprof`, it can still use up 1-2% of CPU when DWARF is enabled. For frame pointers, it's much less (see the sketch after this list).
4. I'm not sure about the `perf` symbolization overhead. Our caches are different. I include it in the 1-2% unwinding overhead.
5. I don't have precise numbers for serialization.
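On the "navigating a linked list" point from item 3: this hypothetical walker shows why frame-pointer unwinding is cheap on x86-64 when code is built with frame pointers. It is not perf's or ddprof's actual implementation; a real unwinder also has to validate every pointer against the target's memory map before dereferencing it.

```c
/* Illustrative sketch: with -fno-omit-frame-pointer on x86-64, each saved
 * RBP points at the caller's saved RBP, and the word after it is the return
 * address, so unwinding is just a linked-list walk. */
#include <stddef.h>
#include <stdint.h>

struct frame {
    struct frame *prev_fp;   /* saved RBP of the caller */
    uint64_t      ret_addr;  /* return address pushed by `call` */
};

size_t walk_framepointers(const struct frame *fp, uint64_t *out, size_t max) {
    size_t depth = 0;
    while (fp != NULL && depth < max) {
        out[depth++] = fp->ret_addr; /* record one frame */
        fp = fp->prev_fp;            /* follow the "linked list" */
    }
    return depth;
}
```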
These are rough estimates from internal studies we've done with ddprof, and I'm assuming that perf is in the same category of overhead. If you're looking for really granular amounts, you should note that:
1. Some forms of overhead (category 2, mostly) will block the instrumented process, and possibly other processes, every time certain operations are done. They can be managed slightly by modifying the sampling frequency (or by using the hardware counters when they're available, like on dedicated/bare-metal EC2 instances, since they're slightly faster).
2. Other forms of overhead (categories 3-5) consume CPU in a manner that doesn't directly block the instrumented application, but does consume significant resources. This type of overhead will be hard to quantify except in the context of your specific application.
If you're trying to estimate the overhead of this type of instrumentation (indeed, any kind of profiling or observability), here are my own, personal recommendations. You may have your own discipline here already and my ideas might be too stupid for your purposes, but I'll share them anyway in case there is value.
- You need to start out with a definition of "overhead" that is practical for your application. This is almost always a measure of "work per unit time" (throughput) or "time per unit of work" (latency). It will depend on what your application does and how you or your own users derive value from it. For instance, some of our batch-processing systems only care about aggregate throughput, but many of our customer-facing services have latency requirements we need to uphold.
- You should record baseline data in a manner that respects your load. For instance, if your load can spike by 100x when there's a sale happening, and maintaining service during those spikes is important, then those spikes should be represented in your data.
- Comparisons should be done against distributions rather than averages (see the sketch after this list). You may find that instrumentation has very little effect on your application in general, but causes the slowest 5% of your requests to run 5x slower (or something).
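Here is a tiny, hypothetical sketch of that kind of comparison; the latency numbers are made up, and the nearest-rank percentile is only there to illustrate looking at the tail rather than the mean.

```c
/* Compare distributions rather than averages: given request latencies
 * recorded with and without the profiler attached, look at p95 (or p99)
 * instead of the mean. Data and method are for illustration only. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

static double percentile(double *v, size_t n, double p) {
    qsort(v, n, sizeof(*v), cmp_double);
    size_t idx = (size_t)(p * (double)(n - 1)); /* nearest-rank, good enough here */
    return v[idx];
}

int main(void) {
    double baseline[] = {1.0, 1.1, 1.2, 1.3, 5.0}; /* ms, made-up numbers */
    double profiled[] = {1.0, 1.2, 1.3, 1.4, 9.0};
    size_t n = sizeof(baseline) / sizeof(*baseline);

    printf("p95 baseline: %.2f ms\n", percentile(baseline, n, 0.95));
    printf("p95 profiled: %.2f ms\n", percentile(profiled, n, 0.95));
    return 0;
}
```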
I apologize that this summary isn't very precise, but an in-depth study should probably be related to a given workload. In general though, we find that with DWARF unwinding enabled, we rarely see services hit more than ~2% overhead (and usually much less).
> Other forms of overhead (categories 3-5) consume CPU in a manner that doesn't directly block the instrumented application
You said "Other forms of overhead (categories 3-5) consume CPU in a manner that doesn't directly block the instrumented application". Wouldn't unwinding the stack cause the sampled application to enter kernel mode from user mode and be blocked at the same time? Because whether unwinding is done through frame pointer or dwarf, the user program's stack needs to be used.
If I only collect stacks with `perf record -e cpu -a -g`, without symbolization (which will be done asynchronously on another machine), then the overhead is basically the cost of switching the user program from user mode to kernel mode (frame-pointer stack unwinding takes on the order of nanoseconds)?
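For reference, a simplified sketch of how I understand the two call-graph modes map onto `perf_event_attr`: with frame pointers the kernel walks the chain inside the sampling interrupt and emits it as a callchain, while with `--call-graph dwarf` it only copies registers plus a slice of the user stack, and the expensive DWARF unwinding happens later in the profiler process, outside the sampled task. The register mask and stack size below are illustrative values, not what `perf` actually picks.

```c
/* Sketch of the sample_type difference between the two unwinding modes.
 * Values are illustrative; a real profiler sets many more fields. */
#include <linux/perf_event.h>
#include <string.h>

void configure_fp(struct perf_event_attr *attr) {
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    /* Kernel walks the frame-pointer chain at sample time. */
    attr->sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_CALLCHAIN;
}

void configure_dwarf(struct perf_event_attr *attr) {
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    /* Kernel only copies registers and a chunk of user stack; the
     * profiler does DWARF unwinding in userspace afterwards. */
    attr->sample_type = PERF_SAMPLE_TID
                      | PERF_SAMPLE_REGS_USER
                      | PERF_SAMPLE_STACK_USER;
    attr->sample_regs_user  = 0x7ff;   /* arch-specific register mask (example) */
    attr->sample_stack_user = 8192;    /* bytes of user stack copied per sample */
}
```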
We are currently working on an eBPF approach, which will allow some comparison of the overhead. You can refer to that work here: https://github.com/DataDog/dd-otel-host-profiler
Closing this discussion for now.