profiler Efficient representation of long idle stacks

When a thread is running on the CPU, we sample it at a certain rate, and we record a timestamp and a stack for each sample. When a thread is not running on the CPU, we still want to know where it's spending its time, i.e. what stack it is blocked in and for how long.

We currently don't have a good way to represent the latter efficiently.

A thread can be blocked for a long time. At the moment, we sample it at a regular rate during the blocked state, and we put each sample into the profile individually, including each sample's timestamp. But those timestamps take up a lot of space in the JSON and are not very interesting. We really only want to know roughly when the thread started blocking and when it stopped - we want a start and stop timestamp, and a sample count. As for CPU deltas, the situation is the following: The first sample in an "blocked time range" usually has some greater than zero CPU delta. All the following samples have a CPU delta of zero.

With today's processed profile format, we can try to use of the weight column. We can pick the meaning of the sample weight by specifying a weightType: It can be "samples", "tracing-ms", or "bytes". So we would specify "samples", so that a single sample with weight N gets counted as N samples. But then what do we do with the sample timestamp? And where do we put the CPU delta of the first sample?

Here's a solution that works with today's format: We can have two samples per off-CPU range. One sample with weight 1 at the start of the range, which carries the non-zero leftover CPU delta, and one sample with weight N at the end of the range, with CPU delta zero; N being the number of samples that were collapsed into this end-of-range sample. And here's an example profile with that solution.

It works ok, but there are two problems with this solution:

The "sample timestamp" graph with the blue dots has gaps. These gaps make it look like we were not able to sample. But in fact we were able to sample just fine during that duration; we just didn't emit a sample with that timestamp. And, luckily, clicking in those gaps selects the correct stack. That's because those clicks go between the "start sample" and the "end sample" of the blocked duration, and both of those samples have the same stack (the blocked stack), and clicking in the space between these samples will target one of those samples.
When you select a time range, the numbers in the call tree and flame graph "jump" as you drag past the weighted samples. The "blocked range" is included in the selection on an all-or-nothing basis: Either its full weight is included, or nothing. This is not good. If I select half of the blocked range, I would like the number in the call tree to be the sample count of half of the blocked range.

I'd like to find a solution which can do the following:

In the "sample timestamp" graph, draw wide rectangles for "long samples".
When selecting a part of the "long sample", the weight in the call tree / flame graph should be proportional to the selected fraction.
The call tree and flame graph should still display sample counts, not precise durations. After all, we are still dealing with sampling data, not with tracing data. (This disqualifies using weightType: "tracing-ms".)

I think these requirements call for a different format to represent this data.

Jun 20 '22 20:06 mstange

Here's one proposal:

We could have both a weight column and a duration column on the samples table. For on-CPU samples, the weight would be 1 and the duration would be 0. For off-CPU samples, the weight would be N and the duration would be the difference between the first and last observed timestamp in the blocked range. And the sample's timestamp would be the first observed timestamp in the blocked range. Effectively, samples with a non-zero duration would have a "start timestamp" and an "end timestamp", with the start timestamp being sample.timestamp and the end timestamp being (sample.timestamp + sample.duration).

The weight would determine the sample count that's displayed in the call tree and flame graph. The duration would determine the length of the blue rectangle in the sample timestamp graph. For partial selections, the sample would contribute a count of round(sample.weight * (selectedDurationIntersectingSample / sample.duration)). And CPU deltas would work as follows:

The CPU delta on the weighted sample describes the CPU delta before the sample.
For the duration of the sample, i.e. between sample.timestamp and (sample.timestamp + sample.duration), CPU delta is considered to be zero.
The CPU delta of the next sample is between the weighted sample's "end time" and the timestamp of the next sample.

Jun 20 '22 21:06 mstange

When you select a time range, the numbers in the call tree and flame graph "jump" as you drag past the weighted samples. The "blocked range" is included in the selection on an all-or-nothing basis: Either its full weight is included, or nothing. This is not good. If I select half of the blocked range, I would like the number in the call tree to be the sample count of half of the blocked range.

https://user-images.githubusercontent.com/961291/174679149-9363a962-bdfc-48d8-8eca-e2ecada67596.mov

Jun 20 '22 21:06 mstange

Here's what this would look like for an example profile. This example profile declares an intended sampling interval of 1ms but ends up sampling every 1.4ms.

All samples

Timestamp    Stack         CPU delta
----------------------------------------------------------
100.0ms      On-CPU #1     0.5ms
101.4ms      Off-CPU #1    0.1ms
102.8ms      Off-CPU #1    0ms
104.2ms      Off-CPU #1    0ms
105.6ms      Off-CPU #1    0ms
107.0ms      On-CPU #2     0.7ms
108.4ms      Off-CPU #2    0.6ms
109.8ms      Off-CPU #2    0ms
111.2ms      Off-CPU #2    0ms
112.6ms      Off-CPU #2    0.1ms  // <-- some on-CPU work happened between these two samples but it wasn't sampled
114.0ms      Off-CPU #2    0ms

With today's format + `weight` column, two samples per blocked range

Timestamp    Stack         CPU delta   Weight
----------------------------------------------------------
100.0ms      On-CPU #1     0.5ms       1
101.4ms      Off-CPU #1    0.1ms       1
105.6ms      Off-CPU #1    0ms         3
107.0ms      On-CPU #2     0.7ms       1
108.4ms      Off-CPU #2    0.6ms       1
111.2ms      Off-CPU #2    0ms         2
112.6ms      Off-CPU #2    0.1ms       1
114.0ms      Off-CPU #2    0ms         1

With proposed format, both `weight` and `duration`, one sample per blocked range

Timestamp    Stack         CPU delta   Weight   Duration
----------------------------------------------------------
100.0ms      On-CPU #1     0.5ms       1        0ms
101.4ms      Off-CPU #1    0.1ms       4        4.2ms
107.0ms      On-CPU #2     0.7ms       1        0ms
108.4ms      Off-CPU #2    0.6ms       3        2.8ms
112.6ms      Off-CPU #2    0.1ms       2        1.4ms

Jun 20 '22 21:06 mstange

profiler profiler copied to clipboard

Efficient representation of long idle stacks

All samples

With today's format + weight column, two samples per blocked range

With proposed format, both weight and duration, one sample per blocked range

profiler
profiler copied to clipboard

With today's format + `weight` column, two samples per blocked range

With proposed format, both `weight` and `duration`, one sample per blocked range