kibana [Profiling] Reduce CPU usage for flamegraph and TopN function

Part of https://github.com/elastic/prodfiler/issues/2630. Re-opens https://github.com/elastic/kibana/pull/140774.

This PR reduces how much work is needed to generate a flamegraph.

The original motivation was to reduce CPU usage and response time for flamegraphs. An unintended side effect is that the TopN functions API also sees a modest performance improvement (see benchmark results below).

Several commits address the following minor but beneficial updates:

inlined the flamegraph post-processing pipeline into its respective route
added fine-grained instrumentation to measure flamegraph generation via APM and the logger
miscellaneous micro-optimizations (e.g. pre-allocation, pre-calculation, moving calls out of the hot path)
removed an extraneous intermediate data structure (CallerCalleeIntermediateNode)
miscellaneous refactoring

The bulk of the performance improvements falls into two categories:

minimize or eliminate unnecessary allocations
relax properties where appropriate to reduce CPU-related activity

In the first category: originally, after collecting the Elasticsearch data, we first constructed trace metadata for later use (i.e. groupStackFrameMetadataByStackTrace). Then we constructed a tree out of the stacktraces (i.e. createCallerCalleeIntermediateRoot), merging similar sibling frames based on some criteria (i.e. FrameGroup).
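To illustrate the tree-building step described above, here is a minimal sketch of merging sibling frames while inserting stacktraces into a tree. It is not the code from this PR: `CallTreeNode`, `buildCallTree`, and the idea of keying children by a string frame group ID are assumptions made for the example.

```typescript
// Hypothetical node shape; the real tree nodes carry more metadata.
interface CallTreeNode {
  frameGroupID: string;
  samples: number;
  children: Map<string, CallTreeNode>;
}

function newCallTreeNode(frameGroupID: string): CallTreeNode {
  return { frameGroupID, samples: 0, children: new Map<string, CallTreeNode>() };
}

// Each trace is a root-first list of frame group IDs plus a sample count
// (an assumed input format for this sketch).
function buildCallTree(traces: Array<{ frames: string[]; count: number }>): CallTreeNode {
  const root = newCallTreeNode('root');
  for (const trace of traces) {
    let current = root;
    current.samples += trace.count;
    for (const id of trace.frames) {
      // Siblings with the same frame group ID are merged into one node.
      let child = current.children.get(id);
      if (child === undefined) {
        child = newCallTreeNode(id);
        current.children.set(id, child);
      }
      child.samples += trace.count;
      current = child;
    }
  }
  return root;
}
```

Two traces that share a prefix end up sharing the nodes for that prefix, so the tree usually holds far fewer nodes than the total number of frames in the input.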

The problem is that the two steps process very different numbers of stack frames: the second step handles far fewer than the first (roughly 80% fewer based on local data: 300k frames vs 50k frames). This means that we create many StackFrameMetadata objects that are never used, which increases pressure on the GC and wastes CPU.

I resolved this by delaying object creation until absolutely necessary via the LazyStackFrameMetadata interface. This interface only contains the associated frame group, frame group ID, and an index into the array where the relevant frame information will be pulled from.
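A rough sketch of the lazy approach, for readers who have not looked at the diff. The field names and helper functions below (`LazyFrame`, `RawFrame`, `toLazy`, `materialize`, `materializeUnique`) are hypothetical and simplified; the actual LazyStackFrameMetadata interface in this PR may differ.

```typescript
// Assumed raw frame data as it might arrive from Elasticsearch.
interface RawFrame {
  fileName: string;
  functionName: string;
}

// Full metadata object: comparatively expensive to allocate for every frame.
interface FullFrameMetadata {
  frameGroupID: string;
  fileName: string;
  functionName: string;
}

// Lightweight placeholder: a frame group ID plus an index into the raw array.
interface LazyFrame {
  frameGroupID: string;
  frameIndex: number;
}

let materializedCount = 0;

function toLazy(frames: RawFrame[]): LazyFrame[] {
  return frames.map((f, i) => ({
    frameGroupID: `${f.fileName}:${f.functionName}`, // assumed grouping key
    frameIndex: i,
  }));
}

// Build the full object only when it is actually needed.
function materialize(lazy: LazyFrame, frames: RawFrame[]): FullFrameMetadata {
  const raw = frames[lazy.frameIndex];
  if (raw === undefined) {
    throw new Error(`no frame at index ${lazy.frameIndex}`);
  }
  materializedCount++;
  return {
    frameGroupID: lazy.frameGroupID,
    fileName: raw.fileName,
    functionName: raw.functionName,
  };
}

// Materialize one full object per distinct frame group instead of one per frame.
function materializeUnique(lazy: LazyFrame[], frames: RawFrame[]): Map<string, FullFrameMetadata> {
  const out = new Map<string, FullFrameMetadata>();
  for (const l of lazy) {
    if (!out.has(l.frameGroupID)) {
      out.set(l.frameGroupID, materialize(l, frames));
    }
  }
  return out;
}
```

In this sketch, the number of full objects allocated tracks the number of frames that survive merging rather than the number of frames fetched.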

In the second category, we originally sorted sibling nodes in the constructed tree by samples and frame group. Profiling showed that this sorting took a significant amount of time. So instead of sorting partially by frame group, we now sort by samples and frame group ID. This is faster and does not lose the determinism we want when generating a flamegraph.
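As an illustration only, a comparator along these lines sorts siblings by samples and falls back to the frame group ID to break ties. The `FlameNode` shape, the string-typed ID, and the descending order on samples are assumptions for this sketch, not details taken from the PR.

```typescript
// Hypothetical sibling node: only the fields the sort needs.
interface FlameNode {
  frameGroupID: string;
  samples: number;
}

// Assumed ordering: more samples first, then frame group ID ascending.
function compareSiblings(a: FlameNode, b: FlameNode): number {
  if (a.samples !== b.samples) {
    return b.samples - a.samples;
  }
  // A single ID comparison replaces a field-by-field frame group comparison.
  if (a.frameGroupID < b.frameGroupID) return -1;
  if (a.frameGroupID > b.frameGroupID) return 1;
  return 0;
}

// Returns a sorted copy so the caller's array is left untouched.
function sortSiblings(nodes: FlameNode[]): FlameNode[] {
  return nodes.slice().sort(compareSiblings);
}
```

As long as sibling IDs are distinct, the result does not depend on the input order, which is what keeps the rendered flamegraph stable between requests.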

Flamegraph Benchmarks

main is our baseline for comparison and flamegraph is the branch associated with this PR.

All metrics for main are in milliseconds. The flamegraph rows show the percentage change relative to the corresponding main row.

Experiment	Seconds	Minimum	Q1	Median	Q3	P90	P95	P99	Maximum
main	900	2481	2542	2574	2625	2721	2781	2918	4141
main	1800	2688	2741	2769	2805	2928	2958	3015	3029
main	3600	2498	2556	2585	2625	2753	2792	2829	2938
main	86400	2637	2701	2733	2770	2859	2907	2972	3028
flamegraph	900	-13.78%	-13.26%	-13.01%	-11.50%	-11.43%	-11.79%	-6.44%	-26.32%
flamegraph	1800	-11.83%	-11.38%	-11.48%	-11.44%	-13.52%	-12.64%	-12.04%	-11.29%
flamegraph	3600	-14.61%	-14.51%	-14.43%	-14.70%	-17.51%	-17.87%	-17.50%	-14.87%
flamegraph	86400	-12.48%	-12.18%	-11.93%	-12.09%	-13.82%	-13.59%	-13.63%	-11.29%

TopN Functions Benchmarks

main is our baseline for comparison and functions is the branch associated with this PR.

All metrics for main are in milliseconds. The functions rows show the percentage change relative to the corresponding main row.

Experiment	Seconds	Minimum	Q1	Median	Q3	P90	P95	P99	Maximum
main	900	2030	2093	2117	2146	2181	2246	2356	2366
main	1800	2269	2338	2374	2408	2453	2546	2632	2663
main	3600	2057	2106	2129	2157	2198	2318	2370	2404
main	86400	2147	2221	2245	2277	2325	2397	2461	2488
functions	900	-2.81%	-3.73%	-3.64%	-3.45%	-3.62%	-5.34%	-7.89%	-5.37%
functions	1800	-1.94%	-2.22%	-2.57%	-2.57%	-2.77%	-4.67%	-4.33%	-4.09%
functions	3600	-4.47%	-3.94%	-3.90%	-3.80%	-3.59%	-6.82%	-5.99%	-5.74%
functions	86400	-3.73%	-3.65%	-3.61%	-3.78%	-3.27%	-3.46%	-5.12%	-5.06%

Sep 20 '22 04:09 jbcrail

@elasticmachine merge upstream

Sep 20 '22 07:09 rockdaboot

Is there perhaps some code missing? You benchmarked, but I can't build (same errors as the CI reports).

Sep 20 '22 10:09 rockdaboot

@rockdaboot The groupStackFrameMetadataByStackTrace function was missing since I removed it thinking that it was no longer needed. It's only needed for the TopN stacktraces, so the above benchmarks are still valid.

Sep 20 '22 18:09 jbcrail

@elasticmachine merge upstream

Sep 20 '22 18:09 jbcrail

:green_heart: Build Succeeded

Buildkite Build
Commit: 70b5b85fa4d7662e6fff13af34f69f01859cbc3d

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`profiling`	369.5KB	369.6KB	+77.0B

History

:broken_heart: Build #74859 failed 6bd71b7b8b19ded310659f44a54d3c26746d7c8c
:yellow_heart: Build #74680 was flaky 2e08b0ced57517b65d8cdbf7ceb59eea9cb2a5dc
:green_heart: Build #74580 succeeded cb673bd660237d6b73f8393e60ed912f31241c73
:broken_heart: Build #74293 failed 81c1524fa62d6a78fbd8a9b0f2ca5f6e67e542c3
:green_heart: Build #74243 succeeded 01d3095f1ada1f16d333fcd012708327afdc3afc
:green_heart: Build #74161 succeeded 8d21964004361bc822ca98cbba16793779f264e2

To update your PR or re-run it, just comment with: @elasticmachine merge upstream

Sep 22 '22 14:09 kibana-ci

Good news: There is a good speedup when testing locally. ~~Bad news: The flamegraph looks different now (compared with main).~~

Sep 22 '22 15:09 rockdaboot

Huh, for some reason I didn't have all your commits locally. Now the flamegraphs look the same! Sorry for that.

Sep 22 '22 16:09 rockdaboot

💚 All backports created successfully

Status	Branch	Result
✅	8.5

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

Sep 22 '22 16:09 kibanamachine
