kibana
[Profiling] Reduce CPU usage for flamegraph and TopN function
Part of https://github.com/elastic/prodfiler/issues/2630. Re-opens https://github.com/elastic/kibana/pull/140774.
This PR reduces how much work is needed to generate a flamegraph.
The original motivation was to reduce CPU usage and response time for flamegraphs. An unintended side effect is that the TopN functions API also gets a modest performance improvement (see benchmark results below).
Several commits address the following minor but beneficial updates:
- inlined flamegraph post-processing pipeline into respective route
- additional fine-grained instrumentation to measure flamegraph via APM and logger
- miscellaneous micro-optimizations (e.g. pre-allocation, pre-calculation, moving calls out of the hot path)
- remove extraneous intermediate data structure (CallerCalleeIntermediateNode)
- miscellaneous refactoring
The bulk of the performance improvements falls into two categories:
- minimize or eliminate unnecessary allocations
- relax properties where appropriate to reduce CPU-related activity
In the first category, after collecting the data from Elasticsearch, the original code first constructs trace metadata for later use (via groupStackFrameMetadataByStackTrace). It then constructs a tree out of the stack traces (via createCallerCalleeIntermediateRoot), merging similar sibling frames based on some criteria (FrameGroup).
The problem is that the two steps handle very different numbers of stack frames: on local data, the first step processed about 300k frames while the second processed about 50k, roughly 80% fewer. This means that we create many StackFrameMetadata objects that are never used, which increases pressure on the GC and wastes CPU.
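The tree-building step can be sketched as a trie insertion keyed by frame group. This is a minimal sketch, not the actual createCallerCalleeIntermediateRoot: the `TreeNode` and `Trace` shapes and the `buildTree` helper are hypothetical names chosen for illustration, and frame groups are reduced to plain string IDs.

```typescript
// Hypothetical node shape; the real Kibana types are not shown in this PR description.
interface TreeNode {
  frameGroupID: string;
  samples: number;
  children: Map<string, TreeNode>;
}

// Hypothetical input: one stack trace as root-to-leaf frame group IDs, plus its sample count.
interface Trace {
  frameGroupIDs: string[];
  count: number;
}

function createNode(frameGroupID: string): TreeNode {
  return { frameGroupID, samples: 0, children: new Map<string, TreeNode>() };
}

function buildTree(traces: Trace[]): TreeNode {
  const root = createNode('root');
  for (const trace of traces) {
    root.samples += trace.count;
    let node = root;
    for (const id of trace.frameGroupIDs) {
      // Sibling frames that share a frame group ID are merged into one node.
      let child = node.children.get(id);
      if (child === undefined) {
        child = createNode(id);
        node.children.set(id, child);
      }
      child.samples += trace.count;
      node = child;
    }
  }
  return root;
}
```

Because siblings are merged, the tree usually holds far fewer nodes than the number of frames fed into it, which is the imbalance the next paragraph describes.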
I resolved this by delaying object creation until absolutely necessary via the LazyStackFrameMetadata interface. This interface only contains the associated frame group, frame group ID, and an index into the array where the relevant frame information will be pulled from.
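A minimal sketch of the idea follows. Only the three members of LazyStackFrameMetadata (frame group, frame group ID, array index) come from the description above; the field names, the `RawFrame` shape, the ID format, and the `toLazy`, `materialize`, and `uniqueFrameMetadata` helpers are assumptions made for illustration.

```typescript
// Hypothetical raw frame as it might arrive from Elasticsearch.
interface RawFrame {
  fileID: string;
  functionName: string;
  sourceLine: number;
}

// Hypothetical grouping key; the real FrameGroup criteria are not shown here.
interface FrameGroup {
  fileID: string;
  functionName: string;
}

// Lightweight handle: frame group, its ID, and an index into the raw frame array.
interface LazyStackFrameMetadata {
  frameGroup: FrameGroup;
  frameGroupID: string;
  frameIndex: number;
}

// The full object, built only for frames that end up in the result.
interface StackFrameMetadata extends RawFrame {
  frameGroupID: string;
}

function toLazy(frame: RawFrame, frameIndex: number): LazyStackFrameMetadata {
  const frameGroup: FrameGroup = { fileID: frame.fileID, functionName: frame.functionName };
  // Assumed ID format, for illustration only.
  const frameGroupID = `${frameGroup.fileID};${frameGroup.functionName}`;
  return { frameGroup, frameGroupID, frameIndex };
}

function materialize(lazy: LazyStackFrameMetadata, frames: RawFrame[]): StackFrameMetadata {
  return { ...frames[lazy.frameIndex], frameGroupID: lazy.frameGroupID };
}

// Keep one handle per frame group, then materialize only the survivors.
function uniqueFrameMetadata(frames: RawFrame[]): StackFrameMetadata[] {
  const firstByGroup = new Map<string, LazyStackFrameMetadata>();
  frames.forEach((frame, index) => {
    const lazy = toLazy(frame, index);
    if (!firstByGroup.has(lazy.frameGroupID)) {
      firstByGroup.set(lazy.frameGroupID, lazy);
    }
  });
  return Array.from(firstByGroup.values()).map((lazy) => materialize(lazy, frames));
}
```

In this sketch the number of full StackFrameMetadata objects equals the number of distinct frame groups rather than the number of input frames.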
In the second category, we originally sorted sibling nodes in the constructed tree by samples and frame group. Profiling showed that this sorting took a significant amount of time. So instead of using the frame group as the secondary sort key, we now sort by samples and frame group ID. This is faster without losing the determinism we want when generating a flamegraph.
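The cheaper ordering can be sketched as a two-key comparator. This is an illustration under assumptions: the `SiblingNode` shape and `compareSiblings` name are hypothetical, frame group IDs are assumed to be strings, and descending order by samples is a guess, since the description does not state the direction.

```typescript
// Hypothetical sibling node; only the two sort keys matter here.
interface SiblingNode {
  frameGroupID: string;
  samples: number;
}

// Primary key: samples (descending, assumed). Tie-break: frame group ID,
// compared as a plain string so equal-sample siblings get a stable, deterministic order.
function compareSiblings(a: SiblingNode, b: SiblingNode): number {
  if (a.samples !== b.samples) {
    return b.samples - a.samples;
  }
  if (a.frameGroupID < b.frameGroupID) return -1;
  if (a.frameGroupID > b.frameGroupID) return 1;
  return 0;
}

const siblings: SiblingNode[] = [
  { frameGroupID: 'c', samples: 5 },
  { frameGroupID: 'a', samples: 5 },
  { frameGroupID: 'b', samples: 9 },
];
const sortedSiblings = [...siblings].sort(compareSiblings);
```

Comparing one precomputed ID per node avoids comparing several frame group fields on every call, while the tie-break still makes the output independent of input order.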
Flamegraph Benchmarks
main is our baseline for comparison and flamegraph is the branch associated with this PR.
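All metrics for main are in milliseconds. The flamegraph rows show the percentage change relative to the main row with the same Seconds value.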
| Experiment | Seconds | Minimum | Q1 | Median | Q3 | P90 | P95 | P99 | Maximum |
|---|---|---|---|---|---|---|---|---|---|
| main | 900 | 2481 | 2542 | 2574 | 2625 | 2721 | 2781 | 2918 | 4141 |
| main | 1800 | 2688 | 2741 | 2769 | 2805 | 2928 | 2958 | 3015 | 3029 |
| main | 3600 | 2498 | 2556 | 2585 | 2625 | 2753 | 2792 | 2829 | 2938 |
| main | 86400 | 2637 | 2701 | 2733 | 2770 | 2859 | 2907 | 2972 | 3028 |
| flamegraph | 900 | -13.78% | -13.26% | -13.01% | -11.50% | -11.43% | -11.79% | -6.44% | -26.32% |
| flamegraph | 1800 | -11.83% | -11.38% | -11.48% | -11.44% | -13.52% | -12.64% | -12.04% | -11.29% |
| flamegraph | 3600 | -14.61% | -14.51% | -14.43% | -14.70% | -17.51% | -17.87% | -17.50% | -14.87% |
| flamegraph | 86400 | -12.48% | -12.18% | -11.93% | -12.09% | -13.82% | -13.59% | -13.63% | -11.29% |
TopN Functions Benchmarks
main is our baseline for comparison and functions is the branch associated with this PR.
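All metrics for main are in milliseconds. The functions rows show the percentage change relative to the main row with the same Seconds value.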
| Experiment | Seconds | Minimum | Q1 | Median | Q3 | P90 | P95 | P99 | Maximum |
|---|---|---|---|---|---|---|---|---|---|
| main | 900 | 2030 | 2093 | 2117 | 2146 | 2181 | 2246 | 2356 | 2366 |
| main | 1800 | 2269 | 2338 | 2374 | 2408 | 2453 | 2546 | 2632 | 2663 |
| main | 3600 | 2057 | 2106 | 2129 | 2157 | 2198 | 2318 | 2370 | 2404 |
| main | 86400 | 2147 | 2221 | 2245 | 2277 | 2325 | 2397 | 2461 | 2488 |
| functions | 900 | -2.81% | -3.73% | -3.64% | -3.45% | -3.62% | -5.34% | -7.89% | -5.37% |
| functions | 1800 | -1.94% | -2.22% | -2.57% | -2.57% | -2.77% | -4.67% | -4.33% | -4.09% |
| functions | 3600 | -4.47% | -3.94% | -3.90% | -3.80% | -3.59% | -6.82% | -5.99% | -5.74% |
| functions | 86400 | -3.73% | -3.65% | -3.61% | -3.78% | -3.27% | -3.46% | -5.12% | -5.06% |
@elasticmachine merge upstream
Is there some code missing, perhaps? You benchmarked, but I can't build (same errors as the CI reports).
@rockdaboot The groupStackFrameMetadataByStackTrace function was missing because I removed it, thinking it was no longer needed. It's only needed for the TopN stacktraces, so the above benchmarks are still valid.
@elasticmachine merge upstream
:green_heart: Build Succeeded
- Buildkite Build
- Commit: 70b5b85fa4d7662e6fff13af34f69f01859cbc3d
Metrics [docs]
Async chunks
Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app
| id | before | after | diff |
|---|---|---|---|
| profiling | 369.5KB | 369.6KB | +77.0B |
History
- :broken_heart: Build #74859 failed 6bd71b7b8b19ded310659f44a54d3c26746d7c8c
- :yellow_heart: Build #74680 was flaky 2e08b0ced57517b65d8cdbf7ceb59eea9cb2a5dc
- :green_heart: Build #74580 succeeded cb673bd660237d6b73f8393e60ed912f31241c73
- :broken_heart: Build #74293 failed 81c1524fa62d6a78fbd8a9b0f2ca5f6e67e542c3
- :green_heart: Build #74243 succeeded 01d3095f1ada1f16d333fcd012708327afdc3afc
- :green_heart: Build #74161 succeeded 8d21964004361bc822ca98cbba16793779f264e2
To update your PR or re-run it, just comment with:
@elasticmachine merge upstream
Good news: There is a good speedup when testing locally. ~~Bad news: The flamegraph looks different now (compared with main).~~
Huh, for some reason I didn't have all your commits locally. Now the flamegraphs look the same! Sorry for that.
💚 All backports created successfully
| Status | Branch | Result |
|---|---|---|
| ✅ | 8.5 |
Note: Successful backport PRs will be merged automatically after passing CI.
Questions ?
Please refer to the Backport tool documentation