[Profiling] Reduce CPU usage for flamegraph and TopN function

Open jbcrail opened this issue 3 years ago • 5 comments

Part of https://github.com/elastic/prodfiler/issues/2630. Re-opens https://github.com/elastic/kibana/pull/140774.

This PR reduces how much work is needed to generate a flamegraph.

The original motivation was to reduce CPU usage and response time for flamegraphs. An unintended side effect is that the TopN functions API also sees a modest performance improvement (see benchmark results below).

Several commits make the following minor but beneficial changes:

  • inlined the flamegraph post-processing pipeline into its respective route
  • added fine-grained instrumentation to measure flamegraph generation via APM and the logger
  • applied miscellaneous micro-optimizations (e.g. pre-allocation, pre-calculation, moving calls out of the hot path)
  • removed an extraneous intermediate data structure (CallerCalleeIntermediateNode)
  • miscellaneous refactoring

The bulk of the performance improvements falls into two categories:

  1. minimize or eliminate unnecessary allocations
  2. relax properties where appropriate to reduce CPU-related activity

In the first category: originally, after collecting the Elasticsearch data, we first constructed trace metadata for later use (i.e. groupStackFrameMetadataByStackTrace). We then constructed a tree out of the stacktraces (i.e. createCallerCalleeIntermediateRoot), merging similar sibling frames based on some criteria (i.e. FrameGroup).

The problem is that the number of stack frames processed in the first step can be much larger than the number that survive into the second step (roughly 80% fewer based on local data: 300k frames vs 50k frames). This means that we create many StackFrameMetadata objects that are never used, which increases pressure on the GC and wastes CPU.

I resolved this by delaying object creation until absolutely necessary via the LazyStackFrameMetadata interface. This interface only contains the associated frame group, frame group ID, and an index into the array where the relevant frame information will be pulled from.
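A minimal sketch of the idea, not the code from this PR: only the three fields named above (frame group, frame group ID, array index) come from the description. The `RawFrame` shape, the ID format, and the `toLazy` and `materialize` helpers are hypothetical and exist only for illustration.

```typescript
// Hypothetical, simplified shapes; the real types carry more fields.
interface RawFrame {
  fileID: string;
  functionName: string;
  sourceLine: number;
}

interface FrameGroup {
  fileID: string;
  functionName: string;
}

// Cheap placeholder: grouping key plus an index into the raw frame array.
interface LazyStackFrameMetadata {
  frameGroup: FrameGroup;
  frameGroupID: string;
  frameIndex: number;
}

// Full metadata, built only for frames that are actually needed.
interface StackFrameMetadata {
  fileID: string;
  functionName: string;
  sourceLine: number;
  label: string;
}

function toLazy(frames: RawFrame[]): LazyStackFrameMetadata[] {
  return frames.map((frame, frameIndex) => ({
    frameGroup: { fileID: frame.fileID, functionName: frame.functionName },
    // Assumed ID format; the real ID is derived differently.
    frameGroupID: frame.fileID + ';' + frame.functionName,
    frameIndex,
  }));
}

function materialize(lazy: LazyStackFrameMetadata, frames: RawFrame[]): StackFrameMetadata {
  const frame = frames[lazy.frameIndex];
  return {
    fileID: frame.fileID,
    functionName: frame.functionName,
    sourceLine: frame.sourceLine,
    label: frame.functionName + ' (' + frame.fileID + ':' + frame.sourceLine + ')',
  };
}

const frames: RawFrame[] = [
  { fileID: 'a', functionName: 'main', sourceLine: 1 },
  { fileID: 'a', functionName: 'main', sourceLine: 2 },
  { fileID: 'b', functionName: 'read', sourceLine: 7 },
];

// Keep the first lazy entry per frame group; later duplicates are merged away.
const firstPerGroup: Record<string, LazyStackFrameMetadata> = {};
for (const lazy of toLazy(frames)) {
  if (!(lazy.frameGroupID in firstPerGroup)) {
    firstPerGroup[lazy.frameGroupID] = lazy;
  }
}

// Only the surviving entries pay the cost of building full metadata.
const full = Object.keys(firstPerGroup).map((id) => materialize(firstPerGroup[id], frames));
```

In this toy input, three raw frames collapse into two groups, so only two full metadata objects are built.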

In the second category, we originally sorted sibling nodes in the constructed tree by samples and frame group. Profiling showed that this sort took a significant amount of time. Instead of sorting partially by frame group, we now sort by samples and frame group ID. This is faster and keeps the determinism we want when generating a flamegraph.
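The comparator below is a sketch of such an ordering, not the code from this PR. The `SiblingNode` shape and the choice of descending sample order are assumptions; the point is that the tie-breaker is a single string comparison on the ID rather than a field-by-field comparison of the frame group.

```typescript
// Hypothetical node shape, reduced to the two fields used for ordering.
interface SiblingNode {
  frameGroupID: string;
  samples: number;
}

// Order by samples (assumed descending), then by frame group ID so that
// nodes with equal sample counts always come out in the same order.
function compareSiblings(a: SiblingNode, b: SiblingNode): number {
  if (a.samples !== b.samples) {
    return b.samples - a.samples;
  }
  if (a.frameGroupID < b.frameGroupID) {
    return -1;
  }
  if (a.frameGroupID > b.frameGroupID) {
    return 1;
  }
  return 0;
}

const siblings: SiblingNode[] = [
  { frameGroupID: 'b', samples: 5 },
  { frameGroupID: 'a', samples: 5 },
  { frameGroupID: 'c', samples: 9 },
];

const sortedIDs = siblings
  .slice()
  .sort(compareSiblings)
  .map((node) => node.frameGroupID);
```

Because the ID breaks every tie, the result does not depend on the input order of the siblings.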

Flamegraph Benchmarks

main is our baseline for comparison and flamegraph is the branch associated with this PR.

Baseline (main) metrics are in milliseconds; the flamegraph rows show the percentage change relative to main.

| Experiment | Seconds | Minimum | Q1 | Median | Q3 | P90 | P95 | P99 | Maximum |
|---|---|---|---|---|---|---|---|---|---|
| main | 900 | 2481 | 2542 | 2574 | 2625 | 2721 | 2781 | 2918 | 4141 |
| main | 1800 | 2688 | 2741 | 2769 | 2805 | 2928 | 2958 | 3015 | 3029 |
| main | 3600 | 2498 | 2556 | 2585 | 2625 | 2753 | 2792 | 2829 | 2938 |
| main | 86400 | 2637 | 2701 | 2733 | 2770 | 2859 | 2907 | 2972 | 3028 |
| flamegraph | 900 | -13.78% | -13.26% | -13.01% | -11.50% | -11.43% | -11.79% | -6.44% | -26.32% |
| flamegraph | 1800 | -11.83% | -11.38% | -11.48% | -11.44% | -13.52% | -12.64% | -12.04% | -11.29% |
| flamegraph | 3600 | -14.61% | -14.51% | -14.43% | -14.70% | -17.51% | -17.87% | -17.50% | -14.87% |
| flamegraph | 86400 | -12.48% | -12.18% | -11.93% | -12.09% | -13.82% | -13.59% | -13.63% | -11.29% |
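To read a branch row as an absolute latency, the percentage can be applied to the matching main row. The helper below is a small sketch under that assumption (that each percentage is relative to the main value in the same column and Seconds row); `applyChange` is a hypothetical name, and the result is derived arithmetic, not a measured value.

```typescript
// Apply a relative change (in percent) to a baseline value in milliseconds.
function applyChange(baselineMs: number, changePercent: number): number {
  return baselineMs * (1 + changePercent / 100);
}

// Median for the 900-second experiment: main is 2574 ms, branch is -13.01%.
const estimatedMedianMs = Math.round(applyChange(2574, -13.01));
```

With these inputs the estimate comes out at about 2239 ms, against 2574 ms on main.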

TopN Functions Benchmarks

main is our baseline for comparison and functions is the branch associated with this PR.

Baseline (main) metrics are in milliseconds; the functions rows show the percentage change relative to main.

| Experiment | Seconds | Minimum | Q1 | Median | Q3 | P90 | P95 | P99 | Maximum |
|---|---|---|---|---|---|---|---|---|---|
| main | 900 | 2030 | 2093 | 2117 | 2146 | 2181 | 2246 | 2356 | 2366 |
| main | 1800 | 2269 | 2338 | 2374 | 2408 | 2453 | 2546 | 2632 | 2663 |
| main | 3600 | 2057 | 2106 | 2129 | 2157 | 2198 | 2318 | 2370 | 2404 |
| main | 86400 | 2147 | 2221 | 2245 | 2277 | 2325 | 2397 | 2461 | 2488 |
| functions | 900 | -2.81% | -3.73% | -3.64% | -3.45% | -3.62% | -5.34% | -7.89% | -5.37% |
| functions | 1800 | -1.94% | -2.22% | -2.57% | -2.57% | -2.77% | -4.67% | -4.33% | -4.09% |
| functions | 3600 | -4.47% | -3.94% | -3.90% | -3.80% | -3.59% | -6.82% | -5.99% | -5.74% |
| functions | 86400 | -3.73% | -3.65% | -3.61% | -3.78% | -3.27% | -3.46% | -5.12% | -5.06% |

jbcrail avatar Sep 20 '22 04:09 jbcrail

@elasticmachine merge upstream

rockdaboot avatar Sep 20 '22 07:09 rockdaboot

Is there some code missing, perhaps? You benchmarked, but I can't build (same errors as the CI reports).

rockdaboot avatar Sep 20 '22 10:09 rockdaboot

@rockdaboot The groupStackFrameMetadataByStackTrace function was missing since I removed it thinking that it was no longer needed. It's only needed for the TopN stacktraces, so the above benchmarks are still valid.

jbcrail avatar Sep 20 '22 18:09 jbcrail

@elasticmachine merge upstream

jbcrail avatar Sep 20 '22 18:09 jbcrail

:green_heart: Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
|---|---|---|---|
| profiling | 369.5KB | 369.6KB | +77.0B |

History

  • :broken_heart: Build #74859 failed 6bd71b7b8b19ded310659f44a54d3c26746d7c8c
  • :yellow_heart: Build #74680 was flaky 2e08b0ced57517b65d8cdbf7ceb59eea9cb2a5dc
  • :green_heart: Build #74580 succeeded cb673bd660237d6b73f8393e60ed912f31241c73
  • :broken_heart: Build #74293 failed 81c1524fa62d6a78fbd8a9b0f2ca5f6e67e542c3
  • :green_heart: Build #74243 succeeded 01d3095f1ada1f16d333fcd012708327afdc3afc
  • :green_heart: Build #74161 succeeded 8d21964004361bc822ca98cbba16793779f264e2

To update your PR or re-run it, just comment with: @elasticmachine merge upstream

kibana-ci avatar Sep 22 '22 14:09 kibana-ci

Good news: There is a good speedup when testing locally. ~~Bad news: The flamegraph looks different now (compared with main).~~

rockdaboot avatar Sep 22 '22 15:09 rockdaboot

Huh, for some reason I didn't have all your commits locally. Now the flamegraphs look the same! Sorry for that.

rockdaboot avatar Sep 22 '22 16:09 rockdaboot

💚 All backports created successfully

Status Branch Result
8.5

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine avatar Sep 22 '22 16:09 kibanamachine