
feat(bpf): use time window for bpf sampling to replace per call based sampling

Open rootfs opened this issue 1 year ago • 6 comments

Per @vimalk78's findings, per-call BPF sampling shows very large variations in CPU time.

This PR switches to time-window-based sampling. The CPU time is much more consistent and close to the probing results, while the overhead is reduced even further.

Disclaimer: some of the code is generated by ChatGPT.

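| Active Time (ms) | Idle Time (ms) | Average kepler_sched_switch_trace BPF runtime (ns) |
|------------------|----------------|-----------------------------------------------------|
| 5                | 95             | 400                                                 |
| 20               | 80             | 875                                                 |
| 50               | 50             | 1500                                                |
| 80               | 20             | 2100                                                |
| 1000 (default)   | 0              | 2500                                                |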
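
One way to read these numbers is to compare each configuration's duty cycle (active time over active plus idle time) with its runtime relative to the default. The sketch below does only that arithmetic on the figures above; the function names are illustrative and not part of Kepler.

```python
# Rows copied from the table above: (active_ms, idle_ms, avg_bpf_runtime_ns).
ROWS = [
    (5, 95, 400),
    (20, 80, 875),
    (50, 50, 1500),
    (80, 20, 2100),
    (1000, 0, 2500),  # default: sampling is always active
]

DEFAULT_RUNTIME_NS = 2500  # runtime of the default row, used as the baseline


def duty_cycle(active_ms, idle_ms):
    """Fraction of wall time during which sampling is active."""
    return active_ms / (active_ms + idle_ms)


def relative_runtime(runtime_ns, baseline_ns=DEFAULT_RUNTIME_NS):
    """Average BPF runtime as a fraction of the default configuration."""
    return runtime_ns / baseline_ns


def summarize(rows):
    """Return (duty_cycle, relative_runtime) for each table row."""
    return [(duty_cycle(a, i), relative_runtime(r)) for a, i, r in rows]


if __name__ == "__main__":
    for duty, rel in summarize(ROWS):
        print(f"duty cycle {duty:.0%} -> {rel:.0%} of default runtime")
```

By this arithmetic, the 5 ms / 95 ms configuration is active 5% of the time but still costs 16% of the default runtime, so the saving is less than proportional to the idle time. A plausible but unconfirmed explanation is a fixed per-call cost that is paid even while not tracking.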

rootfs, Aug 22 '24 01:08

🤖 SeineSailor

Here is a concise summary of the pull request changes:

Summary: This pull request introduces significant changes to the BPF (Berkeley Packet Filter) implementation, replacing per-call sampling with time window-based sampling. This new approach reduces CPU time variation and overhead. Additionally, a minor internal change is made to the dcgm.Init() function call.

Key Modifications:

  1. Time Window-Based Sampling for BPF: The pull request replaces per-call sampling with time window-based sampling, reducing CPU time variation and overhead. This change affects multiple files, including kepler.bpf.h, exporter.go, kepler_bpfeb.go, kepler_bpfel.go, config.go, test_bpfeb.go, and test_bpfel.go.
  2. Global Parameters and BPF Maps: New global parameters for tracking and non-tracking periods are added, along with two BPF maps to manage the tracking state.
  3. Updated do_kepler_sched_switch_trace Function: The function now checks a tracking flag and updates the sampling state based on elapsed time.
  4. Minor Internal Change: The dcgm.Init() function call is updated to use config.GetDCGMHostEngineEndpoint() instead of config.DCGMHostEngineEndpoint.
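
The tracking-flag behavior described in item 3 can be illustrated with a small user-space simulation. This is only a sketch of the general technique, not the BPF code from the pull request: the class, its field names, and the handling of a zero idle period are all assumptions made for illustration.

```python
from dataclasses import dataclass

MS = 1_000_000  # nanoseconds per millisecond


@dataclass
class WindowSampler:
    active_ns: int            # hypothetical name: length of the tracking period
    idle_ns: int              # hypothetical name: length of the non-tracking period
    tracking: bool = True     # stands in for the tracking-flag BPF map
    window_start_ns: int = 0  # stands in for the window-start timestamp map

    def should_record(self, now_ns: int) -> bool:
        """Called once per sched_switch event; True means the event is sampled."""
        if self.idle_ns == 0:
            # Assumption: no idle period means every event is recorded.
            return True
        elapsed = now_ns - self.window_start_ns
        if self.tracking and elapsed >= self.active_ns:
            # Active window has expired: stop tracking and start the idle window.
            self.tracking = False
            self.window_start_ns = now_ns
        elif not self.tracking and elapsed >= self.idle_ns:
            # Idle window has expired: resume tracking and start a new active window.
            self.tracking = True
            self.window_start_ns = now_ns
        return self.tracking


def count_sampled(sampler, event_times_ns):
    """Count how many of the given event timestamps would be recorded."""
    return sum(1 for t in event_times_ns if sampler.should_record(t))


if __name__ == "__main__":
    sampler = WindowSampler(active_ns=20 * MS, idle_ns=80 * MS)
    events = [i * MS for i in range(200)]  # one event per millisecond
    print(count_sampled(sampler, events), "of", len(events), "events recorded")
```

With 20 ms active, 80 ms idle, and one event per millisecond, the simulation records 40 of 200 events, which matches the 20% duty cycle. The state only advances when an event arrives, so window boundaries are approximate when events are sparse.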

Impact on Codebase:

  • The BPF implementation is significantly altered, but the external interface remains unchanged.
  • The code generated by ChatGPT may require further review.

Suggestions for Improvement:

  • It would be beneficial to include more detailed comments or documentation explaining the reasoning behind the changes and how they improve the BPF implementation.
  • Consider adding tests to verify the correctness of the new time window-based sampling approach.
  • Review the code generated by ChatGPT to ensure it meets the project's coding standards and best practices.

github-actions[bot], Aug 22 '24 01:08

converting to draft, pending test results.

rootfs, Aug 22 '24 15:08

Test results:

Below is a comparison of two Kepler instances: one with the sampling window enabled (100 ms active, 1000 ms idle) and one without sampling.

On bare metal, the two instances produce very close values for package power and core power, because the ratio of BPF CPU time with sampling is very close to the ratio without sampling.
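
To illustrate why matching ratios are enough here, the sketch below assumes that measured power is split among processes in proportion to each process's share of CPU time. The function and the numbers are hypothetical, chosen only to show the arithmetic; they are not measurements from this test.

```python
def attribute_power(node_joules, cpu_time_by_process):
    """Split node_joules among processes by their share of total CPU time."""
    total = sum(cpu_time_by_process.values())
    if total == 0:
        return {name: 0.0 for name in cpu_time_by_process}
    return {
        name: node_joules * cpu_time / total
        for name, cpu_time in cpu_time_by_process.items()
    }


if __name__ == "__main__":
    # Hypothetical CPU times (ms): every event counted vs. a sampled subset.
    exhaustive = {"a": 600, "b": 300, "c": 100}
    sampled = {"a": 60, "b": 30, "c": 10}  # same proportions, smaller totals

    print(attribute_power(100.0, exhaustive))
    print(attribute_power(100.0, sampled))
```

Both calls return the same split because only the proportions enter the calculation. The absolute CPU times differ by a factor of ten, so any consumer that relies on absolute CPU time rather than shares would see a smaller value under sampling.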

[image: process CPU time, exhaustive vs. sampling]

[image: process core joules, exhaustive vs. sampling]

[image: process package joules, exhaustive vs. sampling]

[image: Kepler CPU time, exhaustive vs. sampling]

As expected, Kepler with sampling consumes less CPU time and fewer CPU instructions than Kepler without sampling.

vimalk78, Sep 11 '24 16:09

@dave-tucker @sthaha @marceloamaral PTAL, thanks

rootfs, Sep 16 '24 20:09

@rootfs @dave-tucker I’m concerned about the impact on VM power estimation. If we're undercounting CPU time, the power consumption will be underestimated as well.

To address this, we need to extrapolate the results, similar to how Linux handles counter multiplexing. For instance, if we collected data for only 1 second out of a 5-second window, we should multiply the results by 5 to estimate for the full 5 seconds.

All we need to do is track the collection interval and adjust the results accordingly.
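
A minimal sketch of that adjustment, in the spirit of how perf scales a multiplexed counter by time enabled over time running. The function name and parameters are illustrative, not an existing Kepler API.

```python
def extrapolate(sampled_value, collected_ms, window_ms):
    """Scale a value observed during collected_ms up to the full window_ms."""
    if collected_ms <= 0:
        raise ValueError("collected_ms must be positive")
    if window_ms < collected_ms:
        raise ValueError("window_ms must be at least collected_ms")
    return sampled_value * (window_ms / collected_ms)


if __name__ == "__main__":
    # 1 second collected out of a 5-second window: multiply by 5.
    print(extrapolate(200, collected_ms=1000, window_ms=5000))
    # 100 ms active + 1000 ms idle: the full window is 1100 ms, so multiply by 11.
    print(extrapolate(30, collected_ms=100, window_ms=1100))
```

This linear scaling assumes the CPU usage seen while collecting is representative of the whole window, which may not hold for bursty workloads.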

marceloamaral, Sep 17 '24 01:09

@marceloamaral good point! At the moment, the sampled CPU time is not extrapolated. We can consider different scaling factors. One approach I have in mind is to find the max and min CPU time from each sample, and use the mean CPU time to extrapolate over the entire active + idle duration. This would account for variable CPU utilization. If that proves effective, we can then discuss removing the EXPERIMENTAL prefix from these params. wdyt?
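
One possible reading of this plan, sketched under stated assumptions: each active window yields one CPU-time sample, the mean of those samples is scaled to the full active + idle cycle, and the min and max give a rough range. The function name, the input shape, and the choice of the arithmetic mean are all assumptions; the comment does not specify them.

```python
from statistics import mean


def extrapolate_with_bounds(window_cpu_ms, active_ms, idle_ms):
    """Scale per-active-window CPU times to one full active + idle cycle.

    window_cpu_ms: CPU time (ms) observed in each active window (hypothetical input).
    Returns low / estimate / high values based on the min / mean / max sample.
    """
    if not window_cpu_ms:
        raise ValueError("need at least one sample")
    if active_ms <= 0:
        raise ValueError("active_ms must be positive")
    scale = (active_ms + idle_ms) / active_ms
    return {
        "low": min(window_cpu_ms) * scale,
        "estimate": mean(window_cpu_ms) * scale,
        "high": max(window_cpu_ms) * scale,
    }


if __name__ == "__main__":
    # Hypothetical samples from three active windows of 100 ms, each followed by 400 ms idle.
    print(extrapolate_with_bounds([10, 20, 30], active_ms=100, idle_ms=400))
```

A wide gap between the low and high values would indicate that utilization varied a lot between windows, in which case a single scaling factor is less trustworthy.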

rootfs, Sep 17 '24 13:09