kepler icon indicating copy to clipboard operation
kepler copied to clipboard

RFC: Async

Open dave-tucker opened this issue 1 year ago • 2 comments

This PR contains some fixes to the way async/waitgroups were being used previously and allows for Kepler to be gracefully shutdown - this doesn't appear to have any measureable performance impact.

The last commit in this series is an experiment on asynchronously collecting the Process Metrics (from eBPF) instead of doing so synchronously (i.e from the Update method of the manager).

You can see the result of this on the collector update here: https://gist.github.com/dave-tucker/57f944a586e0d4463c5e19a586a585c7

I'll want to get a pprof of this and compare to Kepler on main also before we go further with this.

TL;DR The collector update takes on average 600ms less, and the variance (StdDev) reduces by 500ms.

I opened this PR to get some feedback as I'm quite keen to push forward in adapting the Kepler architecture to be more async. I'll open an issue and provide a diagram with what I'm thinking shortly...

dave-tucker avatar May 23 '24 10:05 dave-tucker

🤖 SeineSailor

Here's a concise summary of the pull request changes:

Summary: This pull request introduces significant changes to improve the performance and shutdown handling of Kepler. The key modifications include:

  1. Graceful shutdown: The main function now handles SIGINT and SIGTERM signals, allowing for a graceful shutdown of Kepler without performance impact.
  2. Async/WaitGroup changes: The use of async/waitgroups has been refactored to improve performance, resulting in a significant reduction in collector update time (600ms on average, 500ms in StdDev).
  3. Process Metrics collection: The CollectProcesses function has been replaced with the Start method, which collects metrics at regular intervals using a ticker and sends them to a channel.

Impact: These changes may affect the external interface and behavior of the code. The removal of the finalizing() function and addition of the rootHandler() function may require updates to dependent components.

Suggestions:

  • Verify that the new shutdown handling mechanism does not introduce any regressions in Kepler's functionality.
  • Review the updated Start method to ensure it correctly handles errors and edge cases.
  • Consider adding additional logging or monitoring to track the performance improvements and identify potential issues.

Overall, this pull request addresses significant performance and shutdown handling concerns, but requires careful review and testing to ensure its correctness and compatibility with the rest of the codebase.

github-actions[bot] avatar May 23 '24 10:05 github-actions[bot]

This PR contains some fixes to the way async/waitgroups were being used previously and allows for Kepler to be gracefully shutdown - this doesn't appear to have any measureable performance impact.

A nice library I have come across which abstracts this use-case is - https://github.com/oklog/run?tab=readme-ov-file .

See usages

  • In Thanos: https://github.com/thanos-io/thanos/blob/main/cmd/thanos/main.go#L96
  • In Prometheus: https://github.com/prometheus/prometheus/blob/8894d65cd6135635c4ac9cf100464e6aacd29593/cmd/prometheus/main.go#L954

sthaha avatar May 27 '24 02:05 sthaha