etcd-druid
[Feature] ☂️ Monitor compaction jobs running on shoot control planes
Feature (What you would like to be added): Druid runs in a different namespace than the shoot control plane, but the compaction jobs it triggers run in the shoot control plane, so it is not straightforward to collect the compaction job metrics and build a dashboard from them. Several Prometheus instances are involved, and each has to collect the metrics and forward them to the next. The compaction metrics need to be channelled so that they ultimately reach the Prometheus running in the shoot control plane. Only then can the dashboards running in shoot control planes consume them.
As Druid runs in the Garden namespace, the Cache Prometheus can collect the Druid controller metrics, i.e. the compaction metrics. The control plane Prometheus can then federate those metrics, along with the cAdvisor metrics for the compaction job. From the metrics scraped by the control plane Prometheus, we can filter out the shoot-specific compaction job metrics and show them in the dashboard for a particular shoot.
To further enhance the visualization of compaction metrics, we can also create a dashboard in the seed that shows aggregated compaction job performance.
In my first comment, I attached an image shared by @istvanballok and @rickardsjp to better understand the flow.
Motivation (Why is this needed?): Druid triggers a compaction job once the number of delta events in the control plane etcd crosses a certain threshold. A compaction job compacts the delta events that have accumulated in object storage and creates a full snapshot from them. These jobs can be heavy at times, so we need proper monitoring for the jobs running in each shoot control plane.
Approach/Hint to the implement solution (optional):
- [x] Collect the metrics for compaction in Druid #569
- [x] Expose the metrics through Druid deployment in gardener #8014
- [x] Let Cache Prometheus scrape compaction metrics #622
- [x] Let Control plane prometheus federate cache prometheus for compaction metrics #626
- [x] Create grafana dashboard for better visualizing the Compaction metrics #504
- [x] Short term improvements for ETCD Druid to make dashboard useful for operators #648
- [x] Fix a bug for compaction job not working with local storage #709
- [x] Create alerts based on some of the compaction metrics #603
- [ ] Enhancement: Dashboard for aggregated compaction jobs running in a seed based on Cache metrics.