Prometheus Exporter for OnCall
There is a proposal to build a Prometheus exporter for application monitoring. Monitoring such a critical service is also important when it is deployed locally.
@Matvey-Kuk We could try writing the Prometheus exporter together; I'm interested in the topic.
I like the idea of a Prometheus exporter for OnCall!
A few thoughts:
- It could be done as a standard Django endpoint in our web server. It's easy to expose metrics built from HTTP data: RPS, response codes, etc. (see the sketch after this list).
- The most critical part of OnCall infra is Celery. Is it possible to get metrics like these from our web server?
  - Amount of tasks in each queue (to catch a stuck Celery worker)
  - Amount of tasks executed per queue over time (to catch slow workers)
  - How long it took to execute each task (to catch specific long tasks)
  - Retried/succeeded tasks
- An additional question is how we make sure those endpoints are secure. Should that be handled at the Helm level?
- Maybe just incorporate a few existing exporters (RabbitMQ, Celery, MySQL)? Or is it better to have a unified one with specific docs about what to monitor?
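A minimal sketch of what the Django endpoint approach could look like, assuming the prometheus_client library, a Redis broker, and the queue names shown here (all of these are assumptions for illustration, not the actual OnCall implementation):

```python
# Hypothetical sketch: expose /metrics from the Django web server and report
# Celery queue lengths. App wiring, broker, and queue names are assumptions.
from django.http import HttpResponse
from django.urls import path
from prometheus_client import CollectorRegistry, Gauge, generate_latest, CONTENT_TYPE_LATEST
import redis

registry = CollectorRegistry()

# Amount of tasks waiting in each Celery queue (to catch a stuck worker).
celery_queue_length = Gauge(
    "oncall_celery_queue_length",
    "Number of tasks waiting in each Celery queue",
    ["queue"],
    registry=registry,
)

BROKER = redis.Redis(host="localhost", port=6379, db=0)  # assumed Redis broker
QUEUES = ["default", "critical"]  # assumed queue names


def metrics_view(request):
    # Refresh queue lengths on every scrape; with a Redis broker each queue is a list.
    for queue in QUEUES:
        celery_queue_length.labels(queue=queue).set(BROKER.llen(queue))
    return HttpResponse(generate_latest(registry), content_type=CONTENT_TYPE_LATEST)


urlpatterns = [
    path("metrics/", metrics_view),
]
```

Securing the endpoint (auth, network policy, or Helm-level restrictions) would still need to be decided separately.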
@Matvey-Kuk I think that for monitoring OnCall from an infra point of view, some docs on how to reproduce our production monitoring setup would be great! @shatovilya What kind of metrics do you want to have? Application-level ones, e.g. count of incidents processed and number of paged users, or infra ones like RPS, response classes, etc.?
For Celery-task-specific metrics we can add a Celery exporter such as mher/flower.
@Konstantinov-Innokentii It would be great to collect the following data:
Statistics:
- count of resolved alert groups;
- count of acknowledged alert groups;
- count of silenced alert groups;
- average time to resolve a problem (per integration);
- average time to detect a problem (per integration);
Critical statuses (sketched below):
- Integration status (OK, Error)
- General application status (Green, Yellow, Red)
Info:
- OnCall application uptime
- OnCall version/build
As a result it becomes possible to:
- analyze statistics on how effectively users use the application;
- create alerts on changes in the application status.
For Celery and other components (RabbitMQ, MySQL) there are ready-made Prometheus exporters; we could link to them in the OnCall documentation.
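If the status and info items above were exposed with prometheus_client, they could look roughly like the sketch below. Metric names, states, and the example values are assumptions for illustration, not the actual OnCall metrics:

```python
# Hypothetical sketch of the "critical statuses" and "info" metrics with
# prometheus_client; metric names, states, and values are assumptions.
from prometheus_client import Enum, Gauge, Info

# Integration status (ok / error), one time series per integration.
integration_status = Enum(
    "oncall_integration_status",
    "Status of each OnCall integration",
    ["integration"],
    states=["ok", "error"],
)

# General application status (green / yellow / red).
application_status = Enum(
    "oncall_application_status",
    "Overall OnCall application status",
    states=["green", "yellow", "red"],
)

# OnCall uptime and build information.
uptime_seconds = Gauge("oncall_uptime_seconds", "OnCall application uptime in seconds")
build_info = Info("oncall_build", "OnCall version and build information")

# Example values; a real exporter would refresh these periodically.
integration_status.labels(integration="Grafana Alerting").state("ok")
application_status.state("green")
uptime_seconds.set(86400)
build_info.info({"version": "example-version", "commit": "example-commit"})
```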
We added this task to the core team's backlog.
Proposal: show the number of new incidents on a timeline graph.
Something along these lines; it would be useful for identifying periods with peak numbers of new incidents.
@Konstantinov-Innokentii
- count of new alert groups
@raphael-batte New incident numbers on a timeline graph - good idea!
@Ferril please take this one.
- Specify the list of metrics and labels to export. Please focus only on product metrics, not system metrics (amount of alerts = yes, Celery tasks = no).
- Finalize the list of metrics and verify it with @Matvey-Kuk or @iskhakov.
- Consult with @Konstantinov-Innokentii on how to deploy it in the cloud.
- Build useful dashboards for users.
- Write docs on how to use it in Cloud & Self-Hosted.
This is an example of a response from the /metrics endpoint with metrics for alert groups. It contains time buckets for alert group response time and the number of alert groups in each state ('new', 'silenced', 'acknowledged', 'resolved') for every integration in every team:
# HELP oncall_alert_groups_response_time_seconds Alert groups in respond time (seconds)
# TYPE oncall_alert_groups_response_time_seconds histogram
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana 👻🍺",le="60.0",team="SchneckenHaus"} 3.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana 👻🍺",le="300.0",team="SchneckenHaus"} 3.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana 👻🍺",le="600.0",team="SchneckenHaus"} 4.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana 👻🍺",le="3600.0",team="SchneckenHaus"} 5.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana 👻🍺",le="+Inf",team="SchneckenHaus"} 5.0
oncall_alert_groups_response_time_seconds_count{integration="Grafana 👻🍺",team="SchneckenHaus"} 5.0
oncall_alert_groups_response_time_seconds_sum{integration="Grafana 👻🍺",team="SchneckenHaus"} 3823.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana Alerting 👻🔥",le="60.0",team="General"} 4.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana Alerting 👻🔥",le="300.0",team="General"} 5.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana Alerting 👻🔥",le="600.0",team="General"} 5.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana Alerting 👻🔥",le="3600.0",team="General"} 5.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana Alerting 👻🔥",le="+Inf",team="General"} 6.0
oncall_alert_groups_response_time_seconds_count{integration="Grafana Alerting 👻🔥",team="General"} 6.0
oncall_alert_groups_response_time_seconds_sum{integration="Grafana Alerting 👻🔥",team="General"} 89421.0
# HELP oncall_alert_groups_response_time_seconds_created Alert groups in respond time (seconds)
# TYPE oncall_alert_groups_response_time_seconds_created gauge
oncall_alert_groups_response_time_seconds_created{integration="Grafana 👻🍺",team="SchneckenHaus"} 1.676019096859997e+09
oncall_alert_groups_response_time_seconds_created{integration="Grafana Alerting 👻🔥",team="General"} 1.676019096868854e+09
# HELP oncall_alert_groups_total All alert groups
# TYPE oncall_alert_groups_total gauge
oncall_alert_groups_total{state="new",integration="Manual incidents (General team)",team="General"} 1
oncall_alert_groups_total{state="silenced",integration="Manual incidents (General team)",team="General"} 0
oncall_alert_groups_total{state="acknowledged",integration="Manual incidents (General team)",team="General"} 0
oncall_alert_groups_total{state="resolved",integration="Manual incidents (General team)",team="General"} 1
oncall_alert_groups_total{state="new",integration="Webhook ❤️",team="General"} 2
oncall_alert_groups_total{state="silenced",integration="Webhook ❤️",team="General"} 0
oncall_alert_groups_total{state="acknowledged",integration="Webhook ❤️",team="General"} 0
oncall_alert_groups_total{state="resolved",integration="Webhook ❤️",team="General"} 2
oncall_alert_groups_total{state="new",integration="Grafana 👻🍺",team="SchneckenHaus"} 0
oncall_alert_groups_total{state="silenced",integration="Grafana 👻🍺",team="SchneckenHaus"} 0
oncall_alert_groups_total{state="acknowledged",integration="Grafana 👻🍺",team="SchneckenHaus"} 2
oncall_alert_groups_total{state="resolved",integration="Grafana 👻🍺",team="SchneckenHaus"} 3
oncall_alert_groups_total{state="new",integration="Grafana Alerting 👻🔥",team="General"} 5
oncall_alert_groups_total{state="silenced",integration="Grafana Alerting 👻🔥",team="General"} 0
oncall_alert_groups_total{state="acknowledged",integration="Grafana Alerting 👻🔥",team="General"} 2
oncall_alert_groups_total{state="resolved",integration="Grafana Alerting 👻🔥",team="General"} 4
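For illustration, output shaped like the example above could be produced with a prometheus_client custom collector along these lines. This is a hedged sketch, not the code from the actual PR; get_alert_group_stats() is a hypothetical helper standing in for the real database queries:

```python
# Sketch of a custom collector emitting metric families shaped like the example
# above; get_alert_group_stats() is a hypothetical stand-in for real DB queries.
from prometheus_client.core import GaugeMetricFamily, HistogramMetricFamily, REGISTRY


def get_alert_group_stats():
    # Hypothetical data source; real values would come from OnCall's database.
    return [
        {
            "integration": "Grafana Alerting",
            "team": "General",
            "response_time_buckets": [("60.0", 4), ("300.0", 5), ("600.0", 5), ("3600.0", 5), ("+Inf", 6)],
            "response_time_sum": 89421.0,
            "states": {"new": 5, "silenced": 0, "acknowledged": 2, "resolved": 4},
        },
    ]


class AlertGroupCollector:
    def collect(self):
        response_time = HistogramMetricFamily(
            "oncall_alert_groups_response_time_seconds",
            "Alert groups response time (seconds)",
            labels=["integration", "team"],
        )
        totals = GaugeMetricFamily(
            "oncall_alert_groups_total",
            "All alert groups",
            labels=["state", "integration", "team"],
        )
        for row in get_alert_group_stats():
            # Cumulative buckets plus sum; _count is derived from the +Inf bucket.
            response_time.add_metric(
                [row["integration"], row["team"]],
                buckets=row["response_time_buckets"],
                sum_value=row["response_time_sum"],
            )
            for state, count in row["states"].items():
                totals.add_metric([state, row["integration"], row["team"]], count)
        yield response_time
        yield totals


REGISTRY.register(AlertGroupCollector())
```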
Hey there! 👋
Any updates on this thing, guys?
Self-answer: https://github.com/grafana/oncall/pull/1605
Closing this issue as completed. We're going to add more metrics and will open individual issues for those :)