
Prometheus Exporter for OnCall

shatovilya opened this issue 2 years ago • 9 comments

There is a proposal to build a Prometheus exporter for application monitoring. Monitoring such a critical service is especially important when it is deployed locally.

shatovilya avatar Jun 20 '22 12:06 shatovilya

@Matvey-Kuk We could try to write the Prometheus exporter together; I'm interested in the topic.

shatovilya avatar Jun 20 '22 12:06 shatovilya

I like the idea of a Prometheus exporter for OnCall!

A few thoughts:

  • It could be done as a standard Django endpoint in our web server. It's easy to produce metrics from HTTP data: RPS, response codes, etc. (see the sketch after this list).
  • The most critical part of the OnCall infra is Celery. Is it possible to get metrics like these from our web server:
    • Number of tasks in each queue (to catch a stuck Celery worker)
    • Number of tasks executed per queue over time (to catch slow workers)
    • How long it took to execute each task (to catch specific long tasks)
    • Retried/succeeded tasks
  • An additional question is how we make sure those endpoints are secure. Should that be handled at the Helm level?
  • Maybe just incorporate a few existing exporters (RabbitMQ, Celery, MySQL)? Or is it better to have a unified one with specific docs about what to monitor?
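
A minimal sketch of the Django endpoint idea, assuming the official prometheus_client package; the view, the URL wiring, and the example metric are illustrative, not the actual implementation:

# views.py (illustrative only; assumes the prometheus_client package is installed)
from django.http import HttpResponse
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

# Example product metric; the name and labels are hypothetical
ALERT_GROUPS_CREATED = Counter(
    "oncall_alert_groups_created_total",
    "Alert groups created",
    labelnames=["integration", "team"],
)

def metrics_view(request):
    # Render all metrics from the default registry in the Prometheus text format
    return HttpResponse(generate_latest(), content_type=CONTENT_TYPE_LATEST)

# urls.py (hypothetical wiring)
# urlpatterns = [path("metrics/", metrics_view)]

For the HTTP-level numbers (RPS, response codes), a middleware library like django-prometheus could likely cover that part out of the box.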

Matvey-Kuk avatar Jun 20 '22 13:06 Matvey-Kuk

@Matvey-Kuk I think that for monitoring OnCall from an infra point of view, some docs on how to reproduce our prod monitoring setup would be great! @shatovilya What kind of metrics do you want to have? Application-level ones, e.g. the count of incidents processed and the number of paged users, or infra ones, like RPS, response classes, etc.?

Konstantinov-Innokentii avatar Jun 20 '22 18:06 Konstantinov-Innokentii

For Celery task-specific metrics we can add a Celery exporter such as mher/flower.

iskhakov avatar Jun 22 '22 08:06 iskhakov

@Konstantinov-Innokentii It would be great to collect the following data:

Statistics:

  • count of resolved alert groups;
  • count of acknowledged alert groups;
  • count of silenced alert groups;
  • average problem resolution time (for each integration);
  • average problem detection time (for each integration);

Critical statuses:

  • Integration status (OK, Error)
  • General application status (Green, Yellow, Red)

Info:

  • OnCall application uptime
  • OnCall version build

As a result it becomes possible to:

  • analyze statistics on how effectively users use the application;
  • create alerts on changes in the application's status.

For Celery and other components (RabbitMQ, MySQL) there are, in principle, ready-made Prometheus exporters; the OnCall documentation could link to them.
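
For the critical statuses and info above, here is a minimal sketch of how they might be modeled with prometheus_client; the metric names, labels, and states are illustrative, not a final design:

from prometheus_client import Enum, Info

# Per-integration health (OK / Error); the label set is hypothetical
INTEGRATION_STATUS = Enum(
    "oncall_integration_status",
    "Integration status",
    labelnames=["integration"],
    states=["ok", "error"],
)

# General application status (Green / Yellow / Red)
APP_STATUS = Enum(
    "oncall_application_status",
    "General application status",
    states=["green", "yellow", "red"],
)

# Version build exposed as labels on a constant info metric
BUILD_INFO = Info("oncall_build", "OnCall version build")

# Example updates (values are made up)
INTEGRATION_STATUS.labels(integration="Grafana Alerting").state("ok")
APP_STATUS.state("green")
BUILD_INFO.info({"version": "1.2.3"})

Uptime may not need a dedicated metric: if the default process collector is enabled, process_start_time_seconds is already exposed (on Linux) and uptime can be derived from it in a query.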

shatovilya avatar Jun 24 '22 14:06 shatovilya

We added this task to the core team's backlog.

Matvey-Kuk avatar Jul 06 '22 06:07 Matvey-Kuk

Proposal: show the number of new incidents on a timeline graph.

Something like this, logic-wise. It will be useful for identifying periods with peak numbers of new incidents.

Image

raphael-batte avatar Jul 08 '22 12:07 raphael-batte

@Konstantinov-Innokentii

  • count of new alert groups

raphael-batte avatar Jul 11 '22 08:07 raphael-batte

@raphael-batte new incident numbers on timeline graph - good idea!

shatovilya avatar Jul 19 '22 15:07 shatovilya

@Ferril please take this one.

  1. Specify the list of metrics and labels to export. Please focus only on product metrics, not system metrics (number of alerts = yes, Celery tasks = no).
  2. Finalize the list of metrics and verify it with @Matvey-Kuk or @iskhakov.
  3. Consult with @Konstantinov-Innokentii on how to deploy it in the cloud.
  4. Build useful dashboards for users.
  5. Write docs on how to use it in Cloud & Self-Hosted.

Matvey-Kuk avatar Jan 19 '23 09:01 Matvey-Kuk

This is an example of the response from the /metrics endpoint with metrics for alert groups. It includes time buckets for alert-group response time and the number of alert groups in each state ('new', 'acknowledged', 'resolved', 'silenced') for every integration in every team.

# HELP oncall_alert_groups_response_time_seconds Alert groups in respond time (seconds)
# TYPE oncall_alert_groups_response_time_seconds histogram
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana 👻🍺",le="60.0",team="SchneckenHaus"} 3.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana 👻🍺",le="300.0",team="SchneckenHaus"} 3.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana 👻🍺",le="600.0",team="SchneckenHaus"} 4.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana 👻🍺",le="3600.0",team="SchneckenHaus"} 5.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana 👻🍺",le="+Inf",team="SchneckenHaus"} 5.0
oncall_alert_groups_response_time_seconds_count{integration="Grafana 👻🍺",team="SchneckenHaus"} 5.0
oncall_alert_groups_response_time_seconds_sum{integration="Grafana 👻🍺",team="SchneckenHaus"} 3823.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana Alerting 👻🔥",le="60.0",team="General"} 4.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana Alerting 👻🔥",le="300.0",team="General"} 5.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana Alerting 👻🔥",le="600.0",team="General"} 5.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana Alerting 👻🔥",le="3600.0",team="General"} 5.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana Alerting 👻🔥",le="+Inf",team="General"} 6.0
oncall_alert_groups_response_time_seconds_count{integration="Grafana Alerting 👻🔥",team="General"} 6.0
oncall_alert_groups_response_time_seconds_sum{integration="Grafana Alerting 👻🔥",team="General"} 89421.0
# HELP oncall_alert_groups_response_time_seconds_created Alert groups in respond time (seconds)
# TYPE oncall_alert_groups_response_time_seconds_created gauge
oncall_alert_groups_response_time_seconds_created{integration="Grafana 👻🍺",team="SchneckenHaus"} 1.676019096859997e+09
oncall_alert_groups_response_time_seconds_created{integration="Grafana Alerting 👻🔥",team="General"} 1.676019096868854e+09
# HELP oncall_alert_groups_total All alert groups
# TYPE oncall_alert_groups_total gauge
oncall_alert_groups_total{state="new",integration="Manual incidents (General team)",team="General"} 1
oncall_alert_groups_total{state="silenced",integration="Manual incidents (General team)",team="General"} 0
oncall_alert_groups_total{state="acknowledged",integration="Manual incidents (General team)",team="General"} 0
oncall_alert_groups_total{state="resolved",integration="Manual incidents (General team)",team="General"} 1
oncall_alert_groups_total{state="new",integration="Webhook ❤️",team="General"} 2
oncall_alert_groups_total{state="silenced",integration="Webhook ❤️",team="General"} 0
oncall_alert_groups_total{state="acknowledged",integration="Webhook ❤️",team="General"} 0
oncall_alert_groups_total{state="resolved",integration="Webhook ❤️",team="General"} 2
oncall_alert_groups_total{state="new",integration="Grafana 👻🍺",team="SchneckenHaus"} 0
oncall_alert_groups_total{state="silenced",integration="Grafana 👻🍺",team="SchneckenHaus"} 0
oncall_alert_groups_total{state="acknowledged",integration="Grafana 👻🍺",team="SchneckenHaus"} 2
oncall_alert_groups_total{state="resolved",integration="Grafana 👻🍺",team="SchneckenHaus"} 3
oncall_alert_groups_total{state="new",integration="Grafana Alerting 👻🔥",team="General"} 5
oncall_alert_groups_total{state="silenced",integration="Grafana Alerting 👻🔥",team="General"} 0
oncall_alert_groups_total{state="acknowledged",integration="Grafana Alerting 👻🔥",team="General"} 2
oncall_alert_groups_total{state="resolved",integration="Grafana Alerting 👻🔥",team="General"} 4

Ferril avatar Feb 09 '23 09:02 Ferril

Hey there! 👋

Any updates on this thing, guys?

atkrv avatar Apr 14 '23 23:04 atkrv

Self-answer: https://github.com/grafana/oncall/pull/1605

atkrv avatar Apr 26 '23 08:04 atkrv

Closing this issue as completed. We're going to add more metrics and will open individual issues for them :)

Matvey-Kuk avatar Jun 26 '23 12:06 Matvey-Kuk