
Prometheus Exporter for OnCall

shatovilya opened this issue 2 years ago • 9 comments

There is a proposal to build a Prometheus exporter for application monitoring. Monitoring such a critical service is especially important when it is deployed locally.

shatovilya avatar Jun 20 '22 12:06 shatovilya

@Matvey-Kuk We could try to write the Prometheus exporter together; I'm interested in the topic.

shatovilya avatar Jun 20 '22 12:06 shatovilya

I like the idea of a Prometheus exporter for OnCall!

A few thoughts:

  • It could be done as a standard Django endpoint in our web server. It's easy to produce metrics from HTTP data: RPS, response codes, etc. (see the sketch after this list).
  • The most critical part of the OnCall infra is Celery. Is it possible to get metrics like these from our web server:
    • Number of tasks in each queue (to catch a stuck Celery worker)
    • Number of tasks executed per queue over time (to catch slow workers)
    • How long it took to execute each task (to catch specific long tasks)
    • Retried/succeeded tasks
  • An additional question is how we make sure those endpoints are secure. Should that be handled at the Helm level?
  • Maybe just incorporate a few existing exporters (RabbitMQ, Celery, MySQL)? Or is it better to have a unified one with specific docs about what to monitor?
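
A minimal sketch of the Django endpoint idea, assuming the official prometheus_client package; the view, the URL wiring, and the example metric are illustrative, not the actual implementation:

# views.py (illustrative only; assumes the prometheus_client package is installed)
from django.http import HttpResponse
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

# Example product metric; the name and labels are hypothetical
ALERT_GROUPS_CREATED = Counter(
    "oncall_alert_groups_created_total",
    "Alert groups created",
    labelnames=["integration", "team"],
)

def metrics_view(request):
    # Render all metrics from the default registry in the Prometheus text format
    return HttpResponse(generate_latest(), content_type=CONTENT_TYPE_LATEST)

# urls.py (hypothetical wiring)
# urlpatterns = [path("metrics/", metrics_view)]

For the HTTP-level numbers (RPS, response codes), a middleware library like django-prometheus could likely cover that part out of the box.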

Matvey-Kuk avatar Jun 20 '22 13:06 Matvey-Kuk

@Matvey-Kuk I think that for monitoring OnCall from an infra point of view, some docs on how to reproduce our prod monitoring setup would be great! @shatovilya What kind of metrics do you want to have? Application-level ones, e.g. the count of incidents processed and the number of paged users, or infra ones, like RPS, response classes, etc.?

Konstantinov-Innokentii avatar Jun 20 '22 18:06 Konstantinov-Innokentii

For Celery task-specific metrics we can add a Celery exporter such as mher/flower.

iskhakov avatar Jun 22 '22 08:06 iskhakov

@Konstantinov-Innokentii It would be great to collect the following data:

Statistics:

  • count of resolved alert groups;
  • count of acknowledged alert groups;
  • count of silenced alert groups;
  • average problem resolution time (for each integration);
  • average problem detection time (for each integration);

Critical statuses:

  • Integration status (OK, Error)
  • General application status (Green, Yellow, Red)

Info:

  • OnCall application uptime
  • OnCall version build

As a result it becomes possible to:

  • analyze statistics on how effectively users use the application;
  • create alerts on changes in the application's status.

For Celery and other components (RabbitMQ, MySQL) there are, in principle, ready-made Prometheus exporters; the OnCall documentation could link to them.
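
For the critical statuses and info above, here is a minimal sketch of how they might be modeled with prometheus_client; the metric names, labels, and states are illustrative, not a final design:

from prometheus_client import Enum, Info

# Per-integration health (OK / Error); the label set is hypothetical
INTEGRATION_STATUS = Enum(
    "oncall_integration_status",
    "Integration status",
    labelnames=["integration"],
    states=["ok", "error"],
)

# General application status (Green / Yellow / Red)
APP_STATUS = Enum(
    "oncall_application_status",
    "General application status",
    states=["green", "yellow", "red"],
)

# Version build exposed as labels on a constant info metric
BUILD_INFO = Info("oncall_build", "OnCall version build")

# Example updates (values are made up)
INTEGRATION_STATUS.labels(integration="Grafana Alerting").state("ok")
APP_STATUS.state("green")
BUILD_INFO.info({"version": "1.2.3"})

Uptime may not need a dedicated metric: if the default process collector is enabled, process_start_time_seconds is already exposed (on Linux) and uptime can be derived from it in a query.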

shatovilya avatar Jun 24 '22 14:06 shatovilya

We added this task to the core team's backlog.

Matvey-Kuk avatar Jul 06 '22 06:07 Matvey-Kuk

Proposal: show the number of new incidents on a timeline graph.

Something like this, logic-wise. It will be useful for identifying periods with peak numbers of new incidents.

Image

raphael-batte avatar Jul 08 '22 12:07 raphael-batte

@Konstantinov-Innokentii

  • count of new alert groups

raphael-batte avatar Jul 11 '22 08:07 raphael-batte

@raphael-batte new incident numbers on timeline graph - good idea!

shatovilya avatar Jul 19 '22 15:07 shatovilya

@Ferril please take this one.

  1. Specify the list of metrics and labels to export. Please focus only on product metrics, not system metrics (number of alerts = yes, Celery tasks = no).
  2. Finalize the list of metrics and verify it with @Matvey-Kuk or @iskhakov.
  3. Consult with @Konstantinov-Innokentii on how to deploy it in the cloud.
  4. Build useful dashboards for users.
  5. Write docs on how to use it in Cloud & Self-Hosted.

Matvey-Kuk avatar Jan 19 '23 09:01 Matvey-Kuk

This is an example of the response from the /metrics endpoint with metrics for alert groups. It includes time buckets for alert-group response time and the number of alert groups in each state ('new', 'acknowledged', 'resolved', 'silenced') for every integration in every team.

# HELP oncall_alert_groups_response_time_seconds Alert groups in respond time (seconds)
# TYPE oncall_alert_groups_response_time_seconds histogram
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana 👻🍺",le="60.0",team="SchneckenHaus"} 3.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana 👻🍺",le="300.0",team="SchneckenHaus"} 3.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana 👻🍺",le="600.0",team="SchneckenHaus"} 4.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana 👻🍺",le="3600.0",team="SchneckenHaus"} 5.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana 👻🍺",le="+Inf",team="SchneckenHaus"} 5.0
oncall_alert_groups_response_time_seconds_count{integration="Grafana 👻🍺",team="SchneckenHaus"} 5.0
oncall_alert_groups_response_time_seconds_sum{integration="Grafana 👻🍺",team="SchneckenHaus"} 3823.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana Alerting 👻🔥",le="60.0",team="General"} 4.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana Alerting 👻🔥",le="300.0",team="General"} 5.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana Alerting 👻🔥",le="600.0",team="General"} 5.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana Alerting 👻🔥",le="3600.0",team="General"} 5.0
oncall_alert_groups_response_time_seconds_bucket{integration="Grafana Alerting 👻🔥",le="+Inf",team="General"} 6.0
oncall_alert_groups_response_time_seconds_count{integration="Grafana Alerting 👻🔥",team="General"} 6.0
oncall_alert_groups_response_time_seconds_sum{integration="Grafana Alerting 👻🔥",team="General"} 89421.0
# HELP oncall_alert_groups_response_time_seconds_created Alert groups in respond time (seconds)
# TYPE oncall_alert_groups_response_time_seconds_created gauge
oncall_alert_groups_response_time_seconds_created{integration="Grafana 👻🍺",team="SchneckenHaus"} 1.676019096859997e+09
oncall_alert_groups_response_time_seconds_created{integration="Grafana Alerting 👻🔥",team="General"} 1.676019096868854e+09
# HELP oncall_alert_groups_total All alert groups
# TYPE oncall_alert_groups_total gauge
oncall_alert_groups_total{state="new",integration="Manual incidents (General team)",team="General"} 1
oncall_alert_groups_total{state="silenced",integration="Manual incidents (General team)",team="General"} 0
oncall_alert_groups_total{state="acknowledged",integration="Manual incidents (General team)",team="General"} 0
oncall_alert_groups_total{state="resolved",integration="Manual incidents (General team)",team="General"} 1
oncall_alert_groups_total{state="new",integration="Webhook ❤️",team="General"} 2
oncall_alert_groups_total{state="silenced",integration="Webhook ❤️",team="General"} 0
oncall_alert_groups_total{state="acknowledged",integration="Webhook ❤️",team="General"} 0
oncall_alert_groups_total{state="resolved",integration="Webhook ❤️",team="General"} 2
oncall_alert_groups_total{state="new",integration="Grafana 👻🍺",team="SchneckenHaus"} 0
oncall_alert_groups_total{state="silenced",integration="Grafana 👻🍺",team="SchneckenHaus"} 0
oncall_alert_groups_total{state="acknowledged",integration="Grafana 👻🍺",team="SchneckenHaus"} 2
oncall_alert_groups_total{state="resolved",integration="Grafana 👻🍺",team="SchneckenHaus"} 3
oncall_alert_groups_total{state="new",integration="Grafana Alerting 👻🔥",team="General"} 5
oncall_alert_groups_total{state="silenced",integration="Grafana Alerting 👻🔥",team="General"} 0
oncall_alert_groups_total{state="acknowledged",integration="Grafana Alerting 👻🔥",team="General"} 2
oncall_alert_groups_total{state="resolved",integration="Grafana Alerting 👻🔥",team="General"} 4

Ferril avatar Feb 09 '23 09:02 Ferril

Hey there! 👋

Any updates on this thing, guys?

atkrv avatar Apr 14 '23 23:04 atkrv

Self-answer: https://github.com/grafana/oncall/pull/1605

atkrv avatar Apr 26 '23 08:04 atkrv

Closing this issue as completed. We're going to add more metrics and will open individual issues for them :)

Matvey-Kuk avatar Jun 26 '23 12:06 Matvey-Kuk