govuk-developer-docs

Document monitoring, metrics, tracing, observability and alerting

Open AgaDufrat opened this issue 5 months ago • 2 comments

As an engineer on GOV.UK, I want to know how to configure monitoring, metrics, tracing, observability and alerting for my applications, so that we can detect and resolve issues proactively, ensure optimal performance, improve reliability through real-time insight into application health and behaviour, and inform product decisions.

Current documentation

Logging

  • How logging works on GOV.UK
  • Request tracing

Monitoring

  • Debug underperforming search
  • How we handle errors
  • Pingdom
  • Sentry

Alerting

  • Pingdom
  • Bouncer canary check
  • Router error ratio too high
  • Travel Advice or Drug and Medical Device email alerts not sent
  • Signon API user token expires soon
  • PagerDuty
  • Things that may contact on-call - I suggest the specifics get taken out of here and instead link to the relevant pages

[WIP] Missing documentation

  • Grafana
  • App metrics, Prometheus
  • Configuring Alertmanager alerts

[WIP] Documentation that could do with a refresh

The PagerDuty alerts section, the Alertmanager alerts section and the Monitoring section. If we need a section at all, perhaps we could consolidate all of this and more under a new "Monitoring and alerting" section?

AgaDufrat, Sep 12 '24 13:09