Recommended set of alerts for production usage

Open • bobmshannon opened this issue 4 years ago • 1 comment

As a user, it would be nice if the documentation included guidance on a baseline set of alerts for monitoring Lemur, based on the telemetry it emits by default. This could cover whether Lemur is running, whether any certificates are expiring soon, and, perhaps more importantly, whether any periodic certificate rotation tasks have failed. I realize such alerts are defined differently depending on the monitoring system in use, so even a high-level description of each area to monitor, with references to the relevant metrics, would be useful to have.

Maybe others can chime in on what kind of things they're monitoring when using Lemur in production as well.

bobmshannon, Nov 22 '21 00:11

That is an excellent suggestion. Some things off the top of my head:

  • high rate of 4xx
  • high rate of 5xx
  • high response time
  • general Celery tasks running
  • individual Celery tasks running
  • CA API issues
  • expiring deployed cert detected
  • Lemur RDS high CPU
  • cert reissue failure (for instance, the domain is no longer valid, or some other issue)
  • cert rotation failure (e.g., issues with access to the endpoint)
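
As a rough illustration of how a few of these checks might be expressed, here is a minimal Python sketch. The `Snapshot` fields, the `evaluate` function, and every threshold are hypothetical: they are not Lemur metric names or recommended values, and how the numbers are collected depends on the monitoring system in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Snapshot:
    """Hypothetical values gathered over one evaluation window."""
    requests: int
    responses_4xx: int
    responses_5xx: int
    p95_latency_seconds: float
    last_celery_task_success: datetime
    earliest_cert_expiry: datetime


def evaluate(
    snapshot,
    now,
    max_4xx_ratio=0.05,                     # assumed threshold
    max_5xx_ratio=0.01,                     # assumed threshold
    max_p95_seconds=2.0,                    # assumed threshold
    max_task_age=timedelta(hours=2),        # assumed threshold
    min_cert_validity=timedelta(days=14),   # assumed threshold
):
    """Return the names of the alerts that should fire for this snapshot."""
    alerts = []
    # Ratios are only meaningful when there was traffic in the window.
    if snapshot.requests > 0:
        if snapshot.responses_4xx / snapshot.requests > max_4xx_ratio:
            alerts.append("high rate of 4xx")
        if snapshot.responses_5xx / snapshot.requests > max_5xx_ratio:
            alerts.append("high rate of 5xx")
    if snapshot.p95_latency_seconds > max_p95_seconds:
        alerts.append("high response time")
    # A stale "last success" timestamp suggests periodic tasks have stopped.
    if now - snapshot.last_celery_task_success > max_task_age:
        alerts.append("Celery tasks not running")
    if snapshot.earliest_cert_expiry - now < min_cert_validity:
        alerts.append("expiring deployed cert detected")
    return alerts


now = datetime(2021, 11, 24, 17, 0, tzinfo=timezone.utc)
healthy = Snapshot(1000, 10, 1, 0.3,
                   now - timedelta(minutes=5), now + timedelta(days=60))
degraded = Snapshot(1000, 100, 50, 3.5,
                    now - timedelta(hours=6), now + timedelta(days=3))
print(evaluate(healthy, now))
print(evaluate(degraded, now))
```

The items that depend on external systems (CA API issues, RDS CPU, reissue and rotation failures) are left out here, since they would need data this sketch does not model.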

hosssha, Nov 24 '21 17:11