notification-api

Refactor alerts to distinguish between infrastructure failure, capacity, or service limits

Open mohdnr opened this issue 3 years ago • 5 comments

Candidates for refactoring:

  • [ ] logs-10-celery-error-1-minute-critical: This alert currently matches the filter pattern `?"ERROR/Worker" ?"ERROR/ForkPoolWorker" ?"WorkerLostError"` in the CloudWatch eks-cluster/application logs, which is too generic. Identify a way to distinguish intentionally thrown errors (such as message limits) from legitimate Celery failures.
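
One possible direction, as a rough sketch only: classify each matching log line before it counts toward the alert. The three error patterns come from the current filter; the `INTENTIONAL_MARKERS` values and the `classify` helper are hypothetical placeholders, and the real exception names would have to be taken from the application code.

```python
import re

# Patterns taken from the existing alert filter.
GENERIC_ERROR = re.compile(r"ERROR/(Worker|ForkPoolWorker)|WorkerLostError")

# Hypothetical markers for intentionally raised limit errors. These are
# assumptions for illustration, not names confirmed by the codebase.
INTENTIONAL_MARKERS = ("TooManyRequestsError", "message limit")


def classify(line: str) -> str:
    """Return 'ignored', 'service_limit', or 'celery_failure' for a log line."""
    if not GENERIC_ERROR.search(line):
        # The current filter would not have matched this line at all.
        return "ignored"
    lowered = line.lower()
    if any(marker.lower() in lowered for marker in INTENTIONAL_MARKERS):
        # Matched the generic filter, but looks like an expected limit error.
        return "service_limit"
    return "celery_failure"
```

With something like this, only lines classified as `celery_failure` would feed the critical alert, while `service_limit` lines could go to a separate, lower-severity metric.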

mohdnr, Feb 08 '22 15:02