Add Monitors in Datadog for PagerDuty
Acceptance Criteria:
Add monitors for the following (this list is WIP):
- [x] High number of 500s
- [ ] High number of instances (indicates autoscaling is working, but consuming too many resources)
- [x] High latency on page load (indicates overall site performance degradation)
- [ ] High number of jobs enqueued in Redis (indicates Celery workers aren't keeping up with demand)
- [x] Synthetic page-load tests failing (canary uptime test)
- [ ] Ensure read replica DBs are also monitored
- [x] High memory on DBs
- [x] High CPU on DBs (DB load is a known bottleneck; may need to find and kill long-running queries)
- [x] High latency on DB queries (indicates inefficient queries or high DB load)
- [x] High CPU on cluster (indicates autoscaling is lagging behind demand)
XD Links:
Tech Details:
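As a starting point for the checklist items, here is a minimal sketch of how one monitor definition (the "High CPU on DBs" item) could be expressed as a request body for Datadog's create-monitor endpoint (`POST /api/v1/monitor`). Everything specific in it is an assumption, not something this ticket decides: the metric `aws.rds.cpuutilization` assumes the databases are on AWS RDS with the AWS integration enabled, the thresholds of 80 and 90 are placeholders, and `example-service` is a hypothetical PagerDuty service name. The helper `build_metric_monitor` is illustrative only; it builds the payload and does not call the API.

```python
import json


def build_metric_monitor(name, metric_query, critical, pagerduty_service, warning=None):
    """Build a metric-alert monitor body; hypothetical helper, makes no API call."""
    thresholds = {"critical": critical}
    if warning is not None:
        thresholds["warning"] = warning
    return {
        "name": name,
        "type": "metric alert",
        # The threshold in the query must match options.thresholds.critical.
        "query": f"{metric_query} > {critical}",
        # An @pagerduty-<service> mention routes the alert via the PagerDuty integration.
        "message": f"{name} breached its threshold. @pagerduty-{pagerduty_service}",
        "options": {"thresholds": thresholds, "notify_no_data": False},
    }


# Assumed metric, thresholds, and service name; adjust to the real environment.
db_cpu = build_metric_monitor(
    name="High CPU on DBs",
    metric_query="avg(last_5m):avg:aws.rds.cpuutilization{*}",
    critical=90,
    pagerduty_service="example-service",
    warning=80,
)

print(json.dumps(db_cpu, indent=2))
```

Deriving the query string from the same `critical` value keeps the query and `options.thresholds` from drifting apart. The other checklist items could reuse the same shape with a different metric query and threshold.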
Open Questions:
Notes/Assumptions: