MITx Staging falls over periodically for no discernible reason
Today I was paged a number of times over the period of a few minutes by the following alerts:
https://opsg.in/a/i/mitodl/244c2d3f-3e18-4451-8696-743013cd995d-1698088522383
https://opsg.in/a/i/mitodl/48a3d7d8-2e00-48c9-8cfb-72bea377b914-1698088451253
https://opsg.in/a/i/mitodl/63168e6b-80e5-4157-83c3-b38da12c366b-1698088421770
We should probably look into this and either prevent it from happening or adjust our alerts accordingly.
Another recent example:
https://opsg.in/a/i/mitodl/a272eb0f-50f1-454a-9959-92cf4e47d497-1703263089256
Got a metric ton of these alerts this evening. When I checked the CloudWatch monitoring for its RDS database, I saw:
I wonder if we're being IO throttled. Look at those super spiky spikes in disk queue depth and LVMReadIOPs. Those can't be good.
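To check the throttling theory without eyeballing graphs, something like the rough sketch below could pull the numbers and flag the spikes. It is only a sketch under assumptions: it assumes `boto3` is installed with credentials that can read CloudWatch, the instance identifier is a placeholder, and the "5x the median" threshold is an arbitrary starting point rather than anything AWS defines. `DiskQueueDepth` and `ReadIOPS` are standard `AWS/RDS` metric names; for the LVM metric, pass whatever name the console shows.

```python
from statistics import median


def find_spikes(values, factor=5.0):
    """Return indices of datapoints that exceed `factor` times the series median."""
    if not values:
        return []
    baseline = median(values)
    # Floor the baseline at 1.0 so a mostly idle disk (median near 0)
    # does not cause every tiny blip to be flagged as a spike.
    threshold = max(baseline, 1.0) * factor
    return [i for i, v in enumerate(values) if v > threshold]


def fetch_rds_metric(instance_id, metric_name, hours=3):
    """Fetch per-minute maxima for one AWS/RDS metric, oldest first."""
    # Imported lazily so find_spikes can be used without AWS access.
    import boto3
    from datetime import datetime, timedelta, timezone

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric_name,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=60,
        Statistics=["Maximum"],
    )
    # CloudWatch does not guarantee datapoint order, so sort by timestamp.
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return [p["Maximum"] for p in points]
```

Usage would be along the lines of `find_spikes(fetch_rds_metric("<db-instance-id>", "DiskQueueDepth"))`, run once per metric. If the flagged minutes line up with the alert timestamps, that would support the IO throttling guess; if they don't, the spikes may be a red herring.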