ol-infrastructure MITx Staging falls over periodically for no discernible reason

Today I was paged several times within a few minutes by the following alerts:

https://opsg.in/a/i/mitodl/244c2d3f-3e18-4451-8696-743013cd995d-1698088522383 https://opsg.in/a/i/mitodl/48a3d7d8-2e00-48c9-8cfb-72bea377b914-1698088451253 https://opsg.in/a/i/mitodl/63168e6b-80e5-4157-83c3-b38da12c366b-1698088421770

We should probably look into this and either prevent it from happening or adjust our alerts accordingly.

Oct 23 '23 20:10 feoh

Another recent example:

https://opsg.in/a/i/mitodl/a272eb0f-50f1-454a-9959-92cf4e47d497-1703263089256

Dec 24 '23 23:12 feoh

Got a metric ton of these alerts this evening. When I checked the CloudWatch monitoring for its RDS database, I saw:

I wonder if we're being I/O throttled. Look at those super spiky spikes in disk queue depth and LVMReadIOPs. Those can't be good.
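
One way to test the throttling hunch without eyeballing the console graphs would be to pull the same metrics through the CloudWatch API and flag datapoints that sit far above the baseline. The following is a rough sketch, not something we run today. The helper names `find_spikes` and `fetch_rds_metric`, the 5x-median threshold, and the instance identifier in the usage comment are all made up for illustration. `DiskQueueDepth` and `ReadIOPS` are standard `AWS/RDS` metric names and may not be exactly what the console graph above was plotting.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def find_spikes(datapoints, factor=5.0):
    """Return the (timestamp, value) pairs whose value exceeds factor * median.

    `datapoints` is a list of (timestamp, value) tuples. The 5x-median
    threshold is an arbitrary starting point, not a tuned value.
    """
    values = [value for _, value in datapoints]
    if not values:
        return []
    threshold = median(values) * factor
    return [(ts, value) for ts, value in datapoints if value > threshold]


def fetch_rds_metric(db_instance_id, metric_name, hours=6, period=60):
    """Fetch per-period maxima for one AWS/RDS metric, oldest first.

    Hypothetical helper: needs boto3 and AWS credentials with CloudWatch
    read access. boto3 is imported here so find_spikes works without it.
    """
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric_name,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=period,
        Statistics=["Maximum"],
    )
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Maximum"]) for p in points]


# Example usage (the instance identifier is a placeholder, not our real one):
#
#   for metric in ("DiskQueueDepth", "ReadIOPS"):
#       spikes = find_spikes(fetch_rds_metric("example-db-instance", metric))
#       print(metric, len(spikes), "spikes")
```

If the spike timestamps line up with the alert timestamps, that would support the throttling theory. If the instance is on gp2 storage, the `BurstBalance` metric would also be worth checking over the same window.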

Dec 25 '23 05:12 feoh