datahub
2022-12-02 hub outage debrief!
We had a two-hour outage today, from 3 to 5 PM PST, which affected all users. Data 8 was hit hardest: they had an assignment due today, which had to be postponed by a day. Thanks to the efforts of @ryanlovett, @yuvipanda, and @felder, we were able to get the hubs stable again. We should debrief this issue extensively next week, since the hubs had been relatively stable for the last 3 weeks and we have made no major infrastructure updates recently.
We should write an incident report after doing an extensive analysis when we are back next week.
I drafted the outage and resolution communication process, which should be followed whenever there is a major outage across all hubs - https://docs.google.com/document/d/1E32D-FAcFFpU5oRSzAAqfhUHMLvm3_K6Bv6nl9S2-DU/edit?usp=sharing. I am defining a major outage as "hubs not available for users for more than 30 minutes". If anyone has further input, please add your comments in the Google doc.
I will create a PR with the incident report template in the next few days. @ryanlovett, can you review the information you shared in the Slack thread and add the relevant parts to that PR?
@balajialg Sure, just link to the PR in a comment here and I'll transfer info to it.
@ryanlovett FYI - I created a PR for the incident report template - https://github.com/berkeley-dsep-infra/datahub/pull/4006. Let me know your thoughts!