2022-12-02 hub outage debrief!

Open balajialg opened this issue 2 years ago • 4 comments

We had a 2-hour outage today between 3 and 5 PM PST that affected all users, mainly Data 8, whose assignment due today had to be postponed by a day. Thanks to the efforts of @ryanlovett, @yuvipanda, and @felder, we were able to get the hubs stable again. We should debrief this issue extensively next week, since the hubs had been relatively stable for the last 3 weeks and we have made no major infrastructure updates recently.

We should write an incident report after doing an extensive analysis when we are back next week.

balajialg avatar Dec 03 '22 02:12 balajialg

Drafted the outage and resolution communication process to be followed whenever there is a major outage across all hubs: https://docs.google.com/document/d/1E32D-FAcFFpU5oRSzAAqfhUHMLvm3_K6Bv6nl9S2-DU/edit?usp=sharing. I am defining a major outage as "hubs not available for users for more than 30 minutes". If anyone has further input, please add your comments in the Google doc.

balajialg avatar Dec 06 '22 00:12 balajialg

I will create a PR with a template for the incident report in the next few days. @ryanlovett Can you review the information you shared in the Slack thread and add the relevant parts to that PR once it is up?

balajialg avatar Dec 14 '22 02:12 balajialg

@balajialg Sure, just link to the PR in a comment here and I'll transfer info to it.

ryanlovett avatar Dec 15 '22 01:12 ryanlovett

@ryanlovett FYI: I created a PR for the incident report template: https://github.com/berkeley-dsep-infra/datahub/pull/4006. Let me know your thoughts!

balajialg avatar Dec 16 '22 21:12 balajialg