webi-installers icon indicating copy to clipboard operation
webi-installers copied to clipboard

[Fixed] Webinstall.dev was down for several minutes.

Open coolaj86 opened this issue 8 months ago • 2 comments

Problem

17 minutes of downtime today June 5th from 18:39 to 18:56 UTC.

Retrospective

What happened:

  1. There was a typo in the Authorization header, so authorization was not correctly sent
  2. Production makes many requests, quickly reaching the rate limits (which cannot easily be mocked in testing)
  3. The error is being thrown in an async function, which caused server restart
  4. The server refreshes at least one random package on start, which caused failure on start
  5. Successive failures in rapid succession caused systemctl to abort relaunching

What to do about:

  1. Fix the typo.
    This passed review without notice. It couldn't have been reasonably caught in testing. The typo was a valid word, so it wasn't caught by spellcheck either. 🤷‍♂️ As humans we make mistakes.
  2. Reconsider the error handling.
    Not sure if this category of error should cause this level of failure or not. The severity of the failure made it easy to identify and, since a user can't directly invoke this sort of failure remotely, it doesn't seems to present an attack vector.
  3. More time between hotfixes and refactors.
    "While we're here, might as well..." was the really root cause. It was not necessary to switch to using fetch (#852) in order to solve #850, #851. Even thought the commits were distinct, the process was not. If I had waited for a truly separate review on that change, the error would certainly have been more likely to be caught (i.e. review fatigue).

Status Updates

Not sure why yet. Investigating.

Possibly related to the change in fetching github releases and a difference between the staging and production environment.

coolaj86 avatar Jun 05 '24 18:06 coolaj86