Stacks which fail to update should not lose their failed state
Today there are a few reasons why a nightly stack update can fail. A couple we have seen:
- SSM Parameter Store had throughput issues and stacks were not able to update
- RDS was running out of connections and stacks were not able to update
We can assume there will always be some reason why stack updates can fail.
Today, when a stack update fails BEFORE anything is submitted to CloudFormation, the background task updates the RDS table and sets status: FAILED on the stack.
If you then visit that dataset or environment in the UI, the GraphQL layer that fetches stack details makes a boto3 call to CloudFormation to check the state of the stack. It finds that the stack itself is fine (in state CREATE_COMPLETE), CLEARS the FAILED status, and shows the stack as healthy in the UI.
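A minimal sketch of the current flow, assuming a boto3 CloudFormation client and a SQLAlchemy-style session; `nightly_update`, `resolve_stack`, and `submit_update_to_cloudformation` are hypothetical names, not the actual resolvers:

```python
import boto3

cloudformation = boto3.client("cloudformation")


def nightly_update(session, stack):
    """Background task: if the update fails before anything is submitted to
    CloudFormation, only the RDS record is marked FAILED."""
    try:
        submit_update_to_cloudformation(stack)  # hypothetical helper
    except Exception as error:  # e.g. SSM throttling, RDS out of connections
        stack.status = "FAILED"
        stack.error = str(error)
        session.commit()  # CloudFormation never saw this attempt


def resolve_stack(session, stack):
    """GraphQL resolver: refreshes the record from CloudFormation and, today,
    silently overwrites the FAILED status set by the nightly task."""
    response = cloudformation.describe_stacks(StackName=stack.stackid)
    stack.status = response["Stacks"][0]["StackStatus"]  # e.g. CREATE_COMPLETE; FAILED is lost
    session.commit()
    return stack
```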
This is obviously not OK. The mechanism wipes out the information that the stack failed to update overnight. In fact, we wouldn't even know our stacks were failing at night if we didn't have a separate job monitoring stacks directly in RDS and checking whether they are FAILED.
I believe we could solve this in two ways:
- If a stack update fails, its FAILED state should not be cleared until a subsequent update runs successfully (imo this makes the most sense; see the sketch after this list)
- We detach the update state from the stack state and show them separately. But this would be more confusing to the user, since there would be two status fields...
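A minimal sketch of the first option, under the same assumptions and hypothetical names as above: the read path only refreshes the status from CloudFormation when the stack is not FAILED, so only a successful update can clear the flag:

```python
import boto3

cloudformation = boto3.client("cloudformation")


def resolve_stack(session, stack):
    """GraphQL resolver: refresh from CloudFormation only when the last
    update did not fail, so a nightly failure stays visible in the UI."""
    if stack.status != "FAILED":
        response = cloudformation.describe_stacks(StackName=stack.stackid)
        stack.status = response["Stacks"][0]["StackStatus"]
        session.commit()
    return stack


def nightly_update(session, stack):
    """Background task: a successful submission to CloudFormation is the only
    thing that clears a previous FAILED status."""
    try:
        submit_update_to_cloudformation(stack)  # hypothetical helper
        stack.status = "UPDATE_IN_PROGRESS"  # CloudFormation will report the final state
    except Exception as error:
        stack.status = "FAILED"
        stack.error = str(error)
    session.commit()
```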