etl
Staging servers have non-deterministic failures
Problem
Anecdotally, staging server pipelines often fail at random, but the same build works when you manually retry it on Buildkite.
Expected behaviour
We'd hope for pipelines to pass on the first run.
Gathering more information
- Could we periodically review recent builds on each branch?
- Could we dump all pipelines including failures and retries using the API?
- Are there additional steps that should have automatic retries?
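
As a starting point for the API idea above, here is a minimal sketch in Python that pulls recent builds for one pipeline from the Buildkite REST API and counts, per step, how many jobs failed and were then retried. It rests on several assumptions that should be checked against the Buildkite docs: that each build in the response carries a `jobs` list, and that each job exposes `name`, `state` and `retried` fields. The organization slug, pipeline slug and the `BUILDKITE_API_TOKEN` environment variable are placeholders, not values from this repo. It reads only the first page of results.

```python
import json
import os
import urllib.request
from collections import Counter

API_ROOT = "https://api.buildkite.com/v2"


def fetch_builds(org, pipeline, token, per_page=100):
    """Fetch the most recent builds for one pipeline (first page only)."""
    url = (
        f"{API_ROOT}/organizations/{org}/pipelines/{pipeline}/builds"
        f"?per_page={per_page}"
    )
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize_retries(builds):
    """Count, per step name, jobs that failed and were later retried.

    Assumes each job dict has "name", "state" and "retried" keys;
    jobs without them are skipped or grouped under a fallback name.
    """
    flaky = Counter()
    for build in builds:
        for job in build.get("jobs", []):
            if job.get("retried") and job.get("state") == "failed":
                flaky[job.get("name") or "(unnamed step)"] += 1
    return flaky


def main():
    # Placeholder slugs and env var name; replace with real values.
    token = os.environ["BUILDKITE_API_TOKEN"]
    builds = fetch_builds("my-org", "my-pipeline", token)
    for step, count in summarize_retries(builds).most_common():
        print(f"{count:4d}  {step}")
```

Calling `main()` with a valid token would print the steps most often retried after a failure, which could also suggest which steps are candidates for automatic retries.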