concourse-bosh-release
BOSH stop under high load fails - baggageclaim
Bug Report
Under high load, the BOSH deploy command fails because the baggageclaim stop fails.
We have observed this twice on Wings already.
Incident 1
We observed that HTTP responses were extremely slow on Wings. To rectify the issue, we decided to restart the system and bump the stemcell.
Doing so via a BOSH deploy resulted in the baggageclaim stop failing, which in turn caused the deploy to fail.
Manually ssh'ing onto the VM and issuing a monit restart didn't deterministically resolve the issue on a subsequent BOSH deploy.
To resolve the failed deploy, all the workers had to be stopped via stop --hard, which took multiple tries to succeed (on every try, a few more workers would be stopped). Finally, another BOSH deploy got the system back into a working state.
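The repeated stop --hard attempts described above could be scripted roughly as follows. This is only a sketch of the manual workaround, not a fix for the underlying baggageclaim problem. The helper name retry_command, the attempt count, and the delay are illustrative assumptions, as are the deployment name "concourse" and instance group "worker" in the usage comment.

```python
import subprocess
import time
from typing import Callable, Sequence


def _run_subprocess(cmd: Sequence[str]) -> int:
    """Run the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def retry_command(
    cmd: Sequence[str],
    max_attempts: int = 5,          # assumption: pick a limit that suits your deployment
    delay_seconds: float = 30.0,    # assumption: pause between attempts
    run: Callable[[Sequence[str]], int] = _run_subprocess,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd until it exits 0; return the number of attempts used.

    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        if run(cmd) == 0:
            return attempt
        # Wait before the next try, but not after the final failure.
        if attempt < max_attempts:
            sleep(delay_seconds)
    raise RuntimeError(
        f"command failed after {max_attempts} attempts: {' '.join(cmd)}"
    )


# Hypothetical usage; the deployment and instance group names are assumptions:
# retry_command(["bosh", "-n", "-d", "concourse", "stop", "worker", "--hard"])
```

Each attempt in the incident stopped a few more workers, so a loop like this only automates the retries; it does not make any single stop more likely to succeed.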
- Concourse version: 4.2.2
- Deployment type (BOSH/Docker/binary): BOSH
- Infrastructure/IaaS: GCP
Steps to Reproduce
Unfortunately, there isn't a consistent way to reproduce the issue. What we observed was that the system was under high load: workers had ~200 containers and were executing builds and resource checks.
Expected Results
BOSH commands such as deploy, stop, and start shouldn't fail due to baggageclaim. When they do, the system can't be restored via BOSH, and increasingly destructive actions are needed to eventually get it back to a working state.