BOSH stop under high load fails - baggageclaim

Open xtreme-sameer-vohra opened this issue 5 years ago • 0 comments

Bug Report

Under high load, a BOSH deploy fails because the baggageclaim stop fails.

We have observed this twice on Wings already.

Incident 1

We observed that HTTP responses were extremely slow on Wings. To rectify the issue, we decided to restart the system and bump the stemcell.

Doing so via a BOSH deploy caused the baggageclaim stop to fail, which in turn caused the deploy to fail.

Manually SSHing onto the VM and issuing a monit restart did not reliably resolve the issue on a subsequent BOSH deploy.

To resolve the failed deploy, all the workers had to be stopped via bosh stop --hard, which took multiple tries to succeed (on each try, a few more workers were stopped). Finally, another BOSH deploy got the system back into a working state.
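
For anyone hitting the same failure, the manual workaround above (repeating the hard stop until it succeeds) could be scripted roughly as follows. This is only a sketch of what we did by hand, not a fix. The deployment name `concourse`, the instance group name `worker`, the attempt count, and the delay are assumptions for illustration; the helper names are hypothetical.

```python
import subprocess
import time
from typing import Callable, List


def bosh_stop_hard_cmd(deployment: str, instance_group: str) -> List[str]:
    """Build the `bosh stop --hard` command line for one instance group.

    -n answers the confirmation prompt non-interactively; -d selects the
    deployment. Both names passed in are placeholders chosen by the caller.
    """
    return ["bosh", "-n", "-d", deployment, "stop", instance_group, "--hard"]


def run_command(cmd: List[str]) -> int:
    """Run a command and return its exit code (0 means success)."""
    return subprocess.run(cmd).returncode


def retry(
    cmd: List[str],
    attempts: int = 5,
    delay_seconds: float = 30.0,
    run: Callable[[List[str]], int] = run_command,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-run cmd until it exits 0; return the attempt number that succeeded.

    Raises RuntimeError if every attempt fails. `run` and `sleep` are
    injectable so the loop can be exercised without a BOSH director.
    """
    for attempt in range(1, attempts + 1):
        if run(cmd) == 0:
            return attempt
        if attempt < attempts:
            sleep(delay_seconds)  # give the remaining workers time to settle
    raise RuntimeError(f"command still failing after {attempts} attempts: {cmd}")
```

Calling `retry(bosh_stop_hard_cmd("concourse", "worker"))` would repeat the hard stop up to five times, after which a normal deploy can be attempted again. It only automates the workaround; the underlying baggageclaim stop failure remains.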

  • Concourse version: 4.2.2
  • Deployment type (BOSH/Docker/binary): BOSH
  • Infrastructure/IaaS: GCP

Steps to Reproduce

Unfortunately, there isn't a consistent way to reproduce the issue. What we observed was that the system was under high load (workers had ~200 containers each) and was executing builds and resource checks.

Expected Results

BOSH commands such as deploy, stop, and start shouldn't fail because of baggageclaim. When they do, the system cannot be restored via BOSH, and increasingly destructive actions are needed to eventually bring it back to a working state.

xtreme-sameer-vohra avatar Mar 22 '19 15:03 xtreme-sameer-vohra