capi-release
cloud_controller_worker pre-backup-lock hangs when there are 10 or more cloud_controller_workers
Thanks for submitting an issue to capi-release. We are always trying to improve! To help us, please fill out the following template.
Issue
cloud_controller_worker pre-backup-lock hangs when there are 10 or more cloud_controller_workers
Context
We ran into issues when trying to use bbr to back up our cf deployment.
We deployed cf with this ops file:
- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc/broker_client_default_async_poll_interval_seconds?
  value: 10
- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc/jobs?/generic?/number_of_workers?
  value: 11
- type: replace
  path: /instance_groups/name=cc-worker/vm_type
  value: medium
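For reference, applying an ops file like this is just a normal bosh deploy. A rough sketch, assuming the manifest is cf-deployment.yml and the snippet above is saved as scale-cc-workers.yml (both file names are placeholders, not the exact files we used):
$ bosh -d cf deploy cf-deployment.yml -o scale-cc-workers.yml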
We observed these logs
...
[bbr] 2022/09/13 15:31:55 INFO - Finished locking cloud_controller_clock on scheduler/b468a5cd-130a-442a-bcab-bfbc40cb4ab5 for backup.
[bbr] 2022/09/13 15:32:03 INFO - Finished locking cloud_controller_ng on api/dc671209-452f-43c9-8077-3927b614ffad for backup.
[bbr] 2022/09/13 15:32:06 INFO - Finished locking cloud_controller_ng on api/5859cb84-76ed-492f-b03f-023532aa352e for backup.
On the vm we see
cc-worker/ff736dfe-8c81-4528-98a6-5698ba183123:~# monit summary
The Monit daemon 5.2.5 uptime: 2d 3h 22m
Process 'cloud_controller_worker_1' not monitored
Process 'cloud_controller_worker_2' running
Process 'cloud_controller_worker_3' running
Process 'cloud_controller_worker_4' running
Process 'cloud_controller_worker_5' running
Process 'cloud_controller_worker_6' running
Process 'cloud_controller_worker_7' running
Process 'cloud_controller_worker_8' running
Process 'cloud_controller_worker_9' running
Process 'cloud_controller_worker_10' running
Process 'cloud_controller_worker_11' running
Process 'loggregator_agent' running
Process 'loggr-forwarder-agent' running
Process 'loggr-syslog-agent' running
Process 'prom_scraper' running
Process 'metrics-discovery-registrar' running
Process 'metrics-agent' running
Process 'bosh-dns' running
Process 'bosh-dns-resolvconf' running
Process 'bosh-dns-healthcheck' running
On the vm we also see
cc-worker/ff736dfe-8c81-4528-98a6-5698ba183123:~# /var/vcap/jobs/cloud_controller_worker/bin/bbr/pre-backup-lock
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Steps to Reproduce
- deploy cf with
- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc/broker_client_default_async_poll_interval_seconds?
  value: 10
- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc/jobs?/generic?/number_of_workers?
  value: 11
- type: replace
  path: /instance_groups/name=cc-worker/vm_type
  value: medium
and this bbr ops file
- type: replace
  path: /releases/-
  value:
    name: backup-and-restore-sdk
    sha1: 238c36f2229f303ebf96f6b24b29799232195e38
    url: https://bosh.io/d/github.com/cloudfoundry-incubator/backup-and-restore-sdk-release?v=1.18.52
    version: 1.18.52
- type: replace
  path: /instance_groups/-
  value:
    azs:
    - z1
    instances: 1
    jobs:
    - name: database-backup-restorer
      release: backup-and-restore-sdk
    - name: bbr-cfnetworkingdb
      properties:
        release_level_backup: true
      release: cf-networking
    - name: bbr-cloudcontrollerdb
      release: capi
    - name: bbr-routingdb
      release: routing
    - name: bbr-uaadb
      properties:
        release_level_backup: true
      release: uaa
    - name: bbr-credhubdb
      properties:
        release_level_backup: true
      release: credhub
    - name: cf-cli-6-linux
      release: cf-cli
    name: backup-restore
    networks:
    - name: default
    persistent_disk_type: 10GB
    stemcell: default
    vm_type: minimal
- type: replace
  path: /instance_groups/name=api/jobs/name=routing-api/properties/release_level_backup?
  value: true
- try to take a backup of cf with bbr
$ bbr deployment --deployment cf backup
- observe the failure
...
[bbr] 2022/09/13 15:31:55 INFO - Finished locking cloud_controller_clock on scheduler/b468a5cd-130a-442a-bcab-bfbc40cb4ab5 for backup.
[bbr] 2022/09/13 15:32:03 INFO - Finished locking cloud_controller_ng on api/dc671209-452f-43c9-8077-3927b614ffad for backup.
[bbr] 2022/09/13 15:32:06 INFO - Finished locking cloud_controller_ng on api/5859cb84-76ed-492f-b03f-023532aa352e for backup.
Expected result
We expected the command in step 2 to succeed.
Current result
Currently the command hangs: the cloud_controller_worker pre-backup-lock script on the cc-worker vm waits forever for cloud_controller_worker_1 to be unmonitored.
Possible Fix
This piece of code
function wait_unmonitor_job() {
  local job_name="$1"

  while true; do
    if [[ $(sudo /var/vcap/bosh/bin/monit summary | grep ${job_name} ) =~ not[[:space:]]monitored[[:space:]]*$ ]]; then
      echo "Unmonitored ${job_name}"
      return 0
    else
      echo "Waiting for ${job_name} to be unmonitored..."
    fi
    sleep 0.1
  done
}
seems to assume that there will always be fewer than 10 cloud_controller_workers: with 10 or more, grep ${job_name} for cloud_controller_worker_1 also matches cloud_controller_worker_10 and cloud_controller_worker_11, so the grep output no longer ends with "not monitored" and the end-anchored regex never matches.
On our vm with 11 workers
cc-worker/ff736dfe-8c81-4528-98a6-5698ba183123:~# monit summary | grep cloud_controller_worker_1
Process 'cloud_controller_worker_1' not monitored
Process 'cloud_controller_worker_10' running
Process 'cloud_controller_worker_11' running
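To make the failure mode concrete, here is a small self-contained sketch of the same check run against output like the above (the summary text is copied from the monit output shown earlier; this is not code from the release):
#!/bin/bash
# Sketch: reproduce the wait_unmonitor_job check against monit output with 11 workers.
summary="Process 'cloud_controller_worker_1' not monitored
Process 'cloud_controller_worker_10' running
Process 'cloud_controller_worker_11' running"

job_name="cloud_controller_worker_1"

# grep matches all three lines, so the captured string ends in "running" and the
# end-anchored regex can never match. With 9 or fewer workers, only the worker_1
# line matches and the check succeeds.
if [[ $(echo "${summary}" | grep ${job_name}) =~ not[[:space:]]monitored[[:space:]]*$ ]]; then
  echo "Unmonitored ${job_name}"
else
  echo "Waiting for ${job_name} to be unmonitored..."  # this branch always runs
fi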
A potential fix for this bug is to change the pattern to something like
.*${job_name}.*[[:space:]]not[[:space:]]monitored[[:space:]]
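For illustration, a minimal sketch of wait_unmonitor_job with that pattern dropped in (sketch only; the attached patch may differ). Another option along the same lines would be to grep for the quoted process name as it appears in the monit output, e.g. grep "'${job_name}'", so that cloud_controller_worker_1 can no longer also match _10 and _11:
function wait_unmonitor_job() {
  local job_name="$1"

  while true; do
    # Require "not monitored" to appear after the job name somewhere in the grep
    # output, instead of requiring it at the very end of the (possibly multi-line)
    # output, which is what hangs when 10+ workers match the grep.
    if [[ $(sudo /var/vcap/bosh/bin/monit summary | grep ${job_name}) =~ .*${job_name}.*[[:space:]]not[[:space:]]monitored[[:space:]] ]]; then
      echo "Unmonitored ${job_name}"
      return 0
    else
      echo "Waiting for ${job_name} to be unmonitored..."
    fi
    sleep 0.1
  done
}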
I have attached a patch as well (since I do not have push rights to the repo): 0001-monit_utils-Add-support-for-more-than-10-cc-workers.txt