
cloud_controller_worker pre-backup-lock hangs when there are 10 or more cloud_controller_workers

Open · ohkyle opened this issue 2 years ago · 1 comment

Thanks for submitting an issue to capi-release. We are always trying to improve! To help us, please fill out the following template.

Issue

cloud_controller_worker pre-backup-lock hangs when there are 10 or more cloud_controller_workers

Context

We ran into issues when trying to use bbr to back up our cf deployment.

We deployed cf with this ops file:

- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc/broker_client_default_async_poll_interval_seconds?
  value: 10

- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc/jobs?/generic?/number_of_workers?
  value: 11

- type: replace
  path: /instance_groups/name=cc-worker/vm_type
  value: medium

We observed these logs:

...
[bbr] 2022/09/13 15:31:55 INFO - Finished locking cloud_controller_clock on scheduler/b468a5cd-130a-442a-bcab-bfbc40cb4ab5 for backup.
[bbr] 2022/09/13 15:32:03 INFO - Finished locking cloud_controller_ng on api/dc671209-452f-43c9-8077-3927b614ffad for backup.
[bbr] 2022/09/13 15:32:06 INFO - Finished locking cloud_controller_ng on api/5859cb84-76ed-492f-b03f-023532aa352e for backup.

On the VM we see:

cc-worker/ff736dfe-8c81-4528-98a6-5698ba183123:~# monit summary
The Monit daemon 5.2.5 uptime: 2d 3h 22m

Process 'cloud_controller_worker_1' not monitored
Process 'cloud_controller_worker_2' running
Process 'cloud_controller_worker_3' running
Process 'cloud_controller_worker_4' running
Process 'cloud_controller_worker_5' running
Process 'cloud_controller_worker_6' running
Process 'cloud_controller_worker_7' running
Process 'cloud_controller_worker_8' running
Process 'cloud_controller_worker_9' running
Process 'cloud_controller_worker_10' running
Process 'cloud_controller_worker_11' running
Process 'loggregator_agent'         running
Process 'loggr-forwarder-agent'     running
Process 'loggr-syslog-agent'        running
Process 'prom_scraper'              running
Process 'metrics-discovery-registrar' running
Process 'metrics-agent'             running
Process 'bosh-dns'                  running
Process 'bosh-dns-resolvconf'       running
Process 'bosh-dns-healthcheck'      running

On the VM we also see:

cc-worker/ff736dfe-8c81-4528-98a6-5698ba183123:~# /var/vcap/jobs/cloud_controller_worker/bin/bbr/pre-backup-lock
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...

Steps to Reproduce

  1. deploy cf with
- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc/broker_client_default_async_poll_interval_seconds?
  value: 10

- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc/jobs?/generic?/number_of_workers?
  value: 11

- type: replace
  path: /instance_groups/name=cc-worker/vm_type
  value: medium

and bbr

- type: replace
  path: /releases/-
  value:
    name: backup-and-restore-sdk
    sha1: 238c36f2229f303ebf96f6b24b29799232195e38
    url: https://bosh.io/d/github.com/cloudfoundry-incubator/backup-and-restore-sdk-release?v=1.18.52
    version: 1.18.52
- type: replace
  path: /instance_groups/-
  value:
    azs:
    - z1
    instances: 1
    jobs:
    - name: database-backup-restorer
      release: backup-and-restore-sdk
    - name: bbr-cfnetworkingdb
      properties:
        release_level_backup: true
      release: cf-networking
    - name: bbr-cloudcontrollerdb
      release: capi
    - name: bbr-routingdb
      release: routing
    - name: bbr-uaadb
      properties:
        release_level_backup: true
      release: uaa
    - name: bbr-credhubdb
      properties:
        release_level_backup: true
      release: credhub
    - name: cf-cli-6-linux
      release: cf-cli
    name: backup-restore
    networks:
    - name: default
    persistent_disk_type: 10GB
    stemcell: default
    vm_type: minimal
- type: replace
  path: /instance_groups/name=api/jobs/name=routing-api/properties/release_level_backup?
  value: true
  2. Try to take a backup of cf with bbr
$ bbr deployment --deployment cf backup
  3. Observe the failure:
...
[bbr] 2022/09/13 15:31:55 INFO - Finished locking cloud_controller_clock on scheduler/b468a5cd-130a-442a-bcab-bfbc40cb4ab5 for backup.
[bbr] 2022/09/13 15:32:03 INFO - Finished locking cloud_controller_ng on api/dc671209-452f-43c9-8077-3927b614ffad for backup.
[bbr] 2022/09/13 15:32:06 INFO - Finished locking cloud_controller_ng on api/5859cb84-76ed-492f-b03f-023532aa352e for backup.

Expected result

We expected the command in step 2 to succeed.

Current result

Currently the command hangs: the `pre-backup-lock` script on the cc-worker VM loops forever and the backup never completes.

Possible Fix

This piece of code

function wait_unmonitor_job() {
  local job_name="$1"

  while true; do
    if [[ $(sudo /var/vcap/bosh/bin/monit summary | grep ${job_name} ) =~ not[[:space:]]monitored[[:space:]]*$ ]]; then
      echo "Unmonitored ${job_name}"
      return 0
    else
      echo "Waiting for ${job_name} to be unmonitored..."
    fi

    sleep 0.1
  done
}

seems to assume that there will always be fewer than 10 cloud_controller_workers. With 10 or more workers, `grep ${job_name}` for `cloud_controller_worker_1` also matches `cloud_controller_worker_10` and `cloud_controller_worker_11`, so the captured multi-line output ends with a `running` line, the anchored `not monitored$` check never succeeds, and the loop spins forever.

On our VM with 11 workers:

cc-worker/ff736dfe-8c81-4528-98a6-5698ba183123:~# monit summary | grep cloud_controller_worker_1
Process 'cloud_controller_worker_1' not monitored
Process 'cloud_controller_worker_10' running
Process 'cloud_controller_worker_11' running
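
The failure mode can be reproduced in isolation. A minimal sketch, with the monit output hard-coded as a stand-in for a real `sudo /var/vcap/bosh/bin/monit summary` on a cc-worker VM:

```shell
#!/usr/bin/env bash

# Simulated `monit summary` output for a VM with 11 workers.
monit_output="Process 'cloud_controller_worker_1' not monitored
Process 'cloud_controller_worker_10' running
Process 'cloud_controller_worker_11' running"

job_name="cloud_controller_worker_1"

# grep matches all three worker lines (worker_10 and worker_11 contain
# "worker_1" as a substring), so the captured multi-line string ends
# with "running" and the anchored check from wait_unmonitor_job fails.
if [[ $(echo "${monit_output}" | grep ${job_name}) =~ not[[:space:]]monitored[[:space:]]*$ ]]; then
  result="match"
else
  result="no match"   # this branch is taken: pre-backup-lock would spin forever
fi
echo "${result}"
```

Since bash's `=~` anchors `$` to the end of the whole string rather than the end of each line, this prints `no match` even though worker_1 is in fact unmonitored.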

ohkyle avatar Sep 13 '22 23:09 ohkyle

A potential fix for this bug is to drop the end-of-string anchor and match the pattern

.*${job_name}.*[[:space:]]not[[:space:]]monitored[[:space:]]
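
Without the `$` anchor, the pattern can match the worker's own line anywhere inside the multi-line grep output. A quick sketch of the proposed pattern against the same simulated output (again hard-coded in place of a real `monit summary`):

```shell
#!/usr/bin/env bash

# Simulated `monit summary` output for a VM with 11 workers.
monit_output="Process 'cloud_controller_worker_1' not monitored
Process 'cloud_controller_worker_10' running
Process 'cloud_controller_worker_11' running"

job_name="cloud_controller_worker_1"

# The unanchored pattern matches as soon as the worker's own line
# reports "not monitored", regardless of what the worker_10/worker_11
# lines say.
if [[ $(echo "${monit_output}" | grep ${job_name}) =~ .*${job_name}.*[[:space:]]not[[:space:]]monitored[[:space:]] ]]; then
  result="unmonitored"
else
  result="still monitored"
fi
echo "${result}"
```

An alternative would be to make the grep itself exact, e.g. matching `Process '${job_name}'` including the surrounding quotes, which avoids the substring collision entirely; the pattern above is the fix the attached patch takes.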

I have attached a patch as well, since I do not have push rights to the repo: 0001-monit_utils-Add-support-for-more-than-10-cc-workers.txt

ohkyle avatar Sep 26 '22 22:09 ohkyle