
bpm/1.4.2 fails on a bosh-lite

Open · abg opened this issue 1 year ago • 5 comments

Yesterday our pipelines picked up bpm/1.4.2, which bumped to runc/1.2.0, and environments using a bosh-lite configuration started failing.

The initial deployment is successful, but cleaning up jobs later fails.

# bpm stop test-server
Error: failed to cleanup job-process: exit status 1

bpm seems to get into a bad state if I have multiple deployments and restart them a couple of times. Here's a reproduction using the bpm-release bosh-lite.yml test manifest.

$ bosh -n -d bpm deploy manifests/bosh-lite.yml
...success...
$ export BOSH_DEPLOYMENT=bpm-$(uuidgen)
$ bosh -n deploy manifests/bosh-lite.yml -o <(echo '[{"type":"replace","path":"/name","value":"((deployment_name))"}]') -v deployment_name=$BOSH_DEPLOYMENT
...success...
$ bosh -n restart
...success...
$ bosh -n restart
...
Task 20 | 14:31:59 | L starting jobs: bpm/33f58def-3dac-467e-bc7d-715e4a890b54 (0) (canary) (00:02:33)
                   L Error: 'bpm/33f58def-3dac-467e-bc7d-715e4a890b54 (0)' is not running after update. Review logs for failed jobs: test-server, alt-test-server
...

$ bosh ssh 
# bpm list
Name                        Pid Status
test-errand                 -   stopped
test-server                 -   failed
test-server.alt-test-server -   failed
# bpm start test-server
Error: failed to clean up stale job-process: exit status 1
# bpm stop test-server
Error: failed to clean up stale job-process: exit status 1

This may be related:

# /var/vcap/packages/bpm/bin/runc --root /var/vcap/sys/run/bpm-runc delete --force bpm-test-server
ERRO[0002] unable to destroy container: unable to remove container's cgroup: Failed to remove paths: map[net_cls:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-test-server net_prio:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-test-server]

I couldn't reproduce this on a bbl environment, nor with bpm/1.4.1.

Rolling back to bpm/1.4.1 (and runc/1.1.15) seems to resolve this issue for us.
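
For reference, the rollback itself is just a version pin at deploy time. Roughly, against the test manifest (assuming bpm/1.4.1 is already uploaded to the director and the manifest lists the bpm release by name):

$ bosh -n -d bpm deploy manifests/bosh-lite.yml -o <(echo '[{"type":"replace","path":"/releases/name=bpm/version","value":"1.4.1"}]')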

abg avatar Oct 29 '24 16:10 abg

Poking at this a little this morning, I see that runc-1.1.15 ran into the same container/cgroup teardown issue but seemingly ignored it. runc-1.2.0 seems to hard stop when it cannot clean up a cgroup.

# /var/vcap/packages/bpm/bin/runc --version
runc version 1.2.0
commit: unknown
spec: 1.2.0
go: go1.23.2
libseccomp: 2.5.1
# /var/vcap/packages/bpm/bin/runc --root /var/vcap/sys/run/bpm-runc/ delete bpm-loggr-forwarder-agent
ERRO[0002] unable to destroy container: unable to remove container's cgroup: Failed to remove paths: map[net_cls:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent net_prio:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent]
# echo $?
1

# /var/vcap/packages/bpm/bin/runc-1.1.15 --version
runc version 1.1.15
commit: unknown
spec: 1.0.2-dev
go: go1.23.2
libseccomp: 2.5.3

# /var/vcap/packages/bpm/bin/runc-1.1.15 --root /var/vcap/sys/run/bpm-runc/ delete bpm-loggr-forwarder-agent
WARN[0000] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent: device or resource busy"
WARN[0000] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent: device or resource busy"
ERRO[0000] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent: device or resource busy"
ERRO[0000] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent: device or resource busy"
ERRO[0000] Failed to remove paths: map[net_cls:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent net_prio:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent]
# echo $?
0

abg avatar Oct 31 '24 15:10 abg

We found the changes in runc that led to this behavioral change: https://github.com/opencontainers/runc/commit/a6f4081766a0f405bb9b5e798a4930c1f434c6b1 and https://github.com/opencontainers/runc/commit/7396ca90fa47d0458da4188061b24ca1bff465bf

Essentially, errors from removing cgroups used to be ignored. Now runc takes a "fast fail" approach and returns them.

It's unclear at the moment what leads to the errors removing cgroups on bosh lites.

selzoc avatar Oct 31 '24 21:10 selzoc

Note that this only appears to be a problem with bosh lites using the warden cpi, not the docker cpi.

selzoc avatar Oct 31 '24 21:10 selzoc

FYI I've opened a PR for bosh-deployment to resolve this issue: https://github.com/cloudfoundry/bosh-deployment/pull/479

selzoc avatar Nov 19 '24 18:11 selzoc

Swinging back around to this, I unpinned bpm in one of our pipelines since we are now pulling in the https://github.com/cloudfoundry/bosh-deployment/pull/479 change.

My pipeline failed: random jobs failed to start on redeploys or on monit stop/start operations.

It seems like the containerd_mode: false property is set, but in some configurations jobs still don't restart cleanly.

$ bosh env
Name               bosh-lite
UUID               45dc22bd-4972-459d-93ac-93a048e71e1b
Version            280.1.13 (00000000)
Director Stemcell  -/1.651
CPI                warden_cpi
Features           config_server: enabled
                   local_dns: enabled
                   snapshots: disabled
User               admin

$ bosh -n restart
...
Task 149 | 19:06:13 | L starting jobs: bpm/7792fd64-c12f-4034-b129-b10eed0a3946 (0) (canary) (00:02:48)
                    L Error: 'bpm/7792fd64-c12f-4034-b129-b10eed0a3946 (0)' is not running after update. Review logs for failed jobs: test-server, alt-test-server

$ bosh ssh
$ sudo -i
# bpm version
1.4.6
# bpm list
Name                        Pid Status
test-errand                 -   stopped
test-server                 -   failed
test-server.alt-test-server -   failed

# /var/vcap/packages/bpm/bin/runc --root /var/vcap/sys/run/bpm-runc/ delete bpm-test-server
ERRO[0002] unable to destroy container: unable to remove container's cgroup: Failed to remove paths: map[net_cls:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-test-server net_prio:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-test-server]

$ ssh -i /tmp/director.key jumpbox@${director_ip}
bosh/0:~$ sudo -i
bosh/0:~# grep -ri containerd /var/vcap/jobs/garden/monit
bosh/0:~# 

abg avatar Dec 11 '24 19:12 abg

Some minor updates: I was able to reproduce this with the latest bpm release (1.4.17).

It isn't actually necessary to make two deployments to reproduce it. You can get the same behavior by scaling the test job to 2 VMs.
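
For example, roughly (assuming the instance group in manifests/bosh-lite.yml is named bpm, as the task output above shows):

$ bosh -n -d bpm deploy manifests/bosh-lite.yml -o <(echo '[{"type":"replace","path":"/instance_groups/name=bpm/instances","value":2}]')
$ bosh -n -d bpm restart
$ bosh -n -d bpm restart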

AFAICT, the issue here is that the warden CPI mounts /sys/fs/cgroup into the "VM" container directly, which means that modifications BPM makes under that directory on one VM conflict with the modifications BPM is making on another VM. So deleting processes in BPM leads to errors because the entries under the cgroup hierarchy are also in use by another container.
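
A quick way to check that from inside an affected VM (just a sketch, reusing the paths from the runc error above): look at how the net_cls,net_prio controller is mounted into the container, and whether the cgroup directory runc is trying to remove still has attached processes or child cgroups:

# grep net_cls /proc/self/mountinfo
# cat /sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-test-server/cgroup.procs
# ls -d /sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-test-server/*/ 2>/dev/null

If cgroup.procs is still non-empty after bpm on this VM has stopped its processes, the remaining entries are presumably coming from the other VM sharing the mount, which would explain rmdir failing with "device or resource busy".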

I am still not sure why we don't see the issue on the first restart, or what the best way to address it is.

julian-hj avatar Apr 07 '25 16:04 julian-hj

@abg this should be fixed in the 1.817 version of the jammy stemcell, as soon as that gets published. Can you retest and confirm?

julian-hj avatar Apr 15 '25 19:04 julian-hj

@abg we believe this was fixed by https://github.com/cloudfoundry/bosh-linux-stemcell-builder/pull/421. If you're still seeing this issue with the latest BPM release, feel free to reopen it.

aramprice avatar Apr 29 '25 21:04 aramprice