bucc
bucc vm does not survive a restart
Hi,
we still use 0.92 on OpenStack. I observed that after a restart of the bucc VM (e.g. via bucc ssh -> shutdown -r now) the VM does not come up properly again: it reboots, but no monit processes are running and the persistent disk does not seem to be mounted, see below. Any ideas? Is it a stemcell problem?
bosh/0:/var/vcap/bosh/bin# monit summary
/var/vcap/monit/job/0024_nats.monitrc:3: Warning: the executable does not exist '/var/vcap/jobs/bpm/bin/bpm'
/var/vcap/monit/job/0024_nats.monitrc:4: Warning: the executable does not exist '/var/vcap/jobs/bpm/bin/bpm'
/var/vcap/monit/job/0023_postgres.monitrc:3: Warning: the executable does not exist '/var/vcap/jobs/postgres/bin/postgres_ctl'
/var/vcap/monit/job/0023_postgres.monitrc:5: Warning: the executable does not exist '/var/vcap/jobs/postgres/bin/postgres_ctl'
[...]
/var/vcap/monit/job/0001_director-bosh-dns.monitrc:3: Warning: the executable does not exist '/var/vcap/jobs/bpm/bin/bpm'
/var/vcap/monit/job/0001_director-bosh-dns.monitrc:4: Warning: the executable does not exist '/var/vcap/jobs/bpm/bin/bpm'
monit: no status available -- the monit daemon is not running
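For reference, this is roughly how the broken state can be confirmed after the reboot; a sketch assuming the standard BOSH agent mount points and log location, so adjust paths if your setup differs:

bucc ssh
sudo -i
mount | grep -E '/var/vcap/(store|data)'   # empty output means the persistent/ephemeral disks were not remounted
df -h /var/vcap/store                      # in the broken state this fails or shows the root disk instead
monit summary                              # reports "the monit daemon is not running"
tail -n 50 /var/vcap/bosh/log/current      # bosh-agent log, may hint at why mounting/bootstrap failed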
I only know of this problem in combination with the VirtualBox CPI; it could have gone wrong for several reasons. Is this a one-time occurrence or does it happen every time?
If the disk recorded in the state file ./state/state.json still exists, then you should be able to just do a bucc up.
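A minimal sketch of that check, assuming the default bucc layout where the create-env state lives in ./state/state.json (field names vary between CPI and CLI versions, so grep rather than relying on an exact key):

grep -i disk state/state.json        # should show a disk CID if a persistent disk was created
# compare that CID against what the IaaS reports, e.g. on OpenStack:
# openstack volume show <disk-cid>
bucc up                              # if the disk still exists, this should bring the deployment back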
Yes, it is reproducible. A plain bucc up doesn't help because it detects no changes and therefore does nothing; bucc up --recreate recreates the VM and everything works fine again.
This issue also occurs on vSphere. On reboot, /var/vcap/store and /var/vcap/data are not mounted. Workaround: execute bucc up with the --recreate flag.
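For completeness, a sketch of the workaround sequence as described above (only the VM is rebuilt, the existing persistent disk is re-attached):

bucc up --recreate                          # rebuilds the VM, re-attaches the existing persistent disk
bucc ssh
mount | grep -E '/var/vcap/(store|data)'    # on the VM: both mount points are back
monit summary                               # all jobs report running again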
I think we have the same problem with bucc up --lite --cpi=docker-desktop. When I restart the BOSH instance in Docker, none of the HTTPS requests work anymore.
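Roughly how this case can be hit; a sketch where the container name is a placeholder (the Docker CPI runs the BOSH VM as a container, look up the actual name with docker ps):

bucc up --lite --cpi=docker-desktop
docker ps                                  # find the container backing the BOSH VM
docker restart <bosh-vm-container>         # placeholder name; simulates the restart
source <(bucc env)                         # load BOSH_ENVIRONMENT etc.
bosh env                                   # any HTTPS request against the director now fails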
This is a CPI issue, unfortunately; there is not much we can do about it from a bucc perspective. We can try to fix it in the CPI and make a PR there, if anyone is up for that?
@ramonskie can you explain this in more detail? If I understood you correctly, this issue occurs with at least the OpenStack, Docker, vSphere and VirtualBox CPIs.
I have not seen this issue occur on vSphere, only on Docker/VirtualBox, and that is due to how the disks are mounted via those specific CPIs in combination with the BOSH agent.
See this long-standing open issue: https://github.com/cppforlife/bosh-virtualbox-cpi-release/issues/7. So in order to fix this, someone should fix those issues in the CPI/agent; unfortunately we cannot duct-tape a fix into bucc in this case. The only thing we can do is either let the BOSH team know and let them prioritize the work, or fix it ourselves and make a PR to BOSH.
Well, we are facing this issue with the vSphere CPI, and @damzog, who opened this issue, uses the OpenStack CPI. That's why I'm asking. To me it sounds like it is not only a bug in the Docker/VirtualBox CPIs but in some other component. :(
Is it reproducible? Have you already done some preliminary debugging of this issue? We are testing on vSphere with full upgrade scenarios etc. and have not seen these kinds of errors yet.
Yes, I can reproduce this behaviour; I just did it again. We noticed the issue while performing failover tests (e.g. vSphere HA moving and restarting the VM), but it is also reproducible by simply rebooting the bucc VM via the vSphere GUI or by using govc vm.power -r=true.
As far as we know, updating bucc is not affected by this issue.
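A repro sketch for the vSphere case described above; the VM path is a placeholder for whatever your bucc VM is called in the inventory:

govc vm.power -r=true <path-to-bucc-vm>    # or reboot the VM via the vSphere GUI
# once the VM is back up, bucc ssh plus the mount/monit checks earlier in this thread
# show the same broken state; bucc up --recreate gets it healthy again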