bucc
bucc vm does not survive a restart
Hi,
we still use 0.92 on OpenStack. I observed that after a restart of the bucc VM (e.g. via bucc ssh -> shutdown -r now) the VM does not come up properly again: it reboots, but no monit processes are running and the persistent disk does not seem to be mounted, see below. Any ideas? Is it a stemcell problem?
bosh/0:/var/vcap/bosh/bin# monit summary
/var/vcap/monit/job/0024_nats.monitrc:3: Warning: the executable does not exist '/var/vcap/jobs/bpm/bin/bpm'
/var/vcap/monit/job/0024_nats.monitrc:4: Warning: the executable does not exist '/var/vcap/jobs/bpm/bin/bpm'
/var/vcap/monit/job/0023_postgres.monitrc:3: Warning: the executable does not exist '/var/vcap/jobs/postgres/bin/postgres_ctl'
/var/vcap/monit/job/0023_postgres.monitrc:5: Warning: the executable does not exist '/var/vcap/jobs/postgres/bin/postgres_ctl'
[...]
/var/vcap/monit/job/0001_director-bosh-dns.monitrc:3: Warning: the executable does not exist '/var/vcap/jobs/bpm/bin/bpm'
/var/vcap/monit/job/0001_director-bosh-dns.monitrc:4: Warning: the executable does not exist '/var/vcap/jobs/bpm/bin/bpm'
monit: no status available -- the monit daemon is not running
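For reference, this is roughly how the broken state can be confirmed after the reboot; a sketch assuming the standard BOSH agent mount points and log location, so adjust paths if your setup differs:

bucc ssh
sudo -i
mount | grep -E '/var/vcap/(store|data)'   # empty output means the persistent/ephemeral disks were not remounted
df -h /var/vcap/store                      # in the broken state this fails or shows the root disk instead
monit summary                              # reports "the monit daemon is not running"
tail -n 50 /var/vcap/bosh/log/current      # bosh-agent log, may hint at why mounting/bootstrap failed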
I only know of this problem in combination with the VirtualBox CPI; it could have gone wrong for several reasons. Is this a one-time occurrence or does it happen every time?
If the disk recorded in the state file ./state/state.json still exists, then you should be able to just do a bucc up.
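A minimal sketch of that check, assuming the default bucc layout where the create-env state lives in ./state/state.json (field names vary between CPI and CLI versions, so grep rather than relying on an exact key):

grep -i disk state/state.json        # should show a disk CID if a persistent disk was created
# compare that CID against what the IaaS reports, e.g. on OpenStack:
# openstack volume show <disk-cid>
bucc up                              # if the disk still exists, this should bring the deployment back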
Yes, it is reproducible. A plain bucc up doesn't help because it detects no changes and therefore does nothing; bucc up --recreate recreates the VM and everything works fine again.
This issue also occurs on vSphere. On reboot, /var/vcap/store and /var/vcap/data are not mounted. Workaround: execute bucc up with the --recreate flag.
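For completeness, a sketch of the workaround sequence as described above (only the VM is rebuilt, the existing persistent disk is re-attached):

bucc up --recreate                          # rebuilds the VM, re-attaches the existing persistent disk
bucc ssh
mount | grep -E '/var/vcap/(store|data)'    # on the VM: both mount points are back
monit summary                               # all jobs report running again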
I think we have the same problem with bucc up --lite --cpi=docker-desktop. When I restart the BOSH instance in Docker, none of the HTTPS requests work anymore.
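Roughly how this case can be hit; a sketch where the container name is a placeholder (the Docker CPI runs the BOSH VM as a container, look up the actual name with docker ps):

bucc up --lite --cpi=docker-desktop
docker ps                                  # find the container backing the BOSH VM
docker restart <bosh-vm-container>         # placeholder name; simulates the restart
source <(bucc env)                         # load BOSH_ENVIRONMENT etc.
bosh env                                   # any HTTPS request against the director now fails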
This is a CPI issue, unfortunately; there is not much we can do about it from a bucc perspective. We can try to fix it in the CPI and make a PR there, if anyone is up for that?
@ramonskie can you explain this in more detail? If I understood you correctly, this issue occurs with at least the OpenStack, Docker, vSphere and VirtualBox CPIs.
I have not seen this issue occur on vSphere, only on Docker/VirtualBox, and that is due to how the disks are mounted via those specific CPIs in combination with the BOSH agent.
See this long-standing open issue: https://github.com/cppforlife/bosh-virtualbox-cpi-release/issues/7. So in order to fix this, someone should fix those issues in the CPI/agent; unfortunately we cannot duct-tape a fix into bucc in this case. The only thing we can do is either let the BOSH team know and let them prioritize the work, or fix it ourselves and make a PR to BOSH.
Well, we are facing this issue with the vSphere CPI, and @damzog, who opened this issue, uses the OpenStack CPI. That's why I'm asking. To me it sounds like it is not only a bug in the Docker/VirtualBox CPIs but in some other component. :(
Is it reproducible? Have you already done some preliminary debugging of this issue? We are testing on vSphere with full upgrade scenarios etc. and have not seen these kinds of errors yet.
Yes, I can reproduce this behaviour; I just did it again. We noticed the issue while performing failover tests (e.g. vSphere HA moving and restarting the VM), but it is also reproducible by simply rebooting the bucc VM via the vSphere GUI or by using govc vm.power -r=true.
As far as we know, updating bucc is not affected by this issue.
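A repro sketch for the vSphere case described above; the VM path is a placeholder for whatever your bucc VM is called in the inventory:

govc vm.power -r=true <path-to-bucc-vm>    # or reboot the VM via the vSphere GUI
# once the VM is back up, bucc ssh plus the mount/monit checks earlier in this thread
# show the same broken state; bucc up --recreate gets it healthy again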