bosh-linux-stemcell-builder
Risk of false positive in systemd rc.local logic when execution of firstboot.sh fails
The existing systemd rc.local implementation risks a false positive if execution of firstboot.sh fails. This is because firstboot.sh runs in a new shell, which does not inherit the set -e option from rc.local. rc.local still exits with status 0 even when execution of firstboot.sh fails. When that happens, no host key is generated, which then results in SSH failure on the provisioned VM.
Would it be feasible to add retry logic to rc.local for when firstboot.sh fails, to mitigate this risk of a false positive?
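The shell behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the actual stemcell scripts: the temporary file stands in for firstboot.sh, the `sh -e -c` invocation stands in for rc.local running under set -e, and the failing `false` command is an assumed placeholder for whatever step fails on a real VM.

```shell
#!/bin/sh
# Hypothetical stand-in for /root/firstboot.sh: its first command fails,
# but the script keeps running because set -e is not active inside it.
child=$(mktemp)
cat > "$child" <<'EOF'
false    # simulated failing step
echo "child continued after failure"
EOF

# Hypothetical stand-in for rc.local: the parent shell runs with -e,
# then starts the child in a new shell, which does not inherit -e.
parent_status=0
child_output=$(sh -e -c 'sh "$1"' parent "$child") || parent_status=$?
rm -f "$child"

echo "child output: $child_output"
echo "parent exit status: $parent_status"
```

Under dash or bash the parent should report exit status 0 even though a command inside the child failed, because the child's own exit status comes from its last command.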
Have you already thought of a solution, or fixed this issue locally already?
Retry logic within rc.local: at first glance it seems to me that this could also cause issues. And have you seen how it fails (error messages, etc.)?
Retry logic would look something like this:
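# Run firstboot.sh only if it has not already completed.
if [ ! -e /root/firstboot_done ]; then
    if [ -e /root/firstboot.sh ]; then
        MAX_RETRIES=5
        COUNT=0
        # Retry until firstboot.sh succeeds or the retry limit is reached.
        while [ "$COUNT" -lt "$MAX_RETRIES" ]; do
            if /root/firstboot.sh; then
                break
            fi
            COUNT=$((COUNT+1))
        done
        # COUNT only reaches MAX_RETRIES when every attempt failed.
        if [ "$COUNT" -eq "$MAX_RETRIES" ]; then
            echo "Max retries reached. Exiting..."
            exit 1
        fi
    fi
    touch /root/firstboot_done
fi
exit 0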
The error, IIRC, looks like:
rc.local[791]: debconf: DbDriver "config": /var/cache/debconf/config.dat is locked by another process: Resource temporarily unavailable
So when this happens, rc.local does not generate a host key.
Have you already thought of a solution, or fixed this issue locally already?
Yes, I modified rc.local with retry logic (pretty similar to what you posted above, but with an additional sleep in each loop). From what I have observed, with a maximum of 5 retries the issue is mitigated.
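A retry loop with a sleep between attempts might be sketched as follows. This is an illustration under stated assumptions, not the modified rc.local itself: the `retry` helper and the `flaky_firstboot` function are hypothetical names, the function simulates a firstboot.sh that fails twice before succeeding, and the delay is set to 0 seconds only so the sketch runs instantly (the delay used in practice is not stated in this thread).

```shell
#!/bin/sh
# retry MAX DELAY CMD...: run CMD up to MAX times, sleeping DELAY seconds
# between failed attempts. Returns 0 on the first success, 1 otherwise.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt/$max failed" >&2
        # No need to sleep after the final attempt.
        if [ "$attempt" -lt "$max" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Hypothetical stand-in for /root/firstboot.sh: fails twice, then succeeds.
calls=0
flaky_firstboot() {
    calls=$((calls + 1))
    [ "$calls" -ge 3 ]
}

# Delay of 0 is an assumption for the demo; a real script would wait longer.
if retry 5 0 flaky_firstboot; then
    result="succeeded after $calls attempts"
else
    result="gave up"
fi
echo "$result"
```

In rc.local the call would presumably become something like `retry 5 <delay> /root/firstboot.sh`, with the firstboot_done marker touched only on success.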
Retry logic within rc.local: at first glance it seems to me that this could also cause issues.
Could you elaborate on what issues retry logic would cause? I'm not sure how often host key generation fails with the current code without retries, but it seems to be rare since I don't see similar reports.
We experienced this issue with noble and fixed it there with https://github.com/cloudfoundry/bosh-linux-stemcell-builder/commit/b4517f150f0f2cb7237138648f0c0c5c96ef7aa1 so we can backport this to jammy.