bosh-linux-stemcell-builder

Risk of false positive in systemd rc.local logic when execution of firstboot.sh fails

Open AJLfleos opened this issue 10 months ago • 2 comments

The existing systemd rc.local implementation risks a false positive if execution of firstboot.sh fails. This is because firstboot.sh runs in a new shell, which does not inherit the set -e option from rc.local, so rc.local still exits with status 0 even when firstboot.sh fails. When this happens, no host key is generated, which results in SSH failures on the provisioned VM.
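One way this can play out is reproducible in isolation. The following is a minimal sketch, not the stemcell's actual scripts: it writes two hypothetical stand-ins for firstboot.sh into a temporary directory and runs them from a parent shell that has set -e enabled. The file names, and the use of `false` as a placeholder for the failing step, are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch: a script run as a child process does not inherit `set -e`
# from the shell that calls it.
set -e

workdir=$(mktemp -d)

# Hypothetical stand-in for firstboot.sh. One step fails, but the script has
# no `set -e` of its own, so it keeps going and exits with the status of its
# last command.
cat > "$workdir/fake_firstboot.sh" <<'EOF'
false                      # placeholder for a failing step
echo "firstboot finished"  # still runs, so the script exits 0
EOF

# The caller sees exit status 0 and carries on as if nothing went wrong.
bash "$workdir/fake_firstboot.sh"
child_status=$?
echo "child exit status: $child_status"

# The same script with its own `set -e` stops at the failing step and
# reports the failure to the caller.
cat > "$workdir/strict_firstboot.sh" <<'EOF'
set -e
false
echo "firstboot finished"
EOF

strict_status=0
bash "$workdir/strict_firstboot.sh" || strict_status=$?
echo "strict child exit status: $strict_status"

rm -rf "$workdir"
```

In this sketch the first child reports status 0 despite the failed step, while the second reports a non-zero status that the caller can act on.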

Would it be feasible to add retry logic to rc.local for when firstboot.sh fails, to mitigate the risk of a false positive?

AJLfleos, Mar 27 '24 18:03

Have you already thought of a solution, or fixed this issue locally?

Regarding retry logic within rc.local: at first glance, it seems to me that this could also cause issues. Have you seen how it fails (error messages, etc.)?

Retry logic would look something like this:

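if [ ! -e /root/firstboot_done ]; then
    if [ -e /root/firstboot.sh ]; then
        MAX_RETRIES=5
        COUNT=0
        while [ "$COUNT" -lt "$MAX_RETRIES" ]; do
            # Run the script as the `if` condition so that a failure does
            # not trip set -e in rc.local before the retry can happen.
            if /root/firstboot.sh; then
                break
            fi
            COUNT=$((COUNT+1))
        done
        # All attempts failed: report it instead of marking firstboot done.
        if [ "$COUNT" -eq "$MAX_RETRIES" ]; then
            echo "Max retries reached. Exiting..."
            exit 1
        fi
    fi
    touch /root/firstboot_done
fi
exit 0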

ramonskie, Apr 15 '24 13:04

The error, IIRC, looks like this:

 rc.local[791]: debconf: DbDriver "config": /var/cache/debconf/config.dat is locked by another process: Resource temporarily unavailable

When this happens, rc.local does not generate a host key.


> Have you already thought of a solution, or fixed this issue locally?

Yes, I modified rc.local with retry logic (very similar to what you posted above, but with an additional sleep in each iteration). From what I have observed, a maximum of 5 retries mitigated the issue.
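A retry loop with a pause between attempts could be sketched as below. This is an illustration under stated assumptions, not the modified rc.local itself: the `retry` helper, its argument order, and the `flaky` demo command are all hypothetical names, and the delay of 0 seconds is used only so the demo finishes quickly.

```shell
#!/bin/bash
# Sketch: retry a command a fixed number of times, sleeping between attempts.

# retry MAX_RETRIES DELAY_SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
retry() {
    local max_retries=$1
    local delay=$2
    shift 2
    local count=0
    while [ "$count" -lt "$max_retries" ]; do
        # Running the command as the `if` condition keeps a failure from
        # aborting a caller that has `set -e` enabled.
        if "$@"; then
            return 0
        fi
        count=$((count + 1))
        echo "attempt $count of $max_retries failed" >&2
        # Pause before the next attempt, but not after the last one.
        if [ "$count" -lt "$max_retries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Demo with a stand-in command that fails twice and then succeeds.
flaky_calls=0
flaky() {
    flaky_calls=$((flaky_calls + 1))
    [ "$flaky_calls" -ge 3 ]
}

demo_status=0
retry 5 0 flaky || demo_status=$?
echo "status=$demo_status calls=$flaky_calls"
```

In rc.local, a call along the lines of `retry 5 2 /root/firstboot.sh || exit 1` would match the 5 retries mentioned here; the 2-second delay is an assumed value, since the thread does not state one.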


> Regarding retry logic within rc.local: at first glance, it seems to me that this could also cause issues.

Could you elaborate on what issues retry logic would cause? I'm not sure how often host key generation fails with the current code without retries, but it seems to be rare, since I don't see similar reports.

AJLfleos, Apr 15 '24 14:04

We experienced this issue on noble and fixed it there with https://github.com/cloudfoundry/bosh-linux-stemcell-builder/commit/b4517f150f0f2cb7237138648f0c0c5c96ef7aa1. We can backport this to jammy.

ramonskie, May 22 '24 10:05