bunkerweb icon indicating copy to clipboard operation
bunkerweb copied to clipboard

improve scheduler entrypoint

Open ppenguin opened this issue 2 years ago • 2 comments

Fixes poorly defined behaviour when the scheduler main process crashes.

The original code uses wait $pid to wait until the main.py process exits. This means that if /var/run/bunkerweb/scheduler.pid still exists after wait $pid returns, the contained pid therein is per definition invalid. I.e. if we should assume that a clean exit of main.py means the pid file should have been cleaned up, its mere existence at this point indicates a crash of the process.

This PR fixes the logic issues and makes crashes visible in the logs. It additionally tries to restart a crashed process after a delay.

Possibly usage of the /var/tmp/bunkerweb/scheduler.healthy file can also be removed (if its only function was to check for crashes), and the entrypoint could be made a bit leaner still.

Bonus info: the scheduler process crashed silently (in a podman container) when memory was limited to 100M, with >= 150M it appears to work.

ppenguin avatar Jan 11 '24 16:01 ppenguin

(I just noticed that the bw/entrypoint.sh suffers from the same issue. I.e. it can end up in an endless loop for an invalid pid file)

ppenguin avatar Jan 11 '24 20:01 ppenguin

... but the "fix" for the BW entrypoint doesn't seem to do the right thing: from the nginx main processes POV it is ok, but the problem seems to be that if BW restarts a crashed nginx process by itself rather than being restarted by the scheduler, the scheduler doesn't know it should then immediately initiate the config requests to the BW API and the new BW process is then only running in its default config.

So let's keep the scope of this PR limited to the scheduler's entrypoint, and solve the bw/entrypoint later.

ppenguin avatar Jan 12 '24 09:01 ppenguin