tectonic-installer
Downloads of assets during bootkube don't restart properly
If the Internet is choppy during a download the system will just hang and not resume properly:

    core@controller1 ~ $ sudo systemctl stop bootkube
    core@controller1 ~ $ journalctl -u bootkube -f
    -- Logs begin at Thu 2017-05-18 14:10:26 UTC. --
    May 18 14:10:52 controller1.dev-env.local bash[1413]: Downloading ACI: 68.2 KB/18.1 MB
    May 18 14:10:53 controller1.dev-env.local bash[1413]: Downloading ACI: 138 KB/18.1 MB
    May 18 14:10:54 controller1.dev-env.local bash[1413]: Downloading ACI: 207 KB/18.1 MB
    May 18 14:11:12 controller1.dev-env.local bash[1413]: Downloading ACI: 242 KB/18.1 MB
    May 18 14:11:13 controller1.dev-env.local bash[1413]: Downloading ACI: 277 KB/18.1 MB
    May 18 14:11:15 controller1.dev-env.local bash[1413]: Downloading ACI: 364 KB/18.1 MB
    May 18 14:50:13 controller1.dev-env.local systemd[1]: bootkube.service: Main process exited, code=killed, status=15/TERM
    May 18 14:50:13 controller1.dev-env.local systemd[1]: Stopped Bootstrap a Kubernetes cluster.
    May 18 14:50:13 controller1.dev-env.local systemd[1]: bootkube.service: Unit entered failed state.
    May 18 14:50:13 controller1.dev-env.local systemd[1]: bootkube.service: Failed with result 'signal'.
    ^C
Expected behavior: the download should not hang, and it should try to resume.
This download happens using rkt. As far as I know, resumable downloads are not supported in the docker2aci library. /cc'ing @lucab to verify, and also to brainstorm whether this is something we should tackle in rkt or in the calling systemd service unit.
I'm lacking some details here, so just some quick observations:
- this seems to be an actual ACI, so it is not going through docker2aci. On the other hand, I still think rkt doesn't support resumption for ACIs, and that also depends on the remote supporting chunking.
- ~40mins for 18MB is a bit more than a choppy internet :smile: On a serious note, I'm wondering why retransmission didn't kick in, and why the whole download process didn't time out at some point.
- tectonic may want to pre-pull assets in dedicated units with reasonable timeouts to gracefully handle such pathological network cases.
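To make the "depends on the remote supporting chunking" point concrete, here is a minimal sketch of how a resumable fetch usually works over HTTP. It is not how rkt downloads ACIs; the function names, the `.part` file convention and the 30 second timeout are all assumptions made for illustration.

```python
import os
import urllib.request


def build_resume_request(url, partial_path):
    """Return (request, offset) asking only for the bytes we do not have yet."""
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    req = urllib.request.Request(url)
    if offset > 0:
        # "bytes=N-" asks for everything from byte offset N to the end.
        req.add_header("Range", f"bytes={offset}-")
    return req, offset


def open_mode_for(status, offset):
    """Pick the file mode from the server's answer.

    206 Partial Content: the server honoured the range, so append.
    200 OK: the server sent the whole body (range ignored), so start over.
    """
    if status == 206 and offset > 0:
        return "ab"
    if status == 200:
        return "wb"
    raise RuntimeError(f"unexpected HTTP status {status}")


def fetch(url, partial_path, timeout_s=30):
    """Download url into partial_path, resuming if the remote allows it."""
    req, offset = build_resume_request(url, partial_path)
    # timeout_s bounds the connect and every blocking read, so a stalled
    # transfer raises an error instead of hanging indefinitely.
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        with open(partial_path, open_mode_for(resp.status, offset)) as out:
            while chunk := resp.read(64 * 1024):
                out.write(chunk)
```

If the remote ignores the `Range` header, the sketch falls back to a full download, which is why resumption cannot be guaranteed from the client side alone. A partial file that is already complete would make the server answer 416, which `urlopen` raises as an error; that case is left unhandled here.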
Earlier pre-pull sounds easy to achieve.
Late followup on this: a better behavior here would be to have a fail-restarting bootkube unit. However, that unit is a `oneshot` service, which doesn't support restarts. Changing this to a `simple` type service would work, but there are further issues about the bootkube process itself not being restartable.
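As a rough illustration of the "bounded timeout plus restart on failure" behavior discussed in this thread, the sketch below wraps an arbitrary command. It is not part of tectonic-installer or bootkube; `run_with_retries` and its parameters are hypothetical names, and it only helps when the wrapped command is safe to run again, which per the comment above is not yet true of bootkube itself.

```python
import subprocess
import time


def run_with_retries(cmd, timeout_s, attempts=3, backoff_s=1.0):
    """Run cmd, killing it after timeout_s; retry up to `attempts` times.

    Returns True as soon as one run exits with status 0, False otherwise.
    """
    for attempt in range(1, attempts + 1):
        try:
            # check=True turns a non-zero exit status into an exception;
            # timeout kills the child process if it runs too long.
            subprocess.run(cmd, check=True, timeout=timeout_s)
            return True
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt}: no result after {timeout_s}s, killed")
        except subprocess.CalledProcessError as err:
            print(f"attempt {attempt}: exited with status {err.returncode}")
        if attempt < attempts:
            # Linear backoff between attempts.
            time.sleep(backoff_s * attempt)
    return False
```

A pre-pull step could then call something like `run_with_retries(["rkt", "fetch", "<image>"], timeout_s=300)`, where `<image>` and the 300 second limit are placeholders. Note that the timeout kills only the direct child process, so a command that spawns its own children may need extra cleanup.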