
Downloads of assets during bootkube don't restart properly

Open xgerman opened this issue 7 years ago • 4 comments

If the Internet is choppy during a download the system will just hang and not resume properly:

core@controller1 ~ $ sudo systemctl stop bootkube
core@controller1 ~ $ journalctl -u bootkube -f
-- Logs begin at Thu 2017-05-18 14:10:26 UTC. --
May 18 14:10:52 controller1.dev-env.local bash[1413]: Downloading ACI: 68.2 KB/18.1 MB
May 18 14:10:53 controller1.dev-env.local bash[1413]: Downloading ACI: 138 KB/18.1 MB
May 18 14:10:54 controller1.dev-env.local bash[1413]: Downloading ACI: 207 KB/18.1 MB
May 18 14:11:12 controller1.dev-env.local bash[1413]: Downloading ACI: 242 KB/18.1 MB
May 18 14:11:13 controller1.dev-env.local bash[1413]: Downloading ACI: 277 KB/18.1 MB
May 18 14:11:15 controller1.dev-env.local bash[1413]: Downloading ACI: 364 KB/18.1 MB
May 18 14:50:13 controller1.dev-env.local systemd[1]: bootkube.service: Main process exited, code=killed, status=15/TERM
May 18 14:50:13 controller1.dev-env.local systemd[1]: Stopped Bootstrap a Kubernetes cluster.
May 18 14:50:13 controller1.dev-env.local systemd[1]: bootkube.service: Unit entered failed state.
May 18 14:50:13 controller1.dev-env.local systemd[1]: bootkube.service: Failed with result 'signal'.
^C

Expected behavior: the download should not hang, and it should try to resume when the connection drops.

xgerman avatar May 18 '17 15:05 xgerman

This download happens using rkt. As far as I know, resumable downloads are not supported in the docker2aci library. /cc'ing @lucab to verify, and also to brainstorm whether this is something we should tackle in rkt or in the calling systemd service unit.

s-urbaniak avatar May 19 '17 14:05 s-urbaniak

I'm lacking some details here, so just some quick observations:

  • this seems to be an actual ACI, so it is not going through docker2aci. On the other hand, I still think rkt doesn't support resumption for ACIs, and that also depends on the remote supporting chunking.
  • ~40 mins for 18 MB is a bit more than a choppy internet :smile: On a serious note, I'm wondering why retransmission didn't kick in, and why the whole download process didn't time out at some point.
  • tectonic may want to pre-pull assets in dedicated units with reasonable timeouts to gracefully handle such pathological network cases.
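
A minimal sketch of the pre-pull idea in the last point, assuming a dedicated step runs before bootkube and wraps the fetch in a per-attempt timeout with a bounded number of retries. The helper name `retry_with_timeout` and the `RETRY_DELAY` variable are hypothetical and not part of tectonic or rkt; `timeout` is the coreutils command. Each retry starts the download over, so this only avoids the indefinite hang and does not resume a partial download.

```shell
# Hypothetical helper: run a command with a per-attempt timeout and a
# bounded number of retries.
# usage: retry_with_timeout <attempts> <timeout-seconds> <command> [args...]
retry_with_timeout() {
  local attempts=$1 limit=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    # timeout(1) terminates the command once the limit expires and then
    # returns a non-zero status, so a stalled download counts as a failure.
    if timeout "$limit" "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    # Pause between attempts (default 5 seconds), but not after the last one.
    if ((i < attempts)); then
      sleep "${RETRY_DELAY:-5}"
    fi
  done
  return 1
}
```

A pre-pull step could then call something like `retry_with_timeout 3 300 rkt fetch "$IMAGE"`, where `$IMAGE` is a placeholder for whichever image the bootkube unit needs, and fail early instead of leaving bootkube stuck in a download.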

lucab avatar May 19 '17 14:05 lucab

Earlier pre-pull sounds easy to achieve.

alexsomesan avatar May 19 '17 15:05 alexsomesan

Late followup on this: a better behavior here would be to have a bootkube unit that restarts on failure. However, that unit is a oneshot service, which doesn't support restarts. Changing it to a Type=simple service would work, but there are further issues with the bootkube process itself not being restartable.

lucab avatar Aug 30 '17 11:08 lucab