smee
Add retries to iPXE script when fetching files
The iPXE file fetches can run into temporary network issues when downloading the kernel/initramfs files. We should add some retry logic.
Expected Behaviour
Temporary network failures do not cause the iPXE boot to fail.
Current Behaviour
iPXE boot will fail if there's a network issue.
Can someone identify which line in boots does the fetching that is missing the retry?
It would help the person who might want to take this issue on.
If I'm not mistaken, @mmlb is talking about this area in job/ipxe.go, which is in the function serveBootScript.
Nope, it's actually in the installers: https://github.com/tinkerbell/boots/blob/ba3a3fef424ebfd7125b08ae99dcb9631bc911a8/installers/osie/main.go#L54 We'd probably need a bounded loop that breaks on success, and something similar for fetching the initrd and posting events back.
If it helps, I've had good luck with using this library for exponential backoff+jitter, and have seen it used elsewhere in Tinkerbell (tink?): https://pkg.go.dev/github.com/cenkalti/backoff/v4
The HTTP request is not made from boots' side. It's iPXE doing the fetches, so we'd need to do the retries/backoff in the iPXE script itself.