Add retries to iPXE script when fetching files

Open mmlb opened this issue 4 years ago • 5 comments

The iPXE file fetches can run into temporary network issues when downloading the kernel/initramfs files. We should add some retry logic.

Expected Behaviour

Temporary network failures do not cause the iPXE boot to fail.

Current Behaviour

iPXE boot will fail if there's a network issue.

Mar 25 '21 13:03 mmlb

Can someone identify which line in boots does the fetching that is missing retries?

It would help the person who might want to take this issue on.

Aug 27 '21 03:08 tstromberg

If I'm not mistaken, @mmlb is talking about this area in job/ipxe.go, which is in the function serveBootScript.

Sep 01 '21 10:09 jacobweinstock

Nope, it's actually in the installers: https://github.com/tinkerbell/boots/blob/ba3a3fef424ebfd7125b08ae99dcb9631bc911a8/installers/osie/main.go#L54 We'd probably need a bounded loop that breaks on success. The same goes for fetching the initrd and posting events back.

Sep 02 '21 23:09 mmlb

If it helps, I've had good luck with using this library for exponential backoff+jitter, and have seen it used elsewhere in Tinkerbell (tink?): https://pkg.go.dev/github.com/cenkalti/backoff/v4

Sep 02 '21 23:09 tstromberg

The HTTP request doesn't come from boots' side. It's iPXE doing the fetches, so we'd need to do the retries/backoff in the iPXE script itself.

Sep 03 '21 17:09 mmlb