
Implement package fetch retries

Open joshua-holmes opened this issue 3 months ago • 9 comments

UPDATE

See here for most recent list of features in this solution: https://github.com/ziglang/zig/pull/25120#issuecomment-3448003430

Problem

Like in #17472, the package manager does not retry fetching when communication to the server fails in a way that could possibly be resolved by retrying the fetch.

Solution

Automatically retry the fetch in these scenarios during http fetches:

  1. Connecting to the server fails
  2. Sending the payload to the server fails
  3. Server returns a non-200 status code

and these scenarios during git+http fetches:

  1. Discovering remote git server capabilities fails
  2. Creating a fetch stream fails

This PR does not expose the fetch config to the end user, for reasons you can read about in the next section. Definitely worth discussing, but getting this PR out there as-is doesn't seem like a bad idea to me.
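
For orientation, here is a minimal sketch of the shape this takes. It is illustrative only: retry_count, retry_delay_ms, fetchWithRetries, and fetchOnce are made-up names standing in for the actual code in Fetch.zig.

const std = @import("std");

// Hypothetical knobs; this PR does not expose them to the user.
const retry_count: usize = 3;
const retry_delay_ms: u64 = 1000;

// Sketch: wrap a single fetch attempt (fetchOnce stands in for the real
// connect/send/receive logic) and retry it up to retry_count times.
fn fetchWithRetries(comptime fetchOnce: fn () anyerror!void) anyerror!void {
    var attempt: usize = 0;
    while (true) {
        fetchOnce() catch |err| {
            if (attempt == retry_count) return err;
            attempt += 1;
            // Wait before the next attempt.
            std.Thread.sleep(retry_delay_ms * std.time.ns_per_ms);
            continue;
        };
        return;
    }
}

The real code only retries in the scenarios listed above; failures that retrying cannot fix are returned immediately.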

Major Considerations

  • I thought about how I could expose configuration of the retry behavior to the user (the retry_count and retry_delay_ms). However, this proved difficult, so it is not part of this PR. Here are some thoughts I had, though:
    • I cannot expose config in the build.zig file, because (as far as I can tell) the build.zig file is not run until all the packages are fetched and ready to be built with the source code.
    • I do not want to include this config for each dep (dependency), because how would we handle it for deps of deps? We don't want the lib maintainer to determine how many times I fetch the lib's deps; that's the end user's decision. So deps of deps could inherit the config, but then the end user isn't getting the control they are "promised" by having the config attached to each dep.
    • I do not want to include the config in the body of the manifest. It feels out of place, since that body isn't supposed to be about fetch config for dependencies; it's supposed to define the current package and point to its dependencies.
    • Maybe we allow config to be passed in the body of .dependencies, like the example below, so that .retry_count and other props apply to all dependencies while still being part of the .dependencies object. This can also be parsed easily by checking the type of each field: if the value is a struct, it's a real dep; if it's anything else, it's config, not a real dep. Fun idea, but it's not a part of this PR and shouldn't be. Here is an example of this idea:
      .dependencies = .{
          .retry_count = 3,
          .retry_delay_ms = 1000,
          .zf = .{
              .url = "https://github.com/natecraddock/zf/archive/7aacbe6d155d64d15937ca95ca6c014905eb531f.tar.gz",
              .hash = "zf-0.10.3-OIRy8aiIAACLrBllz0zjxaH0aOe5oNm3KtEMyCntST-9",
              .lazy = true,
          },
          .vaxis = .{
              .url = "git+https://github.com/rockorager/libvaxis#1f41c121e8fc153d9ce8c6eb64b2bbab68ad7d23",
              .hash = "vaxis-0.1.0-BWNV_FUICQAFZnTCL11TUvnUr1Y0_ZdqtXHhd51d76Rn",
              .lazy = true,
          },
      },
      
  • Would be cool to show the retry progress like this after it fails the first time. But it might be worth extending std.Progress.Node first for that, so this is also not implemented in this PR:
    Compile Build Script
    └─ [1/2] Fetch Packages
       ├─ vaxis (retry 1/2)
       └─ zf
    

Minor Considerations

  • I intentionally did not attempt to retry on failures that do not seem resolvable by simply refetching, e.g. failures while receiving the HTTP header; looking through the possible errors there, they mostly come from invalid responses, redirects, or write failures.
  • There are some locations where memory is now freed that wasn't freed before. Since the user may eventually configure Fetch to retry many times, I don't want each refetch of the network data to allocate without freeing, which would leak memory (see the sketch after this list).
  • The Fetch.initResource() method needs to be refactored. But this PR is big enough, so it made sense to me to merge this, then do a refactor in a future PR.
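
To illustrate the memory point above, this is the general pattern rather than the actual Fetch.zig code; buildRequestBody and sendRequest are made-up stand-ins. The idea is that each attempt's allocations are freed when that iteration ends, so many retries cannot accumulate leaks.

const std = @import("std");

// Made-up stand-ins for the real per-attempt work.
fn buildRequestBody(gpa: std.mem.Allocator) ![]u8 {
    return gpa.dupe(u8, "payload");
}
fn sendRequest(body: []const u8) !void {
    if (body.len == 0) return error.ConnectionResetByPeer;
}

fn attemptWithCleanup(gpa: std.mem.Allocator) !void {
    var attempt: usize = 0;
    while (true) : (attempt += 1) {
        const body = try buildRequestBody(gpa);
        // defer fires at the end of every loop iteration, including when we
        // `continue` into a retry, so nothing from a failed attempt leaks.
        defer gpa.free(body);

        sendRequest(body) catch |err| {
            if (attempt >= 3) return err;
            continue;
        };
        return;
    }
}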

joshua-holmes avatar Sep 03 '25 00:09 joshua-holmes

It might be useful to look at the defaults of other package managers for ideas - the cargo source has some good info about retries, including links to other resources: https://doc.rust-lang.org/nightly/nightly-rustc/src/cargo/util/network/retry.rs.html

Some notes:

  • Default retries is 3 (total of 4 attempts)
  • For HTTP failures, try parsing the "Retry-After" header, which could be a number of seconds or an HTTP date string referring to a date in the future
  • Otherwise, first delay is a randomized value between 500ms and 1500ms
  • Subsequent delays are linear backoff 3500ms, 6500ms, etc up to a max of 10 seconds
  • Classification of spurious (retryable) errors in maybe_spurious:
    • Various network-level errors like DNS failure or TCP errors or network timeouts
    • HTTP 5xx or HTTP 429

ehaas avatar Sep 04 '25 16:09 ehaas

@ehaas Yeah, I also took a look at the uv python package manager and found very similar behavior. I created this PR with the intention of keeping it really simple to start with and adding complexity if requested. Better that than the reverse, and I'm new here so I don't know where the tolerances are.

I like that list of notes you jotted down though, so I'll start there.

joshua-holmes avatar Sep 05 '25 01:09 joshua-holmes

Has anyone heard of a need/desire for the Zig HTTP client to have retry logic built into it? If not, I'll continue implementing the retry logic in the Fetch.zig file to avoid abstracting unnecessarily, but if this is a desire, I can implement it at a lower level in std.http.Client.zig.

joshua-holmes avatar Sep 13 '25 22:09 joshua-holmes

These are the errors I have decided are spurious:

fn maybeSpurious(err: anyerror) bool {
    return switch (err) {
        // tcp errors
        std.http.Client.ConnectTcpError.Timeout,
        std.http.Client.ConnectTcpError.ConnectionResetByPeer,
        std.http.Client.ConnectTcpError.ConnectionRefused,
        std.http.Client.ConnectTcpError.NetworkUnreachable,
        std.http.Client.ConnectTcpError.AddressInUse,
        std.http.Client.ConnectTcpError.NameServerFailure,
        std.http.Client.ConnectTcpError.SystemResources,

        // io errors
        std.http.Client.Request.ReceiveHeadError.WriteFailed,
        std.http.Client.Request.ReceiveHeadError.ReadFailed,

        // http errors
        std.http.Client.Request.ReceiveHeadError.HttpRequestTruncated,
        std.http.Client.Request.ReceiveHeadError.HttpConnectionClosing,
        std.http.Client.Request.ReceiveHeadError.HttpChunkTruncated,
        std.http.Client.Request.ReceiveHeadError.HttpChunkInvalid,
        BadHttpStatus.MaybeSpurious,

        Allocator.Error.OutOfMemory,
        => true,
        else => false,
    };
}

I mostly used gitoxide (used by Cargo) and Cargo itself as references, plus my own judgement.

Many of the TCP errors were pretty easy to map from gitoxide, and the ones gitoxide considers spurious made sense to me.

The HTTP errors, on the other hand, did not map well from Cargo, which is really curl under the hood. Several of the errors curl defines have no clean equivalent here; our implementation is simply different, so different things can go wrong. So I derived what I could from Cargo + curl and used my best judgement for the rest.

The only thing I still need to investigate is how to detect a Zig equivalent of the curl EOF error.

I'm working on this as often as I can, but sometimes that's as infrequently as once per week, so it's hard for me to move fast. Plus I'm learning a lot about low-level networking, which doesn't help. That's all for now.

joshua-holmes avatar Sep 21 '25 02:09 joshua-holmes

I still need to test it, but jitter is implemented, spurious error checks are implemented, and it does compile. The only thing left to implement is reading the "Retry-After" header.

joshua-holmes avatar Oct 05 '25 01:10 joshua-holmes

I implemented reading and parsing the Retry-After header! Some notes:

  • I used the MDN Retry-After docs to inform me on how to parse this.
  • The Retry-After header can contain either a positive integer indicating how many seconds to wait, or a timestamp indicating a point in time at which the client can try again (see the sketch below). I used the MDN Date docs to inform me on how to parse the date. If parsing fails, we skip the header.
  • If Retry-After gives us a really long wait time, we cap it at our current max delay, which is 10 sec. Same as Cargo does.
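
Here is a rough sketch of how a Retry-After value could be interpreted under those rules. Names are illustrative, and parseHttpDateToEpochSeconds is a hypothetical placeholder for the actual date parsing; this is not the code in the PR.

const std = @import("std");

const max_delay_s: u64 = 10; // our 10 second cap, same as Cargo

// Hypothetical placeholder for HTTP-date parsing.
fn parseHttpDateToEpochSeconds(text: []const u8) !u64 {
    _ = text;
    return error.Unimplemented;
}

// Returns null when the header can't be used (unparsable, or a date in the
// past), in which case the normal jittered delay applies instead.
fn retryAfterToDelayMs(value: []const u8, now_epoch_s: u64) ?u64 {
    // Form 1: delta seconds, e.g. "Retry-After: 120".
    if (std.fmt.parseInt(u64, value, 10)) |seconds| {
        return @min(seconds, max_delay_s) * std.time.ms_per_s;
    } else |_| {}

    // Form 2: an HTTP date, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT".
    const at_epoch_s = parseHttpDateToEpochSeconds(value) catch return null;
    if (at_epoch_s <= now_epoch_s) return null;
    return @min(at_epoch_s - now_epoch_s, max_delay_s) * std.time.ms_per_s;
}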

All I need to do now is figure out how to unit test the Fetch.zig changes. It's clear that there isn't a whole lot of testing in that file currently, probably because mocking network calls is tricky. I'll do my best, but if I can't, I'll just mock network calls in a bash script and test locally, even if it's not part of the test suite. Then I should be done and ready for review!

joshua-holmes avatar Oct 19 '25 05:10 joshua-holmes

Ok, it's ready for review again. Here are the key features, which closely mirror the list of features @ehaas commented earlier.

Features

  • Default retries is 3 (total of 4 attempts)
    • For some HTTP failures (500+, 429, 404), try parsing the "Retry-After" header, which could be a number of seconds or an HTTP date string referring to a date in the future. See my above comment for how I implemented that.
      • Adheres to our 10 sec max, so we will never wait longer than that, even if "Retry-After" tells us to, same as Cargo.
      • If "Retry-After" is set to a point in time in the past, we ignore the header and calculate our delay (with jitter) as normal.
    • Otherwise, first delay is a randomized value between 500ms and 1500ms
    • Subsequent delays are linear backoff: 3000ms, 6000ms, etc., up to a max of 10 seconds (a rough sketch of the schedule follows this list). Note that Cargo uses 3500 and 6500, etc. instead; I just felt this was simpler.
  • Classification of spurious (retryable) errors can be seen in the code linked here. This is a combination of what Cargo and uv think is spurious, plus my common sense. HTTP status codes 500+, 429, and 404 are considered spurious; others are not.
  • Extended lib/std/time/epoch.zig, which was necessary to convert a date to epoch seconds.
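
For readers of the diff, here is a rough sketch of the delay schedule described above. It is illustrative only (the function name and exact structure are not the real code): attempt 1 gets the randomized 500-1500ms delay, later attempts back off linearly, and a usable Retry-After value replaces this delay, still subject to the 10 second cap.

const std = @import("std");

// attempt is 1 for the first retry, 2 for the second, and so on.
fn delayForAttemptMs(attempt: u32, random: std.Random) u64 {
    if (attempt <= 1) {
        // First retry: randomized jitter between 500 and 1500 ms.
        return 500 + random.uintLessThan(u64, 1001);
    }
    // Later retries: linear backoff (3000, 6000, ...) capped at 10 seconds.
    const linear = 3_000 * @as(u64, attempt - 1);
    return @min(linear, 10_000);
}

With attempt = 2 this gives 3000ms and attempt = 3 gives 6000ms, matching the list above.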

Testing

I unit tested quite a bit of the new code, and the tests pass. I also tested this locally with and without network failures, logging before and after the delays to confirm the configured numbers matched the real delays. I also tested with a local endpoint that returns a "Retry-After" header. It all works really well.

joshua-holmes avatar Oct 26 '25 04:10 joshua-holmes

Marking as draft until I can rebase

joshua-holmes avatar Oct 29 '25 23:10 joshua-holmes

Rebase completed @linusg, thanks for waiting

joshua-holmes avatar Nov 08 '25 23:11 joshua-holmes