lychee Retry policy of network errors

Hello! In our CI we've been getting transient network errors like the following:

[ERROR] https://toml.io/en/ | Network error: Connection reset by server. Server forcibly closed connection

That error message is explicitly transformed here.

The docs page for network errors covers little beyond how to reproduce them, with an emphasis on certificate issues. I can't tell whether retries occur at this level or only for HTTP rate limits (which I see are being further improved in https://github.com/lycheeverse/lychee/pull/1844).

Oct 26 '25 21:10 ofek

Can you reproduce it locally? It works for me.

echo 'https://toml.io/en/' | lychee -
🔍 1 Total (in 0s) ✅ 1 OK 🚫 0 Errors

Without being able to reproduce it, it will be very hard to troubleshoot. Ideally, we'd also find a way to reproduce it with curl or another tool. That way, we can tell whether it's a server issue or a client issue.

Oct 29 '25 20:10 mre

I most certainly wouldn't be able to because it's a flake (that just so happens to manifest more frequently than expected).

I think this specific URL is unrelated to the issue which is mostly about documenting when retries occur.

Oct 29 '25 21:10 ofek

What do you mean by "documenting when retries occur"? lychee uses a very basic retry mechanism. It retries each request up to MAX_RETRIES times; the default is 3. https://github.com/lycheeverse/lychee/blob/d85ed9e6a9d2976af701b3efdf4a0c0483ecac70/lychee-lib/src/client.rs#L42

We do some exponential backoff between each retry. The code is here

https://github.com/lycheeverse/lychee/blob/d85ed9e6a9d2976af701b3efdf4a0c0483ecac70/lychee-lib/src/checker/website.rs#L88-L105

The code that decides whether we should retry a request is here. There are no other conditions.
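For readers who don't want to follow the links, the loop described above can be sketched roughly as follows. This is a simplified, synchronous illustration and not lychee's actual code: the `Outcome` type, the `check_with_retries` function, the 1-second initial wait, and the doubling factor are all assumptions made for the example. Only the default of 3 retries comes from the discussion above.

```rust
use std::time::Duration;

/// Hypothetical result of a single request attempt (not a lychee type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    Ok,
    /// A failure worth retrying, e.g. a connection reset.
    Transient,
    /// A failure that retrying will not fix, e.g. a 404.
    Permanent,
}

/// Runs `attempt` once, then retries up to `max_retries` times while the
/// outcome is transient. Returns the final outcome plus the waits that a
/// real client would have slept between attempts.
fn check_with_retries<F>(
    mut attempt: F,
    max_retries: u32,
    initial_wait: Duration,
) -> (Outcome, Vec<Duration>)
where
    F: FnMut() -> Outcome,
{
    let mut waits = Vec::new();
    let mut wait = initial_wait;
    let mut outcome = attempt();
    let mut retries = 0;
    while retries < max_retries && outcome == Outcome::Transient {
        retries += 1;
        waits.push(wait); // a real client would sleep for `wait` here
        wait = wait.saturating_mul(2); // assumed backoff factor of 2
        outcome = attempt();
    }
    (outcome, waits)
}

fn main() {
    // Simulated flaky server: two transient failures, then success.
    let mut calls = 0;
    let flaky = || {
        calls += 1;
        if calls < 3 { Outcome::Transient } else { Outcome::Ok }
    };
    let (outcome, waits) = check_with_retries(flaky, 3, Duration::from_secs(1));
    println!("{:?} after waiting {:?}", outcome, waits);

    // A permanent failure is returned immediately, without any waits.
    let (outcome, waits) = check_with_retries(|| Outcome::Permanent, 3, Duration::from_secs(1));
    println!("{:?} after waiting {:?}", outcome, waits);
}
```

The sketch records the waits instead of sleeping so that the schedule is easy to inspect.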

I don't know if and how we should document this. Open for suggestions / pull requests. But keep in mind that we have to keep the documentation in sync with the code, which is not always easy.

Oct 30 '25 15:10 mre

Sorry about that, let me be more explicit! What I'm trying to figure out specifically is which types of errors are retried, e.g. HTTP status codes, certificate errors, connection issues, etc.

Oct 30 '25 16:10 ofek

Sure. I tried to summarize the current behavior as a table:

Error Type	Retried?	Examples
5xx Server Errors	✅ Yes	500, 502, 503, 504
408 Request Timeout	✅ Yes	Request took too long
429 Too Many Requests	✅ Yes	Rate limit exceeded
Connection Timeout	✅ Yes	Server didn't respond in time
Connection Reset	✅ Yes	Connection dropped unexpectedly
Connection Aborted	✅ Yes	Connection terminated mid-request
Incomplete Message	✅ Yes	Response cut off before completion
4xx Client Errors	❌ No	400, 401, 403, 404 (except 408, 429)
2xx Success	❌ No	200, 201, 204
3xx Redirects	❌ No	301, 302
Initial Connection Failure	❌ No	Can't reach server at all
Certificate Issues	❌ No	SSL/TLS errors
Invalid Request Body	❌ No	Malformed data
Decoding Errors	❌ No	Can't parse response
Redirect Errors	❌ No	Too many redirects, etc.

As a rule of thumb:

Retries: Temporary problems (server down, network hiccup, timeout)
No Retry: Permanent problems (bad request, auth failure, not found)
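Expressed as code, the table boils down to one small decision function. The following is only a sketch of the table's rows, not lychee's real implementation; the `Failure` enum and the `should_retry` function are hypothetical names chosen for illustration.

```rust
/// Hypothetical failure categories mirroring the rows of the table above.
#[derive(Debug, Clone, Copy)]
enum Failure {
    /// An HTTP response with the given status code.
    Status(u16),
    Timeout,
    ConnectionReset,
    ConnectionAborted,
    IncompleteMessage,
    ConnectFailed,
    Certificate,
    InvalidBody,
    Decode,
    Redirect,
}

/// Returns true for the rows marked "Yes" in the table.
fn should_retry(failure: Failure) -> bool {
    match failure {
        // 408, 429, and all 5xx codes are retried; other statuses are not.
        Failure::Status(code) => matches!(code, 408 | 429 | 500..=599),
        // Temporary network problems.
        Failure::Timeout
        | Failure::ConnectionReset
        | Failure::ConnectionAborted
        | Failure::IncompleteMessage => true,
        // Problems that a retry is unlikely to fix.
        Failure::ConnectFailed
        | Failure::Certificate
        | Failure::InvalidBody
        | Failure::Decode
        | Failure::Redirect => false,
    }
}

fn main() {
    let samples = [
        Failure::Status(503),
        Failure::Status(404),
        Failure::Timeout,
        Failure::ConnectionReset,
        Failure::ConnectionAborted,
        Failure::IncompleteMessage,
        Failure::ConnectFailed,
        Failure::Certificate,
        Failure::InvalidBody,
        Failure::Decode,
        Failure::Redirect,
    ];
    for failure in samples {
        println!("{:?}: retry = {}", failure, should_retry(failure));
    }
}
```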

Does this answer your question?

Nov 01 '25 17:11 mre

That's super comprehensive, thanks! What do you think about adding that to the docs?

Based on your table, it appears that the connection reset error we intermittently experience would have been retried. If we were to increase the number of retries, would there be a fixed wait between each, or is there exponential backoff?

Nov 02 '25 01:11 ofek

Glad you liked it. If you like, you can create a pull request to add the table to the docs. The repo is here: https://github.com/lycheeverse/lycheeverse.github.io. I don't know yet what the perfect place to add it would be.

And yes, there's an exponential backoff between all retries.
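To get a feel for what raising the retry count means in wall-clock time, here is a small sketch that prints a backoff schedule. The 1-second initial wait and the doubling factor are assumptions for illustration (check the linked code for the real values), and `backoff_schedule` is a hypothetical helper.

```rust
use std::time::Duration;

/// Returns the wait before each retry, doubling after every attempt.
/// The doubling factor is an assumption, not a confirmed lychee detail.
fn backoff_schedule(initial: Duration, retries: u32) -> Vec<Duration> {
    let mut wait = initial;
    (0..retries)
        .map(|_| {
            let current = wait;
            wait = wait.saturating_mul(2);
            current
        })
        .collect()
}

fn main() {
    let schedule = backoff_schedule(Duration::from_secs(1), 5);
    let total: Duration = schedule.iter().sum();
    println!("waits: {:?}, total: {:?}", schedule, total);
}
```

Under these assumptions, five retries would wait 1, 2, 4, 8, and 16 seconds, so a link that keeps failing costs 31 seconds of waiting in total before it is reported.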

Nov 03 '25 12:11 mre