Retry policy for network errors
Hello! In our CI we've been getting transient network errors like the following:
[ERROR] https://toml.io/en/ | Network error: Connection reset by server. Server forcibly closed connection
That error message is explicitly transformed here.
The docs page for network errors mostly covers how to reproduce them, with an emphasis on certificate issues. I can't tell whether retries happen for errors like this one or only for HTTP rate limits (which I see is being further improved in https://github.com/lycheeverse/lychee/pull/1844).
Can you reproduce it locally? It works for me.
echo 'https://toml.io/en/' | lychee -
🔍 1 Total (in 0s) ✅ 1 OK 🚫 0 Errors
Without being able to reproduce it, it will be very hard to troubleshoot. Ideally, we'd also find a way to reproduce it with curl or another tool. That way, we can tell whether it's a server issue or a client issue.
I most certainly wouldn't be able to because it's a flake (that just so happens to manifest more frequently than expected).
I think this specific URL is unrelated to the issue, which is mostly about documenting when retries occur.
What do you mean by "documenting when retries occur"? lychee uses a very basic retry mechanism. It retries each request up to MAX_RETRIES times, where the default is 3 retries. https://github.com/lycheeverse/lychee/blob/d85ed9e6a9d2976af701b3efdf4a0c0483ecac70/lychee-lib/src/client.rs#L42
We apply exponential backoff between retries. The code is here:
https://github.com/lycheeverse/lychee/blob/d85ed9e6a9d2976af701b3efdf4a0c0483ecac70/lychee-lib/src/checker/website.rs#L88-L105
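For readers who don't want to follow the links, here is a minimal sketch of what "retry up to a maximum number of times with exponential backoff" can look like. It is not lychee's actual code: the function name `retry_with_backoff`, its parameters, and the doubling factor are assumptions made for illustration, and it uses a blocking sleep from the standard library to stay self-contained.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper (not part of lychee): runs `attempt` once, then
/// retries up to `max_retries` more times while `is_retryable` accepts
/// the error. The wait doubles after every retry (assumed factor of 2).
fn retry_with_backoff<T, E>(
    max_retries: u32,
    initial_wait: Duration,
    mut attempt: impl FnMut() -> Result<T, E>,
    is_retryable: impl Fn(&E) -> bool,
) -> Result<T, E> {
    let mut wait = initial_wait;
    let mut retries = 0;
    loop {
        match attempt() {
            // Success: stop immediately.
            Ok(value) => return Ok(value),
            // Retryable failure with budget left: wait, then try again.
            Err(err) if retries < max_retries && is_retryable(&err) => {
                retries += 1;
                sleep(wait);
                wait = wait.saturating_mul(2);
            }
            // Non-retryable failure, or retry budget exhausted.
            Err(err) => return Err(err),
        }
    }
}
```

With `max_retries = 3`, a request that keeps failing with a retryable error is attempted four times in total: the initial attempt plus three retries.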
The code that decides whether we should retry a request is here. There are no other conditions.
I don't know if and how we should document this. I'm open to suggestions and pull requests. But keep in mind that we have to keep the documentation in sync with the code, which is not always easy.
Sorry about that, let me be more explicit! What I'm trying to figure out specifically is what types of errors are retried, e.g. HTTP status codes, certificate errors, connection issues, etc.
Sure. I tried to summarize the current behavior as a table:
| Error Type | Retried? | Examples |
|---|---|---|
| 5xx Server Errors | ✅ Yes | 500, 502, 503, 504 |
| 408 Request Timeout | ✅ Yes | Request took too long |
| 429 Too Many Requests | ✅ Yes | Rate limit exceeded |
| Connection Timeout | ✅ Yes | Server didn't respond in time |
| Connection Reset | ✅ Yes | Connection dropped unexpectedly |
| Connection Aborted | ✅ Yes | Connection terminated mid-request |
| Incomplete Message | ✅ Yes | Response cut off before completion |
| 4xx Client Errors | ❌ No | 400, 401, 403, 404 (except 408, 429) |
| 2xx Success | ❌ No | 200, 201, 204 |
| 3xx Redirects | ❌ No | 301, 302 |
| Initial Connection Failure | ❌ No | Can't reach server at all |
| Certificate Issues | ❌ No | SSL/TLS errors |
| Invalid Request Body | ❌ No | Malformed data |
| Decoding Errors | ❌ No | Can't parse response |
| Redirect Errors | ❌ No | Too many redirects, etc. |
As a rule of thumb:
- Retries: Temporary problems (server down, network hiccup, timeout)
- No Retry: Permanent problems (bad request, auth failure, not found)
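The table and rule of thumb above can be restated as a small classifier. This is only a sketch that mirrors the table, not lychee's implementation: the `NetworkError` enum, its variant names, and both function names are hypothetical and exist purely for illustration.

```rust
/// Hypothetical error categories taken from the rows of the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NetworkError {
    ConnectionTimeout,
    ConnectionReset,
    ConnectionAborted,
    IncompleteMessage,
    InitialConnectionFailure,
    Certificate,
    InvalidRequestBody,
    Decoding,
    Redirect,
}

/// Status codes: 408, 429, and all 5xx are retried; everything else is not.
fn is_retryable_status(code: u16) -> bool {
    match code {
        408 | 429 => true, // Request Timeout, Too Many Requests
        500..=599 => true, // 5xx server errors
        _ => false,        // 2xx, 3xx, and the remaining 4xx
    }
}

/// Network-level errors: only the transient categories are retried.
fn is_retryable_network_error(err: NetworkError) -> bool {
    matches!(
        err,
        NetworkError::ConnectionTimeout
            | NetworkError::ConnectionReset
            | NetworkError::ConnectionAborted
            | NetworkError::IncompleteMessage
    )
}
```

Under this sketch, `is_retryable_network_error(NetworkError::ConnectionReset)` is `true`, which matches the "Connection Reset" row.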
Does this answer your question?
That's super comprehensive, thanks! What do you think about adding that to the docs?
Based on your table, it appears that the connection reset error we intermittently experience would have been retried. If we were to increase the number of retries, would there be a fixed wait between each, or is there exponential backoff?
Glad you liked it. If you like, you can create a pull request to add the table to the docs. The repo is here: https://github.com/lycheeverse/lycheeverse.github.io I don't know what the perfect place to add it would be yet.
And yes, there's an exponential backoff between all retries.
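To get a feel for what raising the retry count means under exponential backoff, the wait schedule can be computed up front. The sketch below assumes the wait doubles after each retry and takes the initial wait as a parameter. Both are assumptions for illustration, and `backoff_schedule` is a hypothetical helper rather than anything lychee exposes.

```rust
use std::time::Duration;

/// Hypothetical helper: returns the wait before each retry, assuming the
/// wait starts at `initial_wait` and doubles after every retry.
fn backoff_schedule(initial_wait: Duration, max_retries: u32) -> Vec<Duration> {
    let mut wait = initial_wait;
    let mut schedule = Vec::new();
    for _ in 0..max_retries {
        schedule.push(wait);
        wait = wait.saturating_mul(2);
    }
    schedule
}
```

Assuming an initial wait of one second, five retries would wait 1, 2, 4, 8, and 16 seconds, so a link that never recovers costs about 31 seconds of waiting, compared with 7 seconds for three retries.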