
feat: implement per-host rate limiting and statistics

Open thomas-zahner opened this issue 1 month ago • 1 comment

Supersedes #1844. I rebased onto master to resolve conflicts. Since there were many conflicts and I spent some time resolving them, I didn't force-push to the existing branch in case I messed something up (it seems like I didn't).

Things to be done:

  • [x] Don't return Client from check & update Arcs (https://github.com/lycheeverse/lychee/pull/1844/files#r2360356264)
  • [x] CLI test for host-specific headers
  • [x] Fix the behavior when setting the request interval to 0
  • [ ] Do thorough manual testing and potentially reconsider the new default settings
  • [ ] Maybe CLI test to verify expected timing with different host concurrencies & request intervals?

Fixes #1605 - Add Per-Host Rate Limiting and Caching
Fixes #989 - Add custom delay between requests (prevent ban)
Fixes #1298 - restrict custom HTTP request headers to specific URL patterns
Addresses #367 - 429 Too Many Requests
Addresses #1593 - The cache is ineffective with the default concurrency
Addresses #1815 - All duplicates should get removed, and no link ever gets checked more than once
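
For reviewers who haven't followed #1844: the core idea is per-host rate limiting, i.e. each host gets its own concurrency cap and a minimum interval between consecutive requests to it. The sketch below is purely illustrative; the type names, defaults, and structure are made up and not taken from this branch.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::{Mutex, Semaphore};
use tokio::time::{sleep, Duration, Instant};

/// Illustrative per-host state: a concurrency cap plus a minimum
/// interval between consecutive requests to the same host.
struct HostState {
    semaphore: Arc<Semaphore>,
    last_request: Mutex<Option<Instant>>,
    request_interval: Duration,
}

#[derive(Default)]
struct RateLimiter {
    hosts: Mutex<HashMap<String, Arc<HostState>>>,
}

impl RateLimiter {
    /// Wait until a request to `host` is allowed and return a permit
    /// that frees the per-host concurrency slot when dropped.
    async fn acquire(&self, host: &str) -> tokio::sync::OwnedSemaphorePermit {
        let state = {
            let mut hosts = self.hosts.lock().await;
            hosts
                .entry(host.to_string())
                .or_insert_with(|| {
                    Arc::new(HostState {
                        semaphore: Arc::new(Semaphore::new(4)), // hypothetical default
                        last_request: Mutex::new(None),
                        request_interval: Duration::from_millis(500), // hypothetical default
                    })
                })
                .clone()
        };

        let permit = state.semaphore.clone().acquire_owned().await.unwrap();

        // Enforce the minimum interval between requests to this host.
        let mut last = state.last_request.lock().await;
        if let Some(prev) = *last {
            let elapsed = prev.elapsed();
            if elapsed < state.request_interval {
                sleep(state.request_interval - elapsed).await;
            }
        }
        *last = Some(Instant::now());
        permit
    }
}

#[tokio::main]
async fn main() {
    let limiter = RateLimiter::default();
    let _permit = limiter.acquire("example.com").await;
    println!("allowed to send a request to example.com");
}
```

Dropping the returned permit releases the per-host slot again, so the concurrency cap is enforced without any global coordination.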

thomas-zahner · Nov 21 '25 13:11

@mre I cleaned up the PR quite a bit. I think the PR is almost ready to be merged. I'm just considering some additional testing, changing some defaults, and maybe addressing the errors when request_interval is set to 0. If you have time, could you review my additions?
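
For the zero-interval case, one possible direction (purely a sketch, not what the branch currently does) would be to treat 0 as "no throttling" and skip the delay entirely:

```rust
use tokio::time::{sleep, Duration};

/// Hypothetical helper: only sleep when a positive interval is configured,
/// so a request_interval of 0 simply means "no throttling".
async fn wait_for_interval(request_interval: Duration) {
    if !request_interval.is_zero() {
        sleep(request_interval).await;
    }
}

#[tokio::main]
async fn main() {
    wait_for_interval(Duration::ZERO).await; // returns immediately
    wait_for_interval(Duration::from_millis(50)).await; // waits 50 ms
    println!("done");
}
```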

Removed things

While cleaning up I encountered a few things where I didn't understand why you had added them. I've removed them to test and see if it made a difference. I think we're better off without them, unless you correct me.

I've removed cache_max_age, see 47d8b65d. As explained in the commit message I don't see why it is needed. I've understood the cache to be only host-specific and in memory; it has nothing to do with the lychee cache exposed to users (.lycheecache). Why would we need cache expiration, and why would we set an expiration time of one hour, when a link check run normally takes a few seconds or minutes?
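
To make the distinction concrete, this is roughly how I picture the in-memory cache (a sketch with made-up names, not the actual code): a per-run map of already-checked URLs that disappears when the process exits, so an expiry time has nothing meaningful to act on.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical per-run cache: URL -> outcome of the last check.
/// It only lives for a single lychee invocation, so entries never get
/// old enough to need a cache_max_age.
#[derive(Default)]
struct InMemoryCache {
    entries: Mutex<HashMap<String, bool>>, // true = link was reachable
}

impl InMemoryCache {
    fn get(&self, url: &str) -> Option<bool> {
        self.entries.lock().unwrap().get(url).copied()
    }

    fn insert(&self, url: String, ok: bool) {
        self.entries.lock().unwrap().insert(url, ok);
    }
}

fn main() {
    let cache = InMemoryCache::default();
    cache.insert("https://example.com".to_string(), true);
    println!("{:?}", cache.get("https://example.com")); // Some(true)
}
```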

In c9578b9a I've removed max_concurrency and global_semaphore. As explained in the message I didn't see any use of them either. Did I miss something?

See 727f2092 where I replace Window with Vec. Was there a good reason for using a Window? IMO it only distorts the statistics. Memory shouldn't be an issue unless we had millions of requests; a Duration is 16 bytes.

thomas-zahner · Dec 10 '25 12:12

I've removed cache_max_age, see 47d8b65. As explained in the commit message I don't see why it is needed.

You're right. Makes no sense. Similar to limiting the window-size, I was a bit worried about memory usage for long-running processes. My thought was that we could limit the in-memory cache in case link checking takes a long time. This might become relevant once we add recursion support, but even then I'm not 100% sure. So it's unlikely that this causes a problem, and I'm thankful that it got removed.

In c9578b9 I've removed max_concurrency and global_semaphore. As explained in the message I didn't see any use of them either. Did I miss something?

The global semaphore was to enforce an overall concurrency limit, but I somehow missed that our mpsc channel already limits that. So that's unnecessary. Thanks for the cleanup.
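
For context, this is the backpressure effect of a bounded channel, shown with a toy example (capacity and timings are made up, not lychee's actual pipeline): `send` awaits once the buffer is full, so producers can never get far ahead of the consumers, and a separate global semaphore adds nothing.

```rust
use tokio::sync::mpsc;
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    // A bounded channel with capacity 2: once two requests are queued,
    // `send` awaits until the consumer makes room, so the producer is
    // throttled without any extra global semaphore.
    let (tx, mut rx) = mpsc::channel::<u32>(2);

    tokio::spawn(async move {
        for i in 0..5 {
            tx.send(i).await.unwrap();
            println!("queued request {i}");
        }
    });

    while let Some(i) = rx.recv().await {
        // Simulate a slow consumer; the producer stalls while the buffer is full.
        sleep(Duration::from_millis(100)).await;
        println!("handled request {i}");
    }
}
```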

See 727f209 where I replace Window with Vec. Was there a good reason for using a Window? IMO it only distorts the statistics. Memory shouldn't be an issue unless we had millions of requests; a Duration is 16 bytes.

Yeah, I was actually a bit worried about memory usage in the case where we handle millions of requests (which is unlikely but not impossible). But it's probably a case of premature optimization, and before that part becomes a problem other issues would likely arise first (such as input handling). So it's fine to use a Vec for consistency.
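
For the record, here are the numbers behind "a Duration is 16 bytes", as a toy sketch (the struct and names are made up): even a million recorded durations is only about 16 MB, and a plain Vec keeps the statistics code simple.

```rust
use std::time::Duration;

/// Hypothetical per-host statistics backed by a plain Vec instead of a
/// fixed-size window, so averages cover *all* recorded requests.
#[derive(Default)]
struct ResponseStats {
    durations: Vec<Duration>,
}

impl ResponseStats {
    fn record(&mut self, d: Duration) {
        self.durations.push(d);
    }

    fn average(&self) -> Option<Duration> {
        if self.durations.is_empty() {
            return None;
        }
        let total: Duration = self.durations.iter().sum();
        Some(total / self.durations.len() as u32)
    }
}

fn main() {
    // A million samples at 16 bytes each is roughly 16 MB:
    println!("{} bytes", 1_000_000 * std::mem::size_of::<Duration>());

    let mut stats = ResponseStats::default();
    stats.record(Duration::from_millis(120));
    stats.record(Duration::from_millis(80));
    println!("{:?}", stats.average()); // Some(100ms)
}
```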

mre · Dec 16 '25 11:12

Great work!

Can we set a different default for GitHub? As per their docs, they allow up to 100 concurrent requests, and many of our users use lychee to check GitHub links specifically. A higher default there would reduce friction, as it closely matches the prior behavior.

If we decide to change GitHub's default, we should mention that in the README somewhere and maybe in the extended help text.
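
Purely as an illustration of what such a per-host default could look like (the struct, field names, and numbers below are made up, not the actual config format):

```rust
use std::collections::HashMap;

/// Hypothetical per-host settings; names and numbers are illustrative only.
struct HostSettings {
    max_concurrency: usize,
    request_interval_ms: u64,
}

fn default_host_settings() -> (HostSettings, HashMap<String, HostSettings>) {
    let default = HostSettings {
        max_concurrency: 4,
        request_interval_ms: 500,
    };

    let mut overrides = HashMap::new();
    // GitHub documents a limit of 100 concurrent requests, so it could
    // get a more generous override than other hosts.
    overrides.insert(
        "github.com".to_string(),
        HostSettings {
            max_concurrency: 100,
            request_interval_ms: 0,
        },
    );

    (default, overrides)
}

fn main() {
    let (default, overrides) = default_host_settings();
    let github = &overrides["github.com"];
    println!(
        "default: {} concurrent, {} ms; github.com: {} concurrent, {} ms",
        default.max_concurrency,
        default.request_interval_ms,
        github.max_concurrency,
        github.request_interval_ms
    );
}
```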

mre · Dec 16 '25 11:12

One more note: have we checked that caching still works? E.g. do the links end up in the cache file at the end?

mre · Dec 16 '25 12:12

One more note: have we checked that caching still works? E.g. do the links end up in the cache file at the end?

Did a quick manual test and it seems to work as expected. Also, this should be covered by the integration tests in cli.rs.

thomas-zahner · Dec 18 '25 17:12

I just made some small adjustments which I think are improvements. Do you agree with the changes? If so, I would now like to merge this PR. As part of fc3023cd I've also adjusted the default request interval value, because I think it was a bit too pessimistic, especially considering that we previously had no interval at all and we now also respect rate limit headers.
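
For context on "respect rate limit headers": the simplest case is a numeric Retry-After value; a rough sketch (not the actual implementation) of turning it into a wait duration could look like this.

```rust
use std::time::Duration;

/// Illustrative helper: interpret a numeric `Retry-After` header value
/// (given in seconds). A real implementation would also have to handle
/// the HTTP-date form of the header.
fn retry_after_seconds(value: &str) -> Option<Duration> {
    value.trim().parse::<u64>().ok().map(Duration::from_secs)
}

fn main() {
    assert_eq!(retry_after_seconds("120"), Some(Duration::from_secs(120)));
    assert_eq!(retry_after_seconds("not-a-number"), None);
    println!("ok");
}
```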

Can we set a different default for GitHub? As per their docs, they allow up to 100 concurrent requests and many of our users use lychee to check GitHub links specifically, which would reduce friction as it closely matches the prior behavior.

I would like to do this separately, as it might still take some effort. For this I've created #1960.

thomas-zahner · Dec 19 '25 16:12

Thanks for the high-quality additions and for following through. This is a massive step forward and I'm happy to finally see it land. A lot of our current issues will go away once merged. 😊 ⭐

Thank you for the core work! Totally agree that this is quite a milestone for lychee. I did some tests where I compared lychee v0.21.0 with this branch and the link checking process was more than twice as fast (so intentionally slower requests lead to an overall speedup). Will do some follow-up testing; this might be worth a report and blog post.

thomas-zahner · Dec 21 '25 17:12