lychee icon indicating copy to clipboard operation
lychee copied to clipboard

Duplicate pings even with --cache

Open khemarato opened this issue 2 years ago • 3 comments

It seems that a given URL may be pinged simultaneously on different threads, leading to a duplicate ping even with --cache

khemarato avatar Jul 27 '22 09:07 khemarato

Yeah, that's a known limitation of the current caching implementation. If a worker receives a URL that is also handled by a different worker but the URL is not cached yet, it will be checked more than once. The reason is that we don't want to block workers on a global cache.

Do you get a lot of duplicate checks?

I'm thinking that each worker/client could get a handle to the shared cache, which could slightly reduce the duplications. However the bigger problem is that we would have to mark in-flight requests somehow to eliminiate all duplications.

Another alternative is to have one worker per domain; each with their own cache. This would fully eliminate duplicate requests and also make the rate-limiting implementation easier that we already have in mind. The downside is higher resource usage on many different domains (which is a common case). One would have to profile if that's a problem, though.

I'm open for suggestions, but in the end in comes down to tradeoffs I guess. Hope that helps.

mre avatar Jul 27 '22 20:07 mre

I am getting a fair number of duplicate pings, but as you say, it's a tradeoff.

I like the idea of hashing domains to workers, but obviously this has performance implications especially for people pinging just a couple domains 🤔

Perhaps a planning phase at initial load, which takes stock of the input and selects an execution strategy more dynamically? But that's getting pretty complicated and will increase maintenance/ debugging a lot, so... yeah ... tradeoffs.

khemarato avatar Jul 27 '22 22:07 khemarato

I don't think worker thread should have access to cache at all. A supervisor thread should push URLs to workers. It'll also be easier to implement rate limit I think.

lebensterben avatar Jul 27 '22 23:07 lebensterben

That's what we do already. The cache handling is outside of the workers. It currently is located in the binary code here, but I'm considering to move it to the library.

mre avatar Nov 11 '22 15:11 mre

I'll go ahead and close this. The potential improvements of refactoring are currently unclear given the trade-offs we discussed. We might revisit this again in the future. Thanks for the report @buddhist-uni.

mre avatar Dec 03 '22 00:12 mre