
Use tiny hash in `Crawly.Middlewares.UniqueRequest`

Open tanguilp opened this issue 3 years ago • 2 comments

Instead of storing the whole URL in the spider's state in `Crawly.Middlewares.UniqueRequest`, I suggest hashing the URL and storing the hash.

I guess taking the first quarter of a SHA-256 digest (64 bits) would do it; quick maths gives 2^(8*8) = 2^64 ≈ 1.8 × 10^19 possible values, which is more than enough.

When crawling large sites, storing the full URLs can quickly become a bottleneck (tens of GB).
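For illustration, a minimal sketch of the idea (standalone code, not the actual middleware; the module and function names are made up): keep only the first 8 bytes (64 bits) of the URL's SHA-256 digest and store that in the seen set.

```elixir
defmodule UrlFingerprint do
  # Take the first 8 bytes (64 bits) of the SHA-256 digest of the URL.
  def fingerprint(url) do
    <<prefix::binary-size(8), _rest::binary>> = :crypto.hash(:sha256, url)
    prefix
  end

  # Store fingerprints instead of full URLs in the seen set.
  def seen?(seen, url), do: MapSet.member?(seen, fingerprint(url))
  def mark_seen(seen, url), do: MapSet.put(seen, fingerprint(url))
end
```

Each stored entry is then a fixed 8 bytes, regardless of how long the URL is.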

tanguilp · Dec 04 '20

Yes, I agree. I never planned to keep the full URL here, but decided to leave it this way because it was faster at the time (in terms of development speed). I agree we should optimize it.

oltarasenko · Dec 07 '20

Actually a quarter of SHA-256 (64 bits) might not be enough, because of the birthday problem: with a 64-bit hash, a collision becomes about 50% likely once roughly 2^32 ≈ 4 × 10^9 pages have been parsed, which can seem like a lot but is not (counting pages with query parameters, for example).

But I'm no mathematician, someone needs to double-check this!
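For what it's worth, a quick back-of-the-envelope check using the standard birthday approximation (figures are approximate only):

```elixir
bits = 64

# Number of items at which a collision is ~50% likely: ~1.1774 * sqrt(2^bits)
n_half = 1.1774 * :math.sqrt(:math.pow(2, bits))
# => about 5.1e9 URLs

# Collision probability after n distinct URLs: p ≈ 1 - exp(-n^2 / 2^(bits + 1))
n = 1.0e9
p = 1 - :math.exp(-n * n / :math.pow(2, bits + 1))
# => about 0.027, i.e. roughly a 2.7% chance after a billion URLs
```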

tanguilp · Dec 07 '20

Making hashing optional is a possibility, for users who would like to trade a small amount of correctness for memory savings (see the sketch after the checklist below):

  • [ ] add a hash function to the middleware
  • [ ] add a boolean option for enabling hashing
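
Something along these lines could work. This is only a rough sketch: the `run/3` signature, the `request.url` field, the `:seen_requests` state key and the `:hash_urls` option name are assumptions rather than the real `Crawly.Middlewares.UniqueRequest` code.

```elixir
defmodule UniqueRequestSketch do
  # Drops requests whose URL (or URL hash, when :hash_urls is set) was already seen.
  def run(request, state, opts \\ []) do
    key =
      if Keyword.get(opts, :hash_urls, false) do
        # Trade exactness for memory: 8-byte fingerprint instead of the full URL.
        <<prefix::binary-size(8), _::binary>> = :crypto.hash(:sha256, request.url)
        prefix
      else
        request.url
      end

    seen = Map.get(state, :seen_requests, MapSet.new())

    if MapSet.member?(seen, key) do
      {false, state}
    else
      {request, Map.put(state, :seen_requests, MapSet.put(seen, key))}
    end
  end
end
```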

Ziinc · Nov 26 '22