crawly
Use tiny hash in `Crawly.Middlewares.UniqueRequest`
Instead of storing the whole URL in the spider's state in `Crawly.Middlewares.UniqueRequest`, I suggest hashing the URL and storing the hash instead.
I guess taking the first quarter of a SHA-256 digest (64 bits) would do it; quick maths gives 2^64 ≈ 1.8 × 10^19 possible values, which is well enough.
When crawling large sites, storing the full URLs can quickly become a bottleneck (in the dozens of GB).
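To illustrate the idea (in Python for brevity, since the concept is language-agnostic; the `url_hash` and `is_new` helpers below are hypothetical, not part of Crawly): a truncated SHA-256 stores a fixed 8 bytes per URL instead of the full string.

```python
import hashlib

def url_hash(url: str) -> bytes:
    """Hypothetical helper: first quarter (8 bytes = 64 bits) of SHA-256."""
    return hashlib.sha256(url.encode("utf-8")).digest()[:8]

seen = set()

def is_new(url: str) -> bool:
    """Deduplicate on the truncated hash instead of the full URL."""
    h = url_hash(url)
    if h in seen:
        return False
    seen.add(h)
    return True

url = "https://example.com/some/long/path?page=42&sort=asc"
print(len(url_hash(url)))        # 8 bytes, versus len(url) bytes for the raw URL
print(is_new(url), is_new(url))  # True False
```

The memory saving grows with URL length: every entry costs 8 bytes regardless of how long the original URL was.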
Yes, I would agree. I never planned to keep full URLs here, but decided to leave it this way because it was faster at that point (I am talking about development speed). But I agree we should optimize it.
Actually a quarter (64 bits, i.e. 8 bytes) might not be enough, because of the birthday paradox. With 64-bit hashes a collision becomes 50% likely at roughly 1.177 · sqrt(2^64) ≈ 5 × 10^9 parsed pages, and is already a few percent at 10^9 pages, which can seem like a lot but is not (counting pages with query parameters, for example).
But I'm no mathematician, someone needs to double-check this!
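The numbers can be checked with the standard birthday-bound approximation p ≈ 1 − exp(−n² / 2N), where n is the number of URLs and N = 2^bits is the hash space (a back-of-the-envelope sketch, not an exact computation):

```python
import math

def collision_probability(n: float, bits: int) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n^2 / (2 * 2^bits))."""
    space = 2.0 ** bits
    return 1.0 - math.exp(-(n * n) / (2.0 * space))

# With 64-bit hashes, 10^9 URLs give only a few percent collision risk...
print(round(collision_probability(1e9, 64), 4))    # ~0.0267
# ...and the 50% point is reached around 1.177 * sqrt(2^64) ~= 5.1e9 URLs.
print(round(collision_probability(5.06e9, 64), 2))  # ~0.5
```

So 64 bits is marginal for multi-billion-page crawls; keeping 128 bits (half of SHA-256) pushes the 50% point to around 2 × 10^19 URLs while still costing only 16 bytes per entry.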
Making the hash function optional is a possibility, for users who would like to trade correctness for memory savings:
- [ ] add a hash function to the middleware
- [ ] add a boolean option for enabling hashing