urldedupe
urldedupe copied to clipboard
Move to hashing instead of generating URL keys
Instead of producing a string
for each URL to use as a key, it is sensible to produce a hash.
I pulled in the SpookyV2 hash library, which is reasonably compact and public domain, and produces 128-bit hashes. We definitely need a 128-bit or greater hash, as a 64-bit hash will probably collide after about 2^32 inputs, which would be about 256 GB of URLs with an average length of 64 bytes. This may have been fine, but to be safe, a 128-bit hash function lets us process ~1'000'000 PB of URLs, which is definitely fine.
I also changed the is_asset
and is_number
functions to accept string_view
s, since that will work with both string
s and string_view
s.
This PR depends on my last one. That is my mistake, and I don't know how to fix it.. Let me know if you need me to do something about that.
I updated this with a better input method which requires fewer allocations. @ameenmaali
Hey @larskraemer, thanks so much for the work here :D - I have been super busy/offline the past couple weeks as you've seen. I'll take a look at this when I get back to it in a few days. As a side note/first glance, is there any solution that doesn't require 3rd party code? That was one of my goals building this (hence why a lot of stuff could probably be easily done in an existing library). I've checked out the std::hash lib a bit, but not sure if it's got some issues. No worries if you think this is better, this looks pretty solid to me at the moment.
@ameenmaali I noticed :D As for not using third party libraries; As mentioned above, we probably want a 128 bit hash. I don't think the standard library provides one, so we're stuck with either writing our own, or using somebody else's solution. As for writing your own hash function, that's hard, and the result will probably be pretty bad (not because you or I are dumb, but because, as mentioned, it is hard to write a good hash function). I get that this is a learning project, but in this case, I think the best lesson is probably "Don't write your own hash function" :D