urldedupe icon indicating copy to clipboard operation
urldedupe copied to clipboard

Move to hashing instead of generating URL keys

Open larskraemer opened this issue 4 years ago • 3 comments

Instead of producing a string for each URL to use as a key, it is sensible to produce a hash. I pulled in the SpookyV2 hash library, which is reasonably compact and public domain, and produces 128-bit hashes. We definitely need a 128-bit or greater hash, as a 64-bit hash will probably collide after about 2^32 inputs, which would be about 256 GB of URLs with an average length of 64 bytes. This may have been fine, but to be safe, a 128-bit hash function lets us process ~1'000'000 PB of URLs, which is definitely fine. I also changed the is_asset and is_number functions to accept string_views, since that will work with both strings and string_views.

This PR depends on my last one. That is my mistake, and I don't know how to fix it.. Let me know if you need me to do something about that.

larskraemer avatar Jun 12 '20 01:06 larskraemer

I updated this with a better input method which requires fewer allocations. @ameenmaali

larskraemer avatar Jun 17 '20 23:06 larskraemer

Hey @larskraemer, thanks so much for the work here :D - I have been super busy/offline the past couple weeks as you've seen. I'll take a look at this when I get back to it in a few days. As a side note/first glance, is there any solution that doesn't require 3rd party code? That was one of my goals building this (hence why a lot of stuff could probably be easily done in an existing library). I've checked out the std::hash lib a bit, but not sure if it's got some issues. No worries if you think this is better, this looks pretty solid to me at the moment.

ameenmaali avatar Jul 02 '20 00:07 ameenmaali

@ameenmaali I noticed :D As for not using third party libraries; As mentioned above, we probably want a 128 bit hash. I don't think the standard library provides one, so we're stuck with either writing our own, or using somebody else's solution. As for writing your own hash function, that's hard, and the result will probably be pretty bad (not because you or I are dumb, but because, as mentioned, it is hard to write a good hash function). I get that this is a learning project, but in this case, I think the best lesson is probably "Don't write your own hash function" :D

larskraemer avatar Jul 02 '20 02:07 larskraemer