
Change body hashing method from SHA512 to Simhash

Open cstrouse opened this issue 6 years ago • 6 comments

Duplicate-content checking with SHA512 is limited to exact matches. What do you think of swapping it out for Simhash, so that more nuanced comparisons of content become possible?

The stopwords package already implements Simhash in Go and has a compatible license.

cstrouse avatar Nov 18 '19 05:11 cstrouse
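For context, a minimal sketch of the simhash technique itself (this is not the stopwords package's API, just an illustration of the algorithm: hash each token, tally each bit position across tokens, and set the fingerprint bit where the tally is positive):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// simhash computes a 64-bit simhash over whitespace-separated tokens.
// Each token is hashed with FNV-1a; for every bit position we keep a
// running tally (+1 if that bit is set in the token hash, -1 if not),
// and the final fingerprint has a 1 wherever the tally is positive.
// Near-duplicate texts share most tokens, so their fingerprints differ
// in only a few bit positions.
func simhash(text string) uint64 {
	var tally [64]int
	for _, token := range strings.Fields(text) {
		h := fnv.New64a()
		h.Write([]byte(token))
		sum := h.Sum64()
		for i := 0; i < 64; i++ {
			if sum&(1<<uint(i)) != 0 {
				tally[i]++
			} else {
				tally[i]--
			}
		}
	}
	var fingerprint uint64
	for i := 0; i < 64; i++ {
		if tally[i] > 0 {
			fingerprint |= 1 << uint(i)
		}
	}
	return fingerprint
}

func main() {
	a := simhash("the quick brown fox jumps over the lazy dog")
	b := simhash("the quick brown fox jumped over the lazy dog")
	fmt.Printf("%016x\n%016x\n", a, b)
}
```

A production version would tokenize more carefully (stopword removal, shingling), which is what the stopwords package handles.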

Reference: https://ferd.ca/simhashing-hopefully-made-simple.html

benjaminestes avatar Nov 18 '19 22:11 benjaminestes

Reference: http://benwhitmore.altervista.org/simhash-and-solving-the-hamming-distance-problem-explained/

benjaminestes avatar Nov 18 '19 22:11 benjaminestes

@cstrouse I've read up and understand enough to see how this is useful for associating nearly duplicate content. However, I'm not sure how to use it in practice.

You'd have to calculate the Hamming distance between every pair of URLs (or at least every pair of unique simhashes in the dataset) to figure out which pairs are potentially similar.

While the crawler would calculate the simhash, calculating the distance between simhashes would be a post-crawl step. Do you have an idea of how you would approach that step in practice? If I invest in this I want to know that the associated analysis would be feasible.

benjaminestes avatar Nov 19 '19 04:11 benjaminestes
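The post-crawl step described above could be sketched as follows: the distance between two simhashes is just the popcount of their XOR, and the naive version of the analysis is a quadratic pass over all unique fingerprints (the threshold of 3 bits here is an assumption, not a recommendation; the linked article on the Hamming distance problem describes index-based approaches that avoid the O(n^2) scan):

```go
package main

import (
	"fmt"
	"math/bits"
)

// hamming returns the number of differing bits between two simhashes.
func hamming(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// similarPairs is the naive O(n^2) post-crawl pass: compare every pair
// of unique simhashes and report those within the threshold.
func similarPairs(hashes []uint64, threshold int) [][2]uint64 {
	var pairs [][2]uint64
	for i := 0; i < len(hashes); i++ {
		for j := i + 1; j < len(hashes); j++ {
			if hamming(hashes[i], hashes[j]) <= threshold {
				pairs = append(pairs, [2]uint64{hashes[i], hashes[j]})
			}
		}
	}
	return pairs
}

func main() {
	// The first two fingerprints differ in a single bit, so they are
	// reported as a near-duplicate pair; the third is unrelated.
	hashes := []uint64{0xDEADBEEF, 0xDEADBEED, 0x12345678}
	fmt.Println(similarPairs(hashes, 3))
}
```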

BigQuery supports user-defined functions in either SQL or JavaScript, so we could feasibly come up with a way to do the analysis there.

cstrouse avatar Nov 19 '19 04:11 cstrouse

I'll have a think about this. INT64 values aren't supported in JS UDFs, so we would have to use a BYTES field.

benjaminestes avatar Nov 19 '19 18:11 benjaminestes
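On the crawler side, the BYTES workaround would be straightforward: pack the 64-bit fingerprint into 8 bytes before export. A sketch (big-endian byte order is an assumption here; whichever order is chosen just has to match the UDF that decodes it):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// simhashToBytes packs a 64-bit fingerprint into 8 big-endian bytes,
// suitable for a BigQuery BYTES column. A JS UDF can then read the bits
// directly, avoiding both the INT64 restriction and the precision loss
// a FLOAT64 (53-bit mantissa) would impose on a 64-bit value.
func simhashToBytes(h uint64) []byte {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, h)
	return b
}

func main() {
	fmt.Printf("%x\n", simhashToBytes(0xDEADBEEF00C0FFEE))
}
```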

The docs say to use FLOAT64 when you need a number that would otherwise be an INT64.

cstrouse avatar Nov 20 '19 13:11 cstrouse