
Use simhash to detect blockpages

Open hellais opened this issue 8 years ago • 2 comments

Mentioned by @darkk in https://github.com/TheTorProject/ooni-pipeline/pull/62

hellais avatar Sep 29 '17 15:09 hellais

The original goal was to have similarity metrics for http.body, so that similar pages could be clustered together, the clusters grouped by Domain / URL, and checked for unrelated pages that map to similar bodies. Those were candidates for blockpages (or webserver error pages, or cloudflare captcha pages, etc).
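For reference, a minimal sketch of the kind of similarity fingerprint described above. This is not the pipeline's actual code: the 64-bit width, the 3-word shingles, the use of blake2b as the per-shingle hash, and the names `simhash` and `hamming` are all assumptions made for illustration.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64, shingle: int = 3) -> int:
    """Return a simhash fingerprint of `text` (illustrative parameters only)."""
    tokens = re.findall(r"\w+", text.lower())
    # Overlapping word shingles; a short or empty body yields a single shingle.
    n = max(1, len(tokens) - shingle + 1)
    shingles = [" ".join(tokens[i:i + shingle]) for i in range(n)]

    counts = [0] * bits
    for s in shingles:
        digest = hashlib.blake2b(s.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1

    # A bit is set in the fingerprint when the positive votes win.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Bodies whose fingerprints differ in only a few bits would be candidates for the same cluster; how many bits count as "a few" is one of the parameters that would need tuning.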

A proof of concept was done years ago and found some new blockpages, but integrating it into the main pipeline took so long that it's unclear whether the heuristic is still practical. In any case, it needs manual labor for post-processing.

It's also unclear whether the current hashing parameters are the most suitable, because it's not clear what the target function for "suitability" is. Those values may be used to speed up reprocessing for issues like ooni/pipeline#79, ooni/pipeline#84, ooni/pipeline#100, but maybe it's easier to spend some extra credits on CPU at Amazon and do a full scan with several hundred cores.

So it's unclear whether this issue should stay open or be closed in favor of a "Drop simhash calculation" issue.

darkk avatar May 31 '19 13:05 darkk

In https://github.com/ooni/probe/issues/1727 I ran some experiments using TrendMicro's tlsh to measure page equality. I see a hash such as tlsh as useful for stating whether the page returned by the control is the same as the page observed by the probe. By contrast, simhash is useful for measuring distance, but it's still unclear to me exactly how we can reduce that to an equality test.
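One possible reduction, sketched here only to make the difficulty concrete, is to threshold the Hamming distance between two fingerprints. The function name `looks_same` and the default `max_distance=3` are hypothetical; nothing in this thread establishes a suitable threshold.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def looks_same(control_hash: int, probe_hash: int, max_distance: int = 3) -> bool:
    """Treat two fingerprints as 'the same page' when they differ in at most
    `max_distance` bits. The threshold is an assumption, not a tuned value."""
    return hamming(control_hash, probe_hash) <= max_distance
```

Note that a threshold like this is not a true equality: it is not transitive, so page A can match B and B can match C while A does not match C.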

bassosimone avatar Aug 06 '21 12:08 bassosimone