privacybadger icon indicating copy to clipboard operation
privacybadger copied to clipboard

Design and implement a fancier heuristicblock data structure

Open pde opened this issue 11 years ago • 10 comments

Currently the httpRequestOriginFrequency data structure needs to store fragments of browsing history locally within the browser, in order to detect which 3rd party domains are tracking. We should migrate to a fancier, higher-privacy data structure that prunes at the right time, uses HMAC fragments instead of origin names, and takes up less space.

pde avatar Aug 18 '14 18:08 pde

@jcb82 expressed interest in designing this.

pde avatar Aug 18 '14 18:08 pde

My understanding of what we want is a client-side data structure to store counts of pairs (first party, third party) which we'll call (x, y) such that when the same third party y has been detected for three or more different first parties, y is marked as a tracker domain. Except we don't want a fine-grained history to be stored. Any more information on the threat model and exactly how much leakage we can tolerate?

I would propose a two-level Bloom filter: keep N bloom filters around. When x, y is observed, compute H(x||i) for 0 < i < k. For each index i, update bloom filter i with value y. Once j of N bloom filters (for j =3k-e) have been updated with y, y is considered a tracking domain. We can tune i, N, k, j such that we have a high probability of marking tracking domains.

There might be a simpler approach, I'm not totally sure what the design requirements are here.

jcb82 avatar Aug 23 '14 00:08 jcb82

This was discussed in the bi-weekly developers meeting on 9/27.

Threat model: A local attacker would be able to view a subset of cleared history (only visited domains, not urls or times). This solution does not actually help against that since the attacker could still see if the users has visited site n by taking H(n) and searching for it in privacy badger's database. what does this stop: an attacker who can view pb data but doesn't know how to hash strings.

other option: respect history clearing events, and remove domains from snitch map when they are removed from history. problem: this will break privacy badger if the user consistently clears their entire history.

cooperq avatar Sep 28 '16 01:09 cooperq

We definitely should do this.

pde avatar Dec 15 '16 00:12 pde

In particular, my instinct is that we should do the following two things:

  • [ ] When tracker domains get redlisted, remove the three first party entries that caused them to be redlisted.
  • [ ] Maybe replace the first party domain names with a 1-2 hex digit HMAC slice. That's a simple special case of a bloom filter; it will make the algorithm a bit slower to block some trackers, but not terribly so.

pde avatar Dec 15 '16 00:12 pde

respect history clearing events, and remove domains from snitch map when they are removed from history

I don't think this part is optional; I think this is a hard requirement if you are going to maintain the full domain name.

pjlsergeant avatar Dec 15 '16 01:12 pjlsergeant

Maybe replace the first party domain names with a 1-2 hex digit HMAC slice. That's a simple special case of a bloom filter; it will make the algorithm a bit slower to block some trackers, but not terribly so.

This is a great idea; it gives deniability, you can use a very very fast HMAC digest algorithm for it, and the only draw back I see is that once in x times you'll undercount unique domains. This is a winner, unless there's some technical drawback I'm not thinking of.

pjlsergeant avatar Dec 15 '16 01:12 pjlsergeant

I don't think this part is optional; I think this is a hard requirement if you are going to maintain the full domain name.

I agree that leaving cleartext history samples around is unacceptable if the user clears their history. But I've been wondering if 1-hex-digit records are acceptable in that case.

There are some corner cases where a small amount of information about browsing history may be genuinely inferrable from those records. For instance, if a third party domain is on a grand total of 5 first party domains across the Web, then thirdparty-exampe.com : ["5", "A", "C"] probably allows an attacker to infer three visited first parties from the five possibilities. But such cases would be quite rare, and it isn't clear that these records are much worse than having thirdparty-example.com in our blocklist, which is more profoundly unavoidable.

pde avatar Dec 15 '16 21:12 pde

The corner cases still leave you with a deniability factor that the plaintext domains don't. I think the single(or double)-character hash here is a clear winner.

pjlsergeant avatar Dec 16 '16 05:12 pjlsergeant

I don't see a way to be notified of history-clearing events at this time without asking for the "history" permission, which will trigger a new permission warning, which seems unacceptable for Privacy Badger at this point (will lose a significant percentage of users in Chrome, and have a similar percentage stop upgrading in Firefox).

ghostwords avatar Mar 31 '18 17:03 ghostwords