
Golomb Coded Sets + K-anonymity to preserve privacy on match confirmation

Open sdktr opened this issue 4 years ago • 16 comments

Golomb Coded Sets are a way to compress a set of many hashed values - such as cryptographic keys - by exploiting the fact that the gaps between the sorted hashes are geometrically distributed. It may even be possible to use compressed sets with only partial hashes, and a confirmation API in case of a hit. @ahupowerdns

GCS seems a very attractive way to reduce the number of bytes needed to distribute the hit list to (mobile) clients. It does allow for false positives, though: a match against this compressed set has to be validated before we can be completely sure about the hit.
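To make the compression and false-positive trade-off concrete, here is a minimal Golomb-Rice coded set sketch in Python. The hash (SHA-256 truncated to 64 bits), the parameter p = 2^20 and the bit-string representation are illustrative assumptions only, not part of this proposal; a real implementation would pack bits and tune p to the desired false-positive rate (roughly 1/p).

```python
import hashlib

def _hash_to_range(key: bytes, n: int, p: int) -> int:
    """Map a key uniformly into [0, n*p) so the sorted deltas are roughly geometric."""
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return h % (n * p)

def gcs_encode(keys: list[bytes], p: int = 2**20) -> tuple[str, int]:
    """Golomb-Rice encode the hashed key set; p must be a power of two."""
    n = len(keys)
    values = sorted(_hash_to_range(k, n, p) for k in keys)
    rbits = p.bit_length() - 1                      # remainder width, log2(p)
    out, prev = [], 0
    for v in values:
        q, r = divmod(v - prev, p)
        prev = v
        out.append("1" * q + "0" + format(r, f"0{rbits}b"))  # unary quotient + fixed-width remainder
    return "".join(out), n

def gcs_contains(encoded: str, n: int, key: bytes, p: int = 2**20) -> bool:
    """Membership test; false positives occur with probability roughly 1/p."""
    target = _hash_to_range(key, n, p)
    rbits = p.bit_length() - 1
    pos, acc = 0, 0
    for _ in range(n):
        q = 0
        while encoded[pos] == "1":
            q += 1
            pos += 1
        pos += 1                                    # skip the unary terminator
        acc += q * p + int(encoded[pos:pos + rbits], 2)
        pos += rbits
        if acc == target:
            return True
    return False

keys = [i.to_bytes(16, "big") for i in range(1000)]  # stand-ins for 16-byte keys
encoded, n = gcs_encode(keys)
assert gcs_contains(encoded, n, keys[42])
```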

This raises a possible privacy concern when these 'confirmation requests' are logged by intermediate parties (like providers, enterprise proxies, CDNs) or the masterDB operators. They can see a request to a 'confirmation API', have info to tie it to an endpoint (e.g. via IP), and, depending on the response, have a way of knowing whether the client got a match and is assumed to be infected.

We can avoid that by using the K-anonymity model for the validation API.

  1. The client gets a hit on the compressed GCS list (example key AAAABBBBCCCCDDDD)
  2. The validation API is queried for /validate/range/AAAABBBB
  3. A file with all complete keys that start with that prefix is returned (only the missing part of each hash, to limit the data transferred, compressed with XYZ)
  4. The client finds a match locally (or does not)

No party in the middle knows whether the client got a match: privacy preserved. Meanwhile the prefix-files can be massively cached/scaled out by a CDN-pool that serves as the read buffer between clients and the masterDB.

A client should also request a bogus / made-up range file each day, so the server side completely loses visibility into potentially privacy-sensitive data. A request to the validation API therefore doesn't reveal any usable info. A middleman (including the masterDB operator) can't tell whether it was a:

  • bogus random request
  • GCS false positive
  • GCS confirmation

In all three cases, privacy and massively scalable read capacity are preserved (a client-side sketch of this flow follows below).
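A hedged client-side sketch of steps 1-4 plus the daily decoy request. The /validate/range/ path follows the example above; the base URL, the response format (a plain whitespace-separated list of hex suffixes, uncompressed) and the function names are illustrative assumptions only.

```python
import secrets
import urllib.request

VALIDATION_API = "https://example.invalid/validate/range/"  # hypothetical base URL
PREFIX_BYTES = 4   # "AAAABBBB" in the example above is a 4-byte / 8-hex-char prefix

def confirm_gcs_hit(candidate_key: bytes) -> bool:
    """Confirm a GCS hit against the validation API without revealing the full key."""
    prefix = candidate_key[:PREFIX_BYTES].hex().upper()
    suffix = candidate_key[PREFIX_BYTES:].hex().upper()
    with urllib.request.urlopen(VALIDATION_API + prefix) as resp:
        published_suffixes = resp.read().decode().split()
    return suffix in published_suffixes            # the final match stays on the device

def daily_bogus_request() -> None:
    """Fetch a random prefix file so traffic patterns reveal nothing server-side."""
    random_prefix = secrets.token_bytes(PREFIX_BYTES).hex().upper()
    try:
        with urllib.request.urlopen(VALIDATION_API + random_prefix) as resp:
            resp.read()
    except OSError:
        pass                                       # a decoy request does not need to succeed
```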

sdktr avatar Apr 12 '20 14:04 sdktr

Is this applicable against hashes that have been salted (it looks like salting if you squint a little) and used in a TOTP/HMAC fashion like A/G do? (see: https://covid19-static.cdn-apple.com/applications/covid19/current/static/contact-tracing/pdf/ContactTracing-CryptographySpecification.pdf)

To me it seems that we are on the wrong side of the sha256 hash-function at least one time too many for this trick to work.

As per the definitions in the A/G standard (and as I noted earlier in a different issue, the other protocols are quite similar):

  • Tracing key: tk <- CRNG(32)
  • Daily Tracing Key: dtk_i <- HKDF(tk, NULL, (UTF8("CT-DTK")||D_i),16)
  • Rolling Proximity Identifier: RPI_(i,j) <- Truncate(HMAC(dtk_i, (UTF8("CT-RPI")|| TIN_j)),16)
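For concreteness, a runnable sketch of these three derivations using HKDF-SHA256 and HMAC-SHA256 (the hash family the thread assumes). The byte encodings of D_i (4-byte little-endian) and TIN_j (single byte) are assumptions made for illustration; consult the A/G specification for the exact serialization.

```python
import hashlib
import hmac
import secrets

def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """RFC 5869 HKDF (extract + expand) with SHA-256."""
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

tk = secrets.token_bytes(32)                          # Tracing Key: tk <- CRNG(32)

def daily_tracing_key(tk: bytes, day_number: int) -> bytes:
    """dtk_i <- HKDF(tk, NULL, UTF8("CT-DTK") || D_i, 16)"""
    return hkdf_sha256(tk, b"", b"CT-DTK" + day_number.to_bytes(4, "little"), 16)

def rolling_proximity_identifier(dtk_i: bytes, tin_j: int) -> bytes:
    """RPI_(i,j) <- Truncate(HMAC(dtk_i, UTF8("CT-RPI") || TIN_j), 16)"""
    return hmac.new(dtk_i, b"CT-RPI" + bytes([tin_j]), hashlib.sha256).digest()[:16]
```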

Furthermore, I'd like to point out that we do create an additional privacy-problem with K-anonymity, because the user leaks a certain number of lookups. In the Troy Hunt password database, this is not a significant problem, because most users request only a few password-hashes from Troy Hunt's API.

With contact tracing we have a totally different beast on our hands, as the number of requests a user has to send, and the width of the required ranges, are approximately proportional to the number of other people the user has come into close contact with in a given time interval.

With contact tracing, this is directly proportional to the risk the person has of contracting the virus... Which in itself can be qualified as a potential leak of sensitive data. It is not beyond my imagination that in a stricter regulatory regime (e.g. France), this could even be used to search for and fine people who disobey the law.

To me, this seems to be a solution to a problem we don't have, because the hashes are salted and their input is the output of a CRNG, which should make them fundamentally useless to an attacker anyway.

Remember that in the expected case, the closest our backend comes to tk is a relatively small sequence (fewer than 50) of SHA-256 hashes derived from tk.

We should also take into consideration what it means if a user publishes such a sequence of dtk's: It means the user has tested positive for COVID-19. At the end of the sequence of dtk's, the user is supposed to be either dead or immune with a relatively high probability.

Effectively this means that the vast majority of users (probably more than 80%) will never submit a second set of dtk's to our backend. This in turn means that, except for those <50 keys (which remain behind at least one salted hash-function), the user's tk secret stays buried behind at least 2 layers of salted hash-functions for the remainder of the device's lifetime.

But suppose that we are truly unlucky and do not develop a vaccine in the next 10 years and suppose we have a truly unlucky user that somehow constantly reappears on our list for 10 years in a row. In this case, only 3652 or 3653 dtk's are ever made public. It's reasonable to assume that by then, the device is replaced (so there is a new tk).

I sincerely doubt that it's possible to find the 32-byte (256-bit) input of the SHA-256 hash-function within 10 years with only 3652 to 3653 sequential hashes of 16 bytes (128 bits) each at your disposal. I mean no offense, but I truly think the xkcd on security (https://xkcd.com/538/) applies here. Especially because there is a simple and explainable solution that removes the risk of false-positives due to our back end entirely: "Just publish all dtk's of actively infected people in chronological order and add a revocation list".

But maybe I'm missing or misunderstanding something because I am not smart enough.

However I'd also like to point out that we have to explain and defend how the privacy of users is kept intact. Which means that simple code and a simple explanation are also very important.

An explanation like "You have a higher probability of becoming the richest person on earth in your lifetime and you have a higher chance of getting all 14 digits of your pre-paid top-up code right in one go combined than you have of your identity leaking through this system", simply sells easier.

spycrowsoft avatar Apr 12 '20 18:04 spycrowsoft

I think you can only give a List of Diagnosis_Keys; see the flow in Figure 1 of the Contact Tracing - Framework Documentation.

Providing this List of Diagnosis_Keys can be done with signed content on a web CDN, right?

bwbroersma avatar Apr 13 '20 12:04 bwbroersma

I think you can only give a List of Diagnosis_Keys

Exactly.

Providing this List of Diagnosis_Keys can be done with signed content on a web CDN, right?

Yes. See my draft on https://github.com/ahupowerdns/covid-backend/pull/8 .

spycrowsoft avatar Apr 13 '20 15:04 spycrowsoft

Let me explain my understanding of @ahupowerdns take on this. It might explain my rationale for coming up with K-anonymity. I'll try to use the A/G terminology.

  1. The backend has at minimum two parts: (A) influx of Diagnosis Keys and (B) distribution of them
  2. An app can feed the list of (new) Diagnosis Keys to the OS API that A/G will build into their platforms. The OS will use these keys to regenerate the RPIs (the identifiers received via Bluetooth) and check whether this device has heard any RPI covered by the downloaded list.
  3. Based on @ahupowerdns' remarks on GCS, my assumption is that he proposes to use it as an optimisation of the Diagnosis Key distribution lists.
  4. A consequence of GCS is that a match in (2) MIGHT be a false positive. We must make sure we don't warn a user when the DK that appeared to have been in reach is in fact unreliable because of the GCS compression.
  5. That's why the DK used for diagnoses must be validated against the backend
  6. Requesting a literal match for a DK from the backend reveals info about the requester that we (and intermediate parties) should not have
  7. Hence this request should not send the complete DTK to the validation API, but only a prefix of it. The client can then determine locally whether the DTK it matched is in fact a valid DTK or not. Also, clients should randomly make 'fake' validation requests so that monitoring of API usage gives no pointers to whoever is watching

I'm not proposing an additional hashing of keys. The DTKs are already formatted to be usable in the format as is.

With contact tracing, this is directly proportional to the risk the person has of contracting the virus... Which in itself can be qualified as a potential leak of sensitive data. It is not beyond my imagination that in a stricter regulatory regime (e.g. France), this could even be used to search for and fine people who disobey the law.

^^^ I'm only proposing the validation API for 100% confirmation of a match against a GCS-compressed value. Not that we do a K-anonymity request for every RPI we heard on the device. Which, by the way:

  • we can't access RPI according to A/G proposal (good thing)
  • RPI can't be traced back to the DTK that generated it (as you pointed out)

sdktr avatar Apr 13 '20 15:04 sdktr

If we can create some geo boundary, e.g. only The Netherlands, the maximum number of identifiers is 17M × 14 days = 238M Diagnosis Keys, so what are we optimizing? Even without a geo boundary it is 8B × 14 = 112B. Since the Daily Tracing Key is 16 random bytes (so we cannot compress much), this is 238M × 16 bytes ≈ 3.55 GiB for The Netherlands, or 112B × 16 bytes ≈ 1.63 TiB for the whole world. This is the worst case; if everyone becomes infected the app becomes useless anyway. This app will probably only be used / functional if there is no lockdown, which means an infection level below the capacity of our hospitals, probably meaning < 4 percentage points of infections per month. If we would all use the app then, the 3.55 GiB comes down to roughly 0.14 GiB (≈ 145 MiB) per month.
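A quick back-of-the-envelope check of these figures (the 16-byte key size, 14-day window and 4% monthly infection rate are the assumptions stated above):

```python
KEY_BYTES = 16
DAYS = 14

nl_keys = 17_000_000 * DAYS                # ~238M Diagnosis Keys for The Netherlands
world_keys = 8_000_000_000 * DAYS          # ~112B Diagnosis Keys for the whole world

print(f"{nl_keys * KEY_BYTES / 2**30:.2f} GiB")               # ~3.55 GiB
print(f"{world_keys * KEY_BYTES / 2**40:.2f} TiB")            # ~1.63 TiB
print(f"{0.04 * nl_keys * KEY_BYTES / 2**20:.0f} MiB/month")  # ~145 MiB at <4%/month
```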

bwbroersma avatar Apr 13 '20 16:04 bwbroersma

Don’t forget that:

  • 'Diagnosis Keys' is a list of multiple DTKs. Assuming X(?) days of incubation time, you'll end up with X times that amount of data needing redistribution
  • The assumed usage is going to be very widespread. An optimized list (whatever method is used) saves a lot of data, which means cost savings worth investigating

sdktr avatar Apr 13 '20 16:04 sdktr

I assumed X = 14 days.

bwbroersma avatar Apr 13 '20 17:04 bwbroersma

I assumed X = 14 days.

The reasonable upper bound we can expect for X is 46 days. That is apparently the longest duration of the disease, including the incubation period, that has been seen in China.

So we can assume the difference is at most a factor of 3.

This is about the size of a single youtube-video.

I've done more elaborate calculations on the worst-case scenario, in an e-mail I've sent to @ahupowerdns. I arrived at much higher numbers in the worst case that were in the traffic-range of what NLUUG is currently handling.

I'll open a new issue with those calculations in the near future.

spycrowsoft avatar Apr 13 '20 17:04 spycrowsoft

How long is the incubation period for COVID-19?

The “incubation period” means the time between catching the virus and beginning to have symptoms of the disease. Most estimates of the incubation period for COVID-19 range from 1-14 days, most commonly around five days. These estimates will be updated as more data become available.

Source: https://www.who.int/news-room/q-a-detail/q-a-coronaviruses

Also see https://www.worldometers.info/coronavirus/coronavirus-incubation-period/

a very long incubation period could reflect a double exposure.

It's probably not wise to change the whole concept to > 14 days. Since this parameter is also in the hands of Google/Apple as an underlying parameter, I don't think we should assume > 14 days.

bwbroersma avatar Apr 13 '20 17:04 bwbroersma

My interpretation of X in this discussion was: when tested positive on t=0, how big of a range of DTKs are going to be uploaded to our backend?

If I follow your worst case of 46 days, are we requesting DTK -46 to 0? Another point I didn’t consider: do we assume the DTKs for 0 to +46 will be uploaded as well? That would mean we’re warning anyone that gets within close range of a positively tested person for quite some time. We should at least assume this is going to be the case when designing the system. An HCP can always revoke a set of keys when they’re sure the patient is cured (tested negatively).

Whatever X ends up being, the data usage suddenly explodes to (tested patients × X × 2).

sdktr avatar Apr 13 '20 17:04 sdktr

That's why the DK used for diagnoses must be validated against the backend

I've stated in my long post above why this might be a problem: a guess about the number of interactions a user has can be made from the number of "back end" requests a user makes, even if they do range queries.

This could become a can of worms that privacy activists will happily use against our system.

I also stated that the user does not have access to the daily diagnosis keys of others based on the proximity identifiers they are logging, because those are hashed AND salted diagnosis keys.

we can't access RPI according to A/G proposal (good thing)

Unless a user presents themselves to a healthcare worker, tests positive and then uploads their dtk's for the period in which they are infectious.

At that point in time, the dtk has to be distributed to all clients, after which they can check if any of their logged RPI's match against the dtk of the infected person.

So somehow, we have to get the dtk's of infectious persons out to all clients, so they can check against their logged RPI's, while minimizing the false-positive rate and also ensuring that there is no difference in traffic and/or access patterns.
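For illustration, a minimal sketch of that client-side check: re-derive the RPIs of each published Diagnosis Key and match them against the locally logged RPIs. The single-byte TIN encoding and the 144 ten-minute intervals per day are assumptions; in the A/G design this matching happens inside the OS, not in our code.

```python
import hashlib
import hmac

def exposure_check(published_dtks: list[bytes], logged_rpis: set[bytes]) -> bool:
    """Re-derive the RPIs of each published Diagnosis Key and match them locally."""
    for dtk in published_dtks:
        for tin in range(144):   # assumed: one RPI per 10-minute interval of the day
            rpi = hmac.new(dtk, b"CT-RPI" + bytes([tin]), hashlib.sha256).digest()[:16]
            if rpi in logged_rpis:
                return True      # the match stays local to the device
    return False
```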

I think privacy activists can live with large volumes of data being downloaded and some re-downloads, but I don't think they will accept specific targeted queries to eliminate false-positives.

spycrowsoft avatar Apr 13 '20 17:04 spycrowsoft

If I follow your worst case of 46 days, are we requesting DTK -46 to 0?

No.

The point is that this interval can be variable and has to be supplied by the healthcare professional as those probably are the most capable of estimating when the incubation began and when the spreading period will end.

It's then up to the user to input those dates into their app and upload their DayNumbers and dtks.

Quoting Apple's specification:

Matching Values from Users Tested Positive

Upon a positive test of a user for COVID-19, their Diagnosis Keys and associated DayNumbers are uploaded to the Diagnosis Server. A Diagnosis Server is a server that aggregates the Diagnosis Keys from the users who tested positive and distributes them to all the user clients who are using contact tracing. In order to identify any exposures, each client frequently fetches the list of Diagnosis Keys. Since Diagnosis Keys are sets of Daily Tracing Keys with their associated Day Numbers, each of the clients are able to re-derive the sequence of Rolling Proximity Identifiers that were advertised over Bluetooth from the users who tested positive. In order to do so, they use each of the Diagnosis Keys with the function defined to derive the Rolling Proximity Identifier. For each of the derived identifiers, they match it against the sequence they have found through Bluetooth scanning. Additional validation can be used to confirm that the advertising happened in a time window comparable to the one expected based on the TimeIntervalNumber. Matches must stay local to the device and not be revealed to the Diagnosis Server.

So in theory, we could receive any random number of dtks from a user if they choose to. By the way: we have to guard against this somehow, because there are lots of kids who would love to use this as a legal way of skipping class.

But the interval is certainly not a static parameter like 14 days.

And because patients can still be infectious when the incubation period has ended, the dtk's we can receive can be from any interval from -46 days to +46 days. So [-46, 0], [-4, +10] and [0, +46] could all be valid intervals.

46 just seemed like a reasonable number to cap my calculations and design-limits at.

spycrowsoft avatar Apr 13 '20 17:04 spycrowsoft

I understand we’re not requesting any specific range of DTKs; I was just making clear that the earlier estimates of the amount of data we must distribute were missing this factor. It’s merely a design aspect that we should be aware of. We agree that data should be managed by the owner or HCP only.

Wrt data usage vs privacy risk of verifying a match, I think we’d need to see the numbers of a calculation. There’s a reasoning there that a device with a factor 1000 more captured RPIs is 1000 times more likely to make a legit request to the confirmation API. I think in both cases the chances of a match are so slim that we’re completely safe when we make a request anyway, obfuscating anyone listening for these requests because they can’t tell the difference. When there are multiple matches against a GCS list that need confirmation, we can stop requesting validation on the first hit. Depending on the GCS parameters, the chances of a false positive (causing a second request) can be very low (see the rough estimate below).
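As a rough estimate of how rare those extra confirmation requests would be (both numbers below are assumptions, not measurements): with a Golomb-Rice parameter p, each locally checked key is a false positive with probability roughly 1/p.

```python
p = 2**20                    # assumed GCS parameter (false-positive rate ~1/p)
keys_checked = 10_000        # assumed number of local candidate checks per day
print(keys_checked / p)      # ~0.01 spurious confirmation requests per device per day
```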

I agree there must not be any doubt around privacy. It would be nice to have @ahupowerdns elaborate on the GCS false-positive problem.

sdktr avatar Apr 13 '20 18:04 sdktr

I'll open a new issue with those calculations in the near future.

See: https://github.com/ahupowerdns/covid-backend/issues/10

spycrowsoft avatar Apr 13 '20 19:04 spycrowsoft

So in theory, we could receive any random number of dtks from a user if they choose to. By the way: we have to guard against this somehow, because there are lots of kids who would love to use this as a legal way of skipping class.

A very simple defense against this would be to limit the set of keys we publish. It makes no sense to publish keys from before today - 14 days. It also makes no sense to publish keys beyond today + 14 days. That way, we can limit the impact of an attack like this.

If we only allow verified healthcare workers to add keys, an attack like this would become very laborious in practice. It would still be possible, but it would probably be beyond what an attacker would be capable of doing.
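A minimal sketch of the publication-window check suggested above; the ±14-day bound comes from this comment, while the DayNumber stand-in (days since the Unix epoch) and the function names are mine.

```python
from datetime import date
from typing import Optional

MAX_WINDOW_DAYS = 14   # the bound suggested above

def day_number(d: date) -> int:
    """Days since the Unix epoch (a loose stand-in for the spec's DayNumber)."""
    return (d - date(1970, 1, 1)).days

def accept_key_day(submitted_day: int, today: Optional[date] = None) -> bool:
    """Reject uploaded keys dated outside [today - 14 days, today + 14 days]."""
    today = today or date.today()
    return abs(submitted_day - day_number(today)) <= MAX_WINDOW_DAYS
```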

spycrowsoft avatar Apr 15 '20 09:04 spycrowsoft

Hi;

I am the creator of the k-Anonymous search proposal linked to in the original comment, and have also worked with Cornell to provide formal analysis and new developments to such protocols [ACM]. This work fed into the efforts to create Google Password Checkup by Google and Stanford [Usenix].

Before the pandemic, I was doing work on anonymising wireless unique identifiers (for example, in Bluetooth Journey Time Monitoring Systems). This work provides formal analysis and experimental data for applying k-Anonymity to hashes for the purpose of anonymisation. The pre-print of the paper is here (conference accepted). This can help with empirical analysis of managing data leaks, but doesn't quite match the situation discussed here: https://arxiv.org/abs/2005.06580

Recently, I've been working on using k-Anonymity to prevent de-anonymisation attacks in existing Contact Tracing protocols. I have formed a hypothesis for using a cross-hashing approach to provide a cryptographic guarantee for the minimum contact time and additionally prevent "Deanonymizing Known Reported Users". This uses a k-Anonymous search approach to reduce the communication overhead and additionally offers protection against data leaks from the server (using Private Set Intersection). The hypothesis can be found here alongside discussions of the risk factors - but do note there are no experimental results at this stage and this paper is not peer-reviewed: https://arxiv.org/abs/2005.12884

If anyone has any feedback on this work, please do reach out to me (my email is on the papers).

Thanks

IcyApril avatar May 27 '20 17:05 IcyApril