
Withdrawals or timestamping

sdktr opened this issue 4 years ago · 6 comments

Who's responsible for withdrawals?

Note that delta-sets also include deletion events, either because the key is no longer relevant (post 14 days) or because we corrected a mistake.

Should we publish withdrawals in the updates? Or are keys (the initial 'breach' published) timestamped to the moment of testing (the known-infected timestamp), so that the app should determine the appropriate time range for a warning?

We could keep a 'worst case' (e.g. 21 days) data retention on the backend and after that simply delete the data. The last known state a client has is the last update it got from the backend (which could be a delete as well!), while the client keeps a local definition of whether that state is a problem (within the time range of relevance) or not.
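
A minimal sketch of that client-side relevance check (names are hypothetical; it assumes a 14-day warning window alongside the 21-day backend retention mentioned above):

```python
from datetime import datetime, timedelta, timezone

RELEVANCE_WINDOW = timedelta(days=14)   # assumed warning window
RETENTION_WINDOW = timedelta(days=21)   # 'worst case' backend retention from above

def is_still_relevant(infection_time: datetime, now: datetime | None = None) -> bool:
    """Client-side check: does this key's infection timestamp still warrant a warning?"""
    now = now or datetime.now(timezone.utc)
    return now - infection_time <= RELEVANCE_WINDOW
```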

sdktr · Apr 12 '20 09:04

I propose we create a "delta-set" which represents a timeline of received and retracted keys by doing something like this:

We use 16-byte (128-bit) keys. These are the DTKs, or day-keys.

For distribution we simply concatenate the keys into one key file, which we then GZip.

We create two folders, one for keys representing active infections and one for keys which have been retracted.

Inside both folders we use something like "YMMDDHH-YMMDD.gz" as a file-naming convention, where:

  • YMMDDHH denotes the end of the time interval (UTC) in which the key was submitted to our back-end server.
  • YMMDD denotes the day (UTC) for which the DTKs / day-keys in this file are considered active infections or retracted.
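
Taking the convention literally (the lone "Y" read as a single-digit year), a hypothetical helper to build such a name could look like:

```python
from datetime import datetime

def delta_filename(submitted_until: datetime, key_day: datetime) -> str:
    """Build a 'YMMDDHH-YMMDD.gz' name from the submission-interval end and the key day (both UTC)."""
    return (f"{submitted_until.year % 10}{submitted_until:%m%d%H}"
            f"-{key_day.year % 10}{key_day:%m%d}.gz")

# delta_filename(datetime(2020, 4, 12, 10), datetime(2020, 4, 1)) -> '0041210-00401.gz'
```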

On each update we run "sha256sum *.gz > SHA256SUMS" in each folder, and we sign the resulting SHA256SUMS file with something like SSL, PGP, or a similar scheme.

This way we can distribute the data with assured integrity and authenticity in a cacheable way, and we provide clients with a simple index and a method of downloading only the delta files.
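
The index step, sketched in Python rather than the shell one-liner (signing left out; the "active" and "retracted" folder names are taken from the proposal above):

```python
import hashlib
from pathlib import Path

def write_index(folder: Path) -> None:
    """Write a SHA256SUMS file covering every .gz file in the folder."""
    lines = [
        f"{hashlib.sha256(f.read_bytes()).hexdigest()}  {f.name}"  # same layout sha256sum uses
        for f in sorted(folder.glob("*.gz"))
    ]
    (folder / "SHA256SUMS").write_text("\n".join(lines) + "\n")

write_index(Path("active"))
write_index(Path("retracted"))
```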

Note that keys that were incorrectly retracted can be made active again by simply re-uploading them in a file in the "active" folder in the following hour.

Also: updating the data on all the webservers can be as simple as pushing out two folders through rsync or something similar.

Of course this is by no means perfect and can probably be improved upon.

spycrowsoft · Apr 12 '20 10:04

Inside both folders we use something like "YMMDDHH-YMMDD.gz" as a file-naming convention, where:

  • YMMDDHH denotes the end of the time interval (UTC) in which the key was submitted to our back-end server.

  • YMMDD denotes the day (UTC) for which the DTKs / day-keys in this file are considered active infections or retracted.

Does this mean that, in theory, 14 files can be generated per submit hour (considering that data for the 14 previous days will be exposed)? Why not add the day numbers related to the DTKs inside the file and just generate one file per hour?

I suggest publishing day numbers as they are calculated in the Google/Apple API proposal instead of YMMDD; this avoids having to translate between a date/time format and the day numbers as generated on the mobile phone.
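
In the Google/Apple proposal the day number is simply the number of days since the Unix epoch, so the conversion is a one-liner:

```python
def day_number(unix_seconds: int) -> int:
    """DayNumber as defined in the Google/Apple Exposure Notification spec: days since the Unix epoch."""
    return unix_seconds // (60 * 60 * 24)

assert day_number(1586649600) == 18364  # 2020-04-12 00:00 UTC
```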

Peter-Slump · Apr 12 '20 19:04

Inside both folders we use something like "YMMDDHH-YMMDD.gz" as a file-naming convention, where:

  • YMMDDHH denotes the end of the time interval (UTC) in which the key was submitted to our back-end server.
  • YMMDD denotes the day (UTC) for which the DTKs / day-keys in this file are considered active infections or retracted.

Does this mean that, in theory, 14 files can be generated per submit hour (considering that data for the 14 previous days will be exposed)?

Yes.

Why not add the day numbers related to the DTKs inside the file and just generate one file per hour?

Because I wanted to maximize transparency while keeping the file size small. It might also be easier on the app developers.

I suggest publishing day numbers as they are calculated in the Google/Apple API proposal instead of YMMDD; this avoids having to translate between a date/time format and the day numbers as generated on the mobile phone.

So you want the file format to become something like "YMMDDHH-UUUUU.gz", where U is a digit of the Unix-time hour?

I'm fine with that as well. Like I said, my initial design is by no means perfect.

spycrowsoft · Apr 12 '20 19:04

Because I wanted to maximize transparency while keeping the file size small. It might also be easier on the app developers.

File size should not differ much whether the data is split over multiple files or combined in one. One file will probably also be easier for app developers; otherwise they have to download multiple files, or at least try to, since they don't know whether they exist. What if there are no infections for a certain day/submit-hour combination: will that result in an empty file or in no file at all?

As said, two HTTP entry points will be good enough: one listing the exposures as submitted in a given hour (since the Unix epoch), and one listing the keys retracted in a given hour. These can be plain-text or CSV-like files with a DTK and its associated day number per line. The entry points can be as simple as /exposures/<hour number> and /retractions/<hour number> (naming is of course open for improvement). These HTTP(S) entry points can be placed behind a CDN; the CDN takes care of compressing the data and HTTPS takes care of authenticity.
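
A sketch of a client polling such entry points (the host and the empty-hour convention are assumptions; one "DTK,day number" line per key):

```python
import time
import urllib.error
import urllib.request

BASE = "https://example.org"  # placeholder host

def fetch_hour(kind: str, hour_number: int) -> list[tuple[str, int]]:
    """Fetch /exposures/<hour> or /retractions/<hour>; returns (dtk_hex, day_number) pairs."""
    try:
        with urllib.request.urlopen(f"{BASE}/{kind}/{hour_number}") as resp:
            body = resp.read().decode()
    except urllib.error.HTTPError as e:
        if e.code == 404:           # assuming 'no file' means no keys for that hour
            return []
        raise
    return [(dtk, int(day))
            for dtk, day in (line.split(",") for line in body.splitlines() if line)]

hour = int(time.time()) // 3600     # hours since the Unix epoch
new_exposures = fetch_hour("exposures", hour)
new_retractions = fetch_hour("retractions", hour)
```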

Peter-Slump · Apr 12 '20 20:04

Adding this for reference because it is also relevant here: https://github.com/ahupowerdns/covid-backend/issues/7#issuecomment-613015708

spycrowsoft · Apr 13 '20 20:04

Because I wanted to maximize transparency while keeping the file size small. It might also be easier on the app developers.

File size should not differ much whether the data is split over multiple files or combined in one. One file will probably also be easier for app developers; otherwise they have to download multiple files, or at least try to, since they don't know whether they exist. What if there are no infections for a certain day/submit-hour combination: will that result in an empty file or in no file at all?

As said, two HTTP entry points will be good enough: one listing the exposures as submitted in a given hour (since the Unix epoch), and one listing the keys retracted in a given hour. These can be plain-text or CSV-like files with a DTK and its associated day number per line. The entry points can be as simple as /exposures/<hour number> and /retractions/<hour number> (naming is of course open for improvement). These HTTP(S) entry points can be placed behind a CDN; the CDN takes care of compressing the data and HTTPS takes care of authenticity.

I fully agree with you.

There is even an additional argument to go for "one file per time interval" with the DayNumbers inside the file rather than in the filename:

Every file we create will have to be signed in some way. The overhead introduced by the signature can easily dominate the data in the file, which consists of DayNumber-plus-DTK records of 20 bytes each; a detached PGP signature, for instance, runs to a few hundred bytes, more than a dozen such records.

A system like GZip will compress similar DayNumber sequences and give us even further gains.

Also note that GZip support is built into many HTTP request libraries, so we can even skip the explicit compression step and publish plain-text or plain-binary files.

Therefore, it is probably more efficient to use a one file per hour approach.

It also allows us to change our update frequency dynamically, because we can extend the file-naming format to YMMDDHHmm.gz, which allows us to update every minute if we so desire.

And clients only need 2 + N requests: one to check whether a new update is available, a second to fetch the file index, and then one request per new file.
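
A sketch of that 2 + N flow against one of the folders (the URL and the update-check mechanism are assumptions; here the "is there an update?" request is a HEAD on the index, compared by ETag):

```python
import hashlib
import urllib.request

BASE = "https://example.org/active"  # placeholder folder URL

def fetch(name: str) -> bytes:
    with urllib.request.urlopen(f"{BASE}/{name}") as resp:
        return resp.read()

def sync(known: set[str], last_etag: str | None) -> str | None:
    """2 + N requests: update check, index fetch, then one request per new file."""
    # Request 1: has anything changed since last time?
    head = urllib.request.Request(f"{BASE}/SHA256SUMS", method="HEAD")
    with urllib.request.urlopen(head) as resp:
        etag = resp.headers.get("ETag")
    if etag is not None and etag == last_etag:
        return etag                       # nothing new: stop after one request
    # Request 2: the (signed) index itself.
    for line in fetch("SHA256SUMS").decode().splitlines():
        digest, name = line.split()
        if name in known:
            continue
        data = fetch(name)                # one request per new file
        assert hashlib.sha256(data).hexdigest() == digest  # verify against the index
        known.add(name)
        # ... hand the key data to the app layer here ...
    return etag
```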

spycrowsoft · Apr 16 '20 11:04