
Add correction for padding multiplier in "Verteilung TRL"

Open daimpi opened this issue 3 years ago • 28 comments

The plots in "Verteilung Transmission Risk Level (TRL) in Diagnoseschlüsseln" currently use the number of keys transmitted including the padded fake keys afaiu. As long as the padding factor stays the same this shouldn't be a problem. But this factor will change from tomorrow on (the plan is to bring it down to 1 eventually). The changes in the padding multiplier will cause some distortion in those graphs as new data will receive less weight.

My suggestion would be to use the data which has been corrected for this multiplier like in the "Geteilte Diagnoseschlüssel von positiv getesteten Personen" section. @mh- has introduced an automatic detection for the multiplier used in the data set in his parsing tool: https://github.com/corona-warn-app/cwa-server/issues/620#issuecomment-652511087
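To illustrate the suggested correction, here is a minimal sketch that divides the per-TRL key counts of each package by that package's own multiplier before summing. The `(multiplier, trl_values)` layout and the sample numbers are made up for the illustration; this is not the code used by the dashboard or by the parsing tool.

```python
from collections import Counter


def corrected_trl_counts(packages):
    """Sum per-TRL key counts after removing each package's padding.

    `packages` is a list of (multiplier, trl_values) pairs. This layout is
    hypothetical and only serves the illustration.
    """
    totals = Counter()
    for multiplier, trl_values in packages:
        for trl, padded_count in Counter(trl_values).items():
            # Each real key appears `multiplier` times in the padded package.
            totals[trl] += padded_count / multiplier
    return dict(totals)


# Made-up data: both days contain one real TRL-6 key and three real TRL-8
# keys, but the first day is padded 10x and the second only 5x.
old_day = (10, [6] * 10 + [8] * 30)
new_day = (5, [6] * 5 + [8] * 15)

corrected = corrected_trl_counts([old_day, new_day])
print(corrected)
```

Without the division, the first day would contribute twice as many keys as the second although both contain the same real keys, which is the distortion described above.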

daimpi avatar Jul 01 '20 20:07 daimpi

Is the tool by mh- able to see if the padding change really applied for the whole daily package? The upload numbers for today seem to be quite high. If that is a real increase I am happy ;-)

cfritzsche avatar Jul 03 '20 05:07 cfritzsche

Is the tool by mh- able to see if the padding change really applied for the whole daily package? The upload numbers for today seem to be quite high. If that is a real increase I am happy ;-)

No, the new multiplier 5 was applied during the day, so for more correct values you would have to use the hourly key packages. 2 or 3 of these still used 10.

mh- avatar Jul 03 '20 05:07 mh-

No, the new multiplier 5 was applied during the day, so for more correct values you would have to use the hourly key packages. 2 or 3 of these still used 10.

It was changed at 11 AM CEST (9 AM UTC; package 9; vide infra)

Is the tool by mh- able to see if the padding change really applied for the whole daily package? The upload numbers for today seem to be quite high. If that is a real increase I am happy ;-)

Good news! I checked this twice and also uploaded the hourly packages. The number of users seems to be correct:

sum of hourly packages: 3+2+5+8+7+4+3+5 = 37 users


daily package:

37 user(s) found.
They submitted these numbers of keys:
4 user(s): 1 Diagnosis Key(s)
3 user(s): 4 Diagnosis Key(s)
1 user(s): 5 Diagnosis Key(s)
1 user(s): 6 Diagnosis Key(s)
1 user(s): 7 Diagnosis Key(s)
1 user(s): 8 Diagnosis Key(s)
1 user(s): 9 Diagnosis Key(s)
25 user(s): 13 Diagnosis Key(s)
80 keys not parsed (16 without padding).
37 / 4*1, 3*4, 1*5, 1*6, 1*7, 1*8, 1*9, 25*13

hourly package 6:

Length: 390 keys
Padding Multiplier detected: 10
3 user(s) found.
They submitted these numbers of keys:
3 user(s): 13 Diagnosis Key(s)
0 keys not parsed (0 without padding).
3 / 3*13

hourly package 7:

Length: 260 keys
Padding Multiplier detected: 10
2 user(s) found.
They submitted these numbers of keys:
2 user(s): 13 Diagnosis Key(s)
0 keys not parsed (0 without padding).
2 / 2*13

hourly package 9:

Length: 220 keys
Padding Multiplier detected: 5
5 user(s) found.
They submitted these numbers of keys:
1 user(s): 1 Diagnosis Key(s)
1 user(s): 4 Diagnosis Key(s)
3 user(s): 13 Diagnosis Key(s)
0 keys not parsed (0 without padding).
5 / 1*1, 1*4, 3*13

hourly package 11:

Length: 200 keys
Padding Multiplier detected: 5
8 user(s) found.
They submitted these numbers of keys:
1 user(s): Invalid Transmission Risk Profile
2 user(s): 1 Diagnosis Key(s)
1 user(s): 3 Diagnosis Key(s)
2 user(s): 4 Diagnosis Key(s)
1 user(s): 8 Diagnosis Key(s)
1 user(s): 13 Diagnosis Key(s)
Old Android app used by 1 user(s).
30 keys not parsed (6 without padding).
8 / 2*1, 1*3, 2*4, 1*8, 1*13 (1 old Android app(s))

hourly package 14:

Length: 345 keys
Padding Multiplier detected: 5
7 user(s) found.
They submitted these numbers of keys:
1 user(s): 1 Diagnosis Key(s)
1 user(s): 7 Diagnosis Key(s)
1 user(s): 9 Diagnosis Key(s)
4 user(s): 13 Diagnosis Key(s)
0 keys not parsed (0 without padding).
7 / 1*1, 1*7, 1*9, 4*13

hourly package 15:

Length: 195 keys
Padding Multiplier detected: 5
4 user(s) found.
They submitted these numbers of keys:
1 user(s): 1 Diagnosis Key(s)
1 user(s): 8 Diagnosis Key(s)
2 user(s): 13 Diagnosis Key(s)
20 keys not parsed (4 without padding).
4 / 1*1, 1*8, 2*13

hourly package 16:

Length: 195 keys
Padding Multiplier detected: 5
3 user(s) found.
They submitted these numbers of keys:
3 user(s): 13 Diagnosis Key(s)
0 keys not parsed (0 without padding).
3 / 3*13

hourly package 19:

Length: 155 keys
Padding Multiplier detected: 5
5 user(s) found.
They submitted these numbers of keys:
1 user(s): Invalid Transmission Risk Profile
1 user(s): 1 Diagnosis Key(s)
1 user(s): 5 Diagnosis Key(s)
1 user(s): 7 Diagnosis Key(s)
1 user(s): 13 Diagnosis Key(s)
Old Android app used by 1 user(s).
25 keys not parsed (5 without padding).
5 / 1*1, 1*5, 1*7, 1*13 (1 old Android app(s))
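As a cross-check of the hourly logs above, the following sketch tests whether each package length equals the detected multiplier times the number of real keys (parsed keys plus unparsed keys without padding). The numbers are copied from the logs; the tuple layout and the function name are only chosen for this illustration.

```python
# (hour, length, multiplier, {keys_per_user: users}, unparsed keys without padding)
hourly = [
    (6, 390, 10, {13: 3}, 0),
    (7, 260, 10, {13: 2}, 0),
    (9, 220, 5, {1: 1, 4: 1, 13: 3}, 0),
    (11, 200, 5, {1: 2, 3: 1, 4: 2, 8: 1, 13: 1}, 6),
    (14, 345, 5, {1: 1, 7: 1, 9: 1, 13: 4}, 0),
    (15, 195, 5, {1: 1, 8: 1, 13: 2}, 4),
    (16, 195, 5, {13: 3}, 0),
    (19, 155, 5, {1: 1, 5: 1, 7: 1, 13: 1}, 5),
]


def is_consistent(length, multiplier, keys_per_user, unparsed):
    """True if length == multiplier * (parsed keys + unparsed keys)."""
    parsed = sum(n_keys * n_users for n_keys, n_users in keys_per_user.items())
    return length == multiplier * (parsed + unparsed)


for hour, length, multiplier, keys_per_user, unparsed in hourly:
    print(hour, is_consistent(length, multiplier, keys_per_user, unparsed))
```

All eight packages pass this check, so the detected multipliers at least fit the reported key counts.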

micb25 avatar Jul 03 '20 09:07 micb25

I assume the package at 19: has only 3 users. Old android should no longer be possible after pushing server version 1.0.9 online https://github.com/corona-warn-app/cwa-server/issues/640

one user 13 keys (1.7-19.6), 1 user 12 keys (1.7-19.6; has no key for 24.6), and 1 user 6 keys (1.7-26.6)

or 4 Users if no hole is allowed: one user 13 keys (1.7-19.6), 1 user 7 keys (1.7- 25.6), 1 user 6 keys (1.7-26.6), and 1 user 5 keys (23.6-19.6)

Tho-Mat avatar Jul 03 '20 09:07 Tho-Mat

I assume the package at 19: has only 3 users. Old android should no longer be possible after pushing server version 1.0.9 online corona-warn-app/cwa-server#640

This also affects package 11: one user with an "Invalid Transmission Risk Profile". So, it might be 2 users less for yesterday.

micb25 avatar Jul 03 '20 09:07 micb25

It was changed at 11 AM CEST (9 AM UTC; package 9; vide infra)

I think you are right, but: how do you know that? 2*5 = 10, so the packages at 06: and 07: could also have a multiplier of 5.

Tho-Mat avatar Jul 03 '20 09:07 Tho-Mat

how do you know that? 2*5 = 10, so the packages at 06: and 07: could also have a multiplier of 5.

You are absolutely right. My claim was just based on the inspection of the hourly packages. I don't see any way to improve the estimated numbers for yesterday. Hopefully, we do not see these multiplier changes too frequently.
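The ambiguity can be shown with a small sketch. It assumes that every user in the package submitted 13 keys (as the parser reported for packages 6 and 7) and lists, for each candidate multiplier, the number of users that would explain the package length. The function name and the candidate list are chosen for this illustration only.

```python
def interpretations(length, keys_per_user=13, candidates=(5, 10)):
    """Map each candidate multiplier to the user count it would imply.

    Assumes all users submitted `keys_per_user` keys; candidates that do not
    divide the package length evenly are skipped.
    """
    results = {}
    for multiplier in candidates:
        if length % multiplier != 0:
            continue
        real_keys = length // multiplier
        if real_keys % keys_per_user == 0:
            results[multiplier] = real_keys // keys_per_user
    return results


print(interpretations(390))  # package 6
print(interpretations(260))  # package 7
```

For package 6 this gives 3 users at multiplier 10 or 6 users at multiplier 5, and for package 7 it gives 2 or 4 users, so the length alone cannot decide between the two.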

micb25 avatar Jul 03 '20 10:07 micb25

I assume the package at 19: has only 3 users. Old android should no longer be possible after pushing server version 1.0.9 online corona-warn-app/cwa-server#640

This also affects package 11: one user with an "Invalid Transmission Risk Profile". So, it might be 2 users less for yesterday.

For package 11 I get 7 users if holes are allowed and 8 users if no hole is allowed.

Tho-Mat avatar Jul 03 '20 10:07 Tho-Mat

I am not sure if this is the right place for this, but you may have seen the Spiegel interview with Mr. Spahn (here, paywall). He says:

SPIEGEL: How many infections have been entered in the app so far? Spahn: We assume that around 300 infections have been reported via the app so far. That is the number of encryption codes issued by the hotline in order to warn others. For data protection reasons, we do not know more than that.

Do you think people would go through the trouble of calling the hotline and then not submit, or is there an issue with the padding factor calculation that leads to a result that is off by a factor of two?

kai-truempler avatar Jul 03 '20 11:07 kai-truempler

My guess is that it's in the first day, since no one knows how the packet from "2020-06-23" is actually padded.

janpf avatar Jul 03 '20 11:07 janpf

Do you think people would go through the trouble of calling the hotline and then not submit, or is there an issue with the padding factor calculation that leads to a result that is off by a factor of two?

@kai-truempler: Thanks for sharing this. I totally agree and I would rather expect people not to call the hotline in case of a positive test (stigma, time, effort, etc.).

My guess is that it's in the first day, since no one knows how the packet from "2020-06-23" is actually padded.

@janpf: This might be an issue; however, I want to point out that every day there is a significant number of keys which are not parsed (vide infra).

Thus, I would expect the estimates by diagnosis-keys from @mh- to be rather conservative (which I personally prefer). I may add a chart with these unparsed key numbers. Ultimately, it would be beneficial if these numbers were published by an official institution such as the RKI on a daily basis (in addition to the daily download counts).

2020-06-23.dat:89 keys not parsed (8 without padding).
2020-06-24.dat:30 keys not parsed (3 without padding).
2020-06-25.dat:50 keys not parsed (5 without padding).
2020-06-26.dat:150 keys not parsed (15 without padding).
2020-06-27.dat:250 keys not parsed (25 without padding).
2020-06-28.dat:40 keys not parsed (4 without padding).
2020-06-29.dat:100 keys not parsed (10 without padding).
2020-06-30.dat:160 keys not parsed (16 without padding).
2020-07-01.dat:290 keys not parsed (29 without padding).
2020-07-02.dat:80 keys not parsed (16 without padding).
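For the chart mentioned above, the list can be turned into numbers with a few lines of Python. This is only a sketch: the regular expression is written against the ten lines quoted here, not against a documented output format of the parser.

```python
import re

LOG = """\
2020-06-23.dat:89 keys not parsed (8 without padding).
2020-06-24.dat:30 keys not parsed (3 without padding).
2020-06-25.dat:50 keys not parsed (5 without padding).
2020-06-26.dat:150 keys not parsed (15 without padding).
2020-06-27.dat:250 keys not parsed (25 without padding).
2020-06-28.dat:40 keys not parsed (4 without padding).
2020-06-29.dat:100 keys not parsed (10 without padding).
2020-06-30.dat:160 keys not parsed (16 without padding).
2020-07-01.dat:290 keys not parsed (29 without padding).
2020-07-02.dat:80 keys not parsed (16 without padding).
"""

PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2})\.dat:(\d+) keys not parsed \((\d+) without padding\)\.$"
)


def unparsed_per_day(log):
    """Return {date: (padded_count, unpadded_count)} for matching lines."""
    result = {}
    for line in log.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            result[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return result


per_day = unparsed_per_day(LOG)
total_unpadded = sum(unpadded for _, unpadded in per_day.values())
print(total_unpadded)
```

Over these ten days, the unparsed keys add up to 131 keys without padding.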

micb25 avatar Jul 03 '20 12:07 micb25

Oh absolutely true, I forgot about those "keys not parsed"

What might be beneficial: on my dashboard I just changed to an hourly analysis, as suggested above by @mh-. This means I check every hourly package and calculate the padding, number of keys, number of users etc. individually and then sum things up.
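Why the hourly analysis matters on a day with a multiplier change can be sketched with the package lengths and multipliers from the hourly logs for 2020-07-02 quoted above. The sketch assumes that the daily package is simply the concatenation of these eight hourly packages; the function names are made up for the illustration.

```python
# (length, detected multiplier) of the eight hourly packages quoted above
hourly_packages = [
    (390, 10), (260, 10),
    (220, 5), (200, 5), (345, 5), (195, 5), (195, 5), (155, 5),
]


def real_keys_hourly(packages):
    """Remove the padding per package, then sum the real keys."""
    return sum(length // multiplier for length, multiplier in packages)


def real_keys_daily(packages, multiplier):
    """Apply one multiplier to the whole day (assumed concatenation)."""
    return sum(length for length, _ in packages) // multiplier


print(real_keys_hourly(hourly_packages))    # per-package correction
print(real_keys_daily(hourly_packages, 5))  # single multiplier for the day
```

The per-package correction yields 327 real keys, while a single multiplier of 5 for the whole day yields 392, because the two packages padded with 10 are counted twice.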

This way I'm currently at a total of 218 users and thereby off by a factor of 1.37 @kai-truempler ;) And if we now consider "keys not parsed" and Mr. Spahn maybe rounding numbers a bit I think it's very hard to get closer to the real number.

Ultimately, it would be beneficial if these numbers were published by an official institution such as the RKI on a daily basis (in addition to the daily download counts).

Absolutely.

janpf avatar Jul 03 '20 13:07 janpf

With parsing all keys you can get a minimum number of infected persons.
Theoretically each single key could belong to one person (the maximum).

If I count the minimum number of users who submitted keys, I get roughly 250 (23.06. - 02.07.). So there may be 250 users, which rounds to the announced figure: 300 = round(250; -2)

Tho-Mat avatar Jul 03 '20 13:07 Tho-Mat

And I'm back down to 188 as the parser just got updated: https://github.com/mh-/diagnosis-keys/commit/104388c7785ef4870e04e34e4290422b756e1ead

janpf avatar Jul 03 '20 15:07 janpf

Ok, maybe I could change the strategy, now that "old Android apps" cannot submit Diagnosis Keys anymore. For this, it would be nice to understand what information you need from the parsing.

For example, just counting the number of users is very simple now, it would just require counting all keys with TRL 6, because every user will submit exactly one key with that TRL. (And of course divide by the padding multiplier.)
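A minimal sketch of that counting rule, assuming the TRL values of one package are available as a plain list and the padding multiplier is already known. The function name and the sample data are made up; this is not the parser's actual code.

```python
def estimate_users(trl_values, multiplier, marker_trl=6):
    """Count keys with the marker TRL and remove the padding.

    Relies on the statement above that every user submits exactly one key
    with TRL 6.
    """
    marker_keys = sum(1 for trl in trl_values if trl == marker_trl)
    return marker_keys // multiplier


# Made-up package: 3 users, padding multiplier 5, so 15 keys carry TRL 6.
sample = [6] * 15 + [8] * 45 + [1] * 60
print(estimate_users(sample, multiplier=5))
```

The other TRL values in the package do not affect the result, so the estimate does not depend on how many keys each user submitted.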

The harder part is to count the number of keys per each user, something that I wanted to do in order to find out if keys can be linked together (violating the "non-linkability-across-multiple-day" promise).

So what exactly do you want from the parser?

mh- avatar Jul 03 '20 16:07 mh-

Great idea counting the "6"s! Gonna change to that later for the overall user count and most likely going to keep your "counting script" as is for the "number of keys published per user".

Update: did change it and now we're back up to ~200. So still pretty far from the announced 300, but since there are only ~200 "6"s in the database this should be pretty reliable.

janpf avatar Jul 03 '20 17:07 janpf

I added the option -n / --new-android-apps-only to the parser script. If you use this, this should decrease the number of unparsed keys. However, in the near future it might become impossible to do correct counting, see the end of https://github.com/mh-/diagnosis-keys/blob/master/doc/algorithm.md for details.

mh- avatar Jul 04 '20 06:07 mh-

However, in the near future it might become impossible to do correct counting, see the end of https://github.com/mh-/diagnosis-keys/blob/master/doc/algorithm.md for details.

Just looking at the example you provided there, you can still at least provide the minimum user count. You can still have the case that it is in fact more users transmitting only random unconnected days, but if you have too many „1“s or „6“s than one user can have, it’s still at least two users. You could collect the minimum user count per risk level (minimum number for the 1s, 6s etc) and then take the max() of them to come to the absolute minimum users generating these keys.
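This idea can be sketched as follows. The caps in `max_per_user` (how many keys of a given risk level a single user can have) are assumptions for the illustration and would have to be taken from the actual transmission risk profile; the input is assumed to be unpadded.

```python
from collections import Counter
from math import ceil


def minimum_users(trl_values, max_per_user):
    """Lower bound on the number of users behind a list of unpadded keys.

    For each risk level with a known per-user cap, at least
    ceil(count / cap) users are needed; the overall bound is the maximum.
    Risk levels without a cap are ignored.
    """
    counts = Counter(trl_values)
    return max(ceil(counts[trl] / cap) for trl, cap in max_per_user.items())


# Assumed caps: at most one "6" and at most seven "1"s per user.
caps = {6: 1, 1: 7}

print(minimum_users([6] + [1] * 9, caps))   # one "6", nine "1"s
print(minimum_users([6, 6, 6, 1, 1], caps))  # three "6"s, two "1"s
```

In the first case the nine "1"s force at least two users even though there is only one "6"; in the second case the three "6"s force at least three users.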

cfritzsche avatar Jul 04 '20 09:07 cfritzsche

Yes, in the example with the 14 keys, there must have been between 2 and 14 users. This is a wide range, though.

mh- avatar Jul 04 '20 12:07 mh-

Ok, sure, but in almost all cases it will be the minimum number or very close to it. Which is good enough for the kind of analytics most are looking for.

cfritzsche avatar Jul 04 '20 18:07 cfritzsche

Note: If you download the hourly/daily packages again, you will notice that their content changes. It seems that keys older than 14 days are deleted. Keys are also moved to other days: 60 keys from 24.06. have now moved to 23.06., and 44 keys were deleted for 23.06. For 23.06., the hourly files changed from 08, 13, 17 to 10, 15, 18. So to get the right keys you have to use the files downloaded on the day they were published.

Tho-Mat avatar Jul 05 '20 07:07 Tho-Mat

I have made an Excel sheet and examined the keys manually. I took into account that a device could be switched off for one or more days. Nearly every key chain could be assigned. Only the 23.06. 8:00 keys are not so clear. 01.07. 17:00 is the only one that contains a chain with no "6". I think the 6 was not submitted, or was deleted, because it was too old (17.06.). All in all I get 219 (minimum) users who submitted keys. The maximum should be 241.

https://github.com/Tho-Mat/corona-stuff/blob/master/%C3%BCberblick.xlsx

Tho-Mat avatar Jul 05 '20 08:07 Tho-Mat

Note: If you download the hourly/daily packages again, you will notice that their content changes. So to get the right keys you have to use the files downloaded on the day they were published.

Is there any information on why they would do this?

All in all I get 219 (minimum) users who submitted keys.

Just by counting "6"s I get 231 with the "new" packages for 23./24. and 226 with the old ones. And this method is still more of a lower bound, since it misses some, as you correctly pointed out:

01.07. 17:00 is the only one that contains a chain with no "6".

Update: I noticed you're doing a "per-key"-padding analysis, while I'm on a "per-package"-basis. That explains the differences. 👍

janpf avatar Jul 05 '20 08:07 janpf

Is there any information on why they would do this?

I think they want to reduce traffic, since it makes no sense to check keys that are older than 14 days.

Tho-Mat avatar Jul 05 '20 08:07 Tho-Mat

Note: If you download the hourly/daily packages again, you will notice that their content changes. It seems that keys older than 14 days are deleted. Keys are also moved to other days.

@Tho-Mat: Thanks for your comment. At first, I was already a little bit confused last night, because the old hourly packages were changed. My wrong assumption was that the clean-up of the keys older than 14 days is based on a package level and not on the individual key level.
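A clean-up on the individual key level could look roughly like the sketch below. It uses the Exposure Notification convention that a key's rolling start interval number counts 10-minute intervals since the Unix epoch; the exact cut-off rule and all function names are assumptions for the illustration, not the server's actual implementation.

```python
from datetime import datetime, timedelta, timezone

INTERVAL_SECONDS = 600  # Exposure Notification intervals are 10 minutes long


def to_interval(moment):
    """Convert an aware datetime to a rolling start interval number."""
    return int(moment.timestamp()) // INTERVAL_SECONDS


def key_start(interval_number):
    """Convert a rolling start interval number back to a UTC datetime."""
    return datetime.fromtimestamp(interval_number * INTERVAL_SECONDS, tz=timezone.utc)


def drop_expired(interval_numbers, now, max_age_days=14):
    """Keep only keys that started within the last `max_age_days` days.

    The >= comparison at the cut-off is an assumption of this sketch.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [n for n in interval_numbers if key_start(n) >= cutoff]


now = datetime(2020, 7, 5, tzinfo=timezone.utc)
old_key = to_interval(datetime(2020, 6, 20, tzinfo=timezone.utc))
recent_key = to_interval(datetime(2020, 6, 23, tzinfo=timezone.utc))

kept = drop_expired([old_key, recent_key], now)
print([key_start(n).date().isoformat() for n in kept])
```

With such a rule, a package shrinks key by key as its oldest keys expire instead of disappearing as a whole, which would explain why previously published packages change.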

micb25 avatar Jul 05 '20 14:07 micb25

Just as an update to my previous comment, from Phoenix:

Lothar Wieler: "...[around] 500 teleTANs have been issued."

That looks closer to the estimate than the 300 from Mr. Spahn 10 days ago.

kai-truempler avatar Jul 13 '20 16:07 kai-truempler

That looks closer to the estimate than the 300 from Mr. Spahn 10 days ago.

Fortunately, the RKI is publishing these numbers on a weekly basis. Thus, I have added another diagram for the published teleTANs last night. However, it is a single PDF which gets overwritten every week.

Looking at the number of issued teleTANs: in one week (06/07-13/07), 125 teleTANs were issued. In the same period, parse_keys.py counted 102 unique users based on the hourly package data, which gives a ratio (users counted vs. issued teleTANs) of about 82%. I'm very interested in where the larger error comes from (estimated users vs. people getting a teleTAN but not sharing their keys). Furthermore, these statistics suggest that the intended way of sharing your keys, based on a lab test combined with a QR code, is insignificant at the moment.

micb25 avatar Jul 15 '20 20:07 micb25

I think this issue can be closed, now that padding multiplier is set to one on the server. @micb25 do you agree?

daimpi avatar Oct 23 '20 14:10 daimpi