safeplaces-dct-app

CONCERN: Relative rather than absolute time risks missing overlaps

Open diarmidmackenzie opened this issue 4 years ago • 13 comments

I think this might be a bug, but that depends on the exact algorithms we use for collision detection, and I don't actually know what those are. I would appreciate review from someone who knows the implementation better.

Perhaps our collision detection algorithm is sophisticated enough to avoid this kind of problem?
(Do we have documentation of our collision algorithm?)

A and B meet for a walk.

A's phone records data every 5 mins starting at 12:00

B's phone records data every 5 mins starting at 12:02.

They walk at 6kph = 100m/minute.

A points recorded: (12:00, 0), (12:05, 500), (12:10, 1000), etc.

B points recorded: (12:02, 200), (12:07, 700), (12:12, 1200), etc.

By a simple naive comparison of individual points, we will never see A and B in the same place at the same time.

In fact they spent a long time in each others' company.
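
To make the example concrete, here is a minimal sketch of the naive point-by-point comparison. It assumes positions are reduced to metres along a shared route and times to minutes after 12:00. The names `recordTrail` and `closestRecordedDistance` are illustrative only and are not taken from the app.

```javascript
// Hypothetical helper: sample a walker every `intervalMin` minutes.
// `startMin` is minutes after 12:00; both walkers left the origin at 12:00.
function recordTrail(startMin, intervalMin, speedMPerMin, count) {
  const trail = [];
  for (let i = 0; i < count; i++) {
    const t = startMin + i * intervalMin;
    trail.push({ t, pos: t * speedMPerMin });
  }
  return trail;
}

// Smallest distance between any recorded point of A and any recorded
// point of B, ignoring time entirely.
function closestRecordedDistance(a, b) {
  let best = Infinity;
  for (const p of a) {
    for (const q of b) {
      best = Math.min(best, Math.abs(p.pos - q.pos));
    }
  }
  return best;
}

const trailA = recordTrail(0, 5, 100, 12); // 12:00, 12:05, ...
const trailB = recordTrail(2, 5, 100, 12); // 12:02, 12:07, ...

// Naive check: same timestamp and same position.
const sameTimeMatches = trailA.filter(p =>
  trailB.some(q => q.t === p.t && q.pos === p.pos)
);

console.log(sameTimeMatches.length);                  // no naive matches
console.log(closestRecordedDistance(trailA, trailB)); // metres between the nearest recorded points
```

Even with time ignored, the nearest recorded points of the two trails are 200 m apart, although the walkers were side by side throughout.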

A possible solution would be logging at absolute times: we check everyone's phone at exactly 12:00, 12:05, 12:10 etc.
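
A rough sketch of what "logging at absolute times" could mean in code: compute the delay until the next wall-clock boundary and schedule the first sample for then. The function name `msUntilNextBoundary` is hypothetical, and the sketch assumes the platform actually delivers location fixes on schedule, which is not guaranteed.

```javascript
// Milliseconds to wait from `nowMs` (Unix epoch ms) until the next
// multiple of `intervalMs`. Epoch time starts on a minute boundary, so
// multiples of 5 minutes line up with :00, :05, :10, ... on the clock.
function msUntilNextBoundary(nowMs, intervalMs) {
  const remainder = nowMs % intervalMs;
  return remainder === 0 ? 0 : intervalMs - remainder;
}

const FIVE_MINUTES_MS = 5 * 60 * 1000;

// Example: a phone that starts logging at 12:02 UTC waits 3 minutes,
// so its first aligned sample lands on 12:05.
const startedAt = Date.UTC(2020, 3, 14, 12, 2, 0);
const delayMs = msUntilNextBoundary(startedAt, FIVE_MINUTES_MS);
console.log(delayMs / 60000);
```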

Another alternative would be some sophisticated position interpolation algorithm (which I have not worked out in detail).

How does our current implementation stack up against this kind of example?

diarmidmackenzie avatar Apr 14 '20 20:04 diarmidmackenzie

We account for this. We compare using windows of time and space -- you must be within a 4 hour window and also within a 20m window.

So this is not a problem.
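
As a rough illustration of this kind of check (a sketch, not the app's actual code), two points can be compared against a time window and a distance window. The field names `latitude`, `longitude` and `time`, and the reading of the window as +/- 4 hours, are assumptions made for the example.

```javascript
const MAX_DISTANCE_M = 20;                   // 20 m window
const MAX_TIME_DIFF_MS = 4 * 60 * 60 * 1000; // assumed to mean +/- 4 hours
const EARTH_RADIUS_M = 6371000;

// Great-circle distance in metres between two lat/lon points (haversine).
function haversineMeters(a, b) {
  const toRad = deg => (deg * Math.PI) / 180;
  const dLat = toRad(b.latitude - a.latitude);
  const dLon = toRad(b.longitude - a.longitude);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.latitude)) * Math.cos(toRad(b.latitude)) *
      Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
}

// True when two logged points fall inside both windows.
function isWithinWindow(a, b) {
  return (
    Math.abs(a.time - b.time) <= MAX_TIME_DIFF_MS &&
    haversineMeters(a, b) <= MAX_DISTANCE_M
  );
}

// All pairs of points from the two trails that count as an overlap.
function findOverlaps(trailA, trailB) {
  const overlaps = [];
  for (const a of trailA) {
    for (const b of trailB) {
      if (isWithinWindow(a, b)) overlaps.push([a, b]);
    }
  }
  return overlaps;
}
```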

penrods avatar Apr 17 '20 16:04 penrods

Thanks - is the algorithm documented somewhere? (If not, I will find a place to document it.)

And is there any story about how it was derived? Why 20m? Why 4 hours?

(I thought I had read 70m somewhere, but maybe misremembered).

Is that +/-4 hours, or +/-2 hours? Or is the interval asymmetric? (There is not much point in matching an hour before the infected person was there, but potentially value in matching 3-4 hours after the infected person was there.)

diarmidmackenzie avatar Apr 17 '20 16:04 diarmidmackenzie

Steve, with the algorithm you describe, I think this problem will exist.

B is never observed within 20m of 0m, 500m, 1000m.

A is never observed within 20m of 200m, 700m, 1200m.

If I understand your algorithm correctly, then there will be no point at which a collision is detected.

Don't seem to have permissions to re-open, but I am unconvinced that we don't have a problem here.

diarmidmackenzie avatar Apr 17 '20 22:04 diarmidmackenzie

The distance was approx 70' previously, but has been switched to 20m (65') in the latest round of code. Yes, this is somewhat arbitrary, but is based on the accuracy of a standard geohash, which at one time appeared to be a likely approach for encryption. So I went with that number.

The 4 hour window is also somewhat arbitrary, but was based on some information about the survival of the virus on surfaces. This likely needs further validation, but nobody has pushed back when I've explained it to healthcare professionals.

And yes, you are correct that two individuals who cross paths only briefly can be overlooked. Generally speaking, healthcare professionals don't worry about < 5 minutes of contact, so that is an unimportant worry to them.

penrods avatar Apr 20 '20 11:04 penrods

Steve - thanks for the info.

On the 4 hour window: I'd expect an asymmetric algorithm there, since you can't pick up COVID from an infected person if you go to a place before they do.

So:

  • 5 mins before (just to cover timestamp inaccuracy)
  • Up to 4 hours after.
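
The asymmetric window in the two bullets above could be sketched like this. The constant and function names are made up for illustration; timestamps are assumed to be Unix epoch milliseconds.

```javascript
const BEFORE_MS = 5 * 60 * 1000;     // 5 mins before, to cover timestamp inaccuracy
const AFTER_MS = 4 * 60 * 60 * 1000; // up to 4 hours after

// True when the user's visit falls in the window around the infected
// person's visit: from 5 minutes before it to 4 hours after it.
function isInExposureWindow(infectedTimeMs, userTimeMs) {
  const delta = userTimeMs - infectedTimeMs;
  return delta >= -BEFORE_MS && delta <= AFTER_MS;
}

const MINUTE_MS = 60 * 1000;
console.log(isInExposureWindow(0, 3 * 60 * MINUTE_MS)); // 3 hours after
console.log(isInExposureWindow(0, -60 * MINUTE_MS));    // 1 hour before
```

With these values, a visit 3 hours after the infected person matches, while a visit 1 hour before does not.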

On the specific concern I was trying to capture here, I think you may have missed the point. This is not about people crossing paths.

In this situation 2 people can go for a walk together, spending 1-2 hours or more in the exact same place as each other, but as long as they move > 10m/minute (which is a pretty slow walk), they will never detect a match in the trails if their timers are popping in anti-phase.

An envisionable solution (i.e. a "bug fix") might be to trace a linear path from each point to the next in time, and also look for matches every 20m or so along that linear path. This works for the "walk" example. It doesn't work for the case where someone travels by subway, and pops up 20 mins later, 3 miles away. Redaction probably plays a role in helping with the subway examples.
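
A minimal sketch of the linear-path idea, under some simplifying assumptions: positions are planar `x`/`y` coordinates in metres, times `t` are in minutes, and a "match" means within 20 m and within 1 minute. The 1 minute tolerance and all names here are invented for the example.

```javascript
// Insert interpolated points along each straight segment so that
// consecutive points are at most `stepM` metres apart.
function densify(trail, stepM) {
  const out = [];
  for (let i = 0; i < trail.length - 1; i++) {
    const p = trail[i];
    const q = trail[i + 1];
    const dist = Math.hypot(q.x - p.x, q.y - p.y);
    const steps = Math.max(1, Math.ceil(dist / stepM));
    for (let s = 0; s < steps; s++) {
      const f = s / steps; // fraction of the way from p to q
      out.push({
        t: p.t + f * (q.t - p.t),
        x: p.x + f * (q.x - p.x),
        y: p.y + f * (q.y - p.y),
      });
    }
  }
  if (trail.length > 0) out.push({ ...trail[trail.length - 1] });
  return out;
}

// Count points of `b` that have some point of `a` close in space and time.
function countMatches(a, b, maxDistM, maxDtMin) {
  return b.filter(q =>
    a.some(p =>
      Math.hypot(p.x - q.x, p.y - q.y) <= maxDistM &&
      Math.abs(p.t - q.t) <= maxDtMin
    )
  ).length;
}

// The walk example: A logs at 12:00, 12:05, ...; B logs at 12:02, 12:07, ...
const walkA = [0, 5, 10, 15].map(t => ({ t, x: 100 * t, y: 0 }));
const walkB = [2, 7, 12].map(t => ({ t, x: 100 * t, y: 0 }));

console.log(countMatches(walkA, walkB, 20, 1));              // raw points only
console.log(countMatches(densify(walkA, 20), walkB, 20, 1)); // with interpolation
```

On the raw points none of B's fixes match, while against the densified trail all three do. A straight-line assumption like this would place a subway rider at intermediate points they may never have visited, which is the limitation noted above.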

The question of how we test whether these algorithms are correct is a really interesting one. As well as thought experiments like the one above, we are developing some really interesting ideas, both around Testing in Production, and Synthetic data sets, and we have some people with substantial prior expertise in testing location based apps.

If you want to know more about what we are planning here, reach out & we can chat about it.

diarmidmackenzie avatar Apr 20 '20 12:04 diarmidmackenzie

This is a valid concern with off-cycle location sampling every 5 minutes (as opposed to logging upon every motion).

However, I feel it is a low priority feature enhancement. I suspect that it may be addressed as a side effect as we re-work the cadence of GPS recording (#454), investigate using accelerometer data to determine WHEN we GPS log (#403), etc. [I link to those issues so that people working on them may review this one as well.]

I'm not super worried about the example because if you go on a walk with somebody for any length of time, you presumably know them and can tell the contact tracer their name/contact info when questioned about who you spent time with in the last 2 weeks. The GPS log mostly assists with jogging the user's memory about where/when they went, as well as catching people the user didn't know/notice were near them, or subsequent contact due to contact/surface contamination.

summetj avatar May 01 '20 16:05 summetj

Modifying the algorithm to sample at 5 minute intervals tied to absolute time (12:00, 12:05, etc.) seems like an easy fix that could be applied with minimal changes to the codebase, so I'm going to label this "good first issue". If somebody picks it up, they should ask people in #appdev on slack to make sure that the absolute vs relative time change doesn't mess something else up before putting effort into it.

summetj avatar May 01 '20 16:05 summetj

Agreed, care needed before making any changes here. Also note that some of the Safe Places data will come from other sources (e.g. Google Takeout), where we can't assume logs are on rigid 5 minute boundaries, so just fixing what the app outputs is not a complete solution here.

I am doing some work with some data science people to do some modelling to determine how good/bad the current algorithm is in a range of scenarios. I think we should complete that work before we make any significant changes here.

Re: comment above, about how you probably know someone you go for a walk with, going for a walk is only an example. Could also be: sitting on a bus next to someone (where it's quite likely you don't know them).

diarmidmackenzie avatar May 01 '20 16:05 diarmidmackenzie

... should ask people in #appdev on slack to make sure that absolute vs relative time change doesn't mess something else up before putting effort into it.

@summetj Could you please add me to the slack workspace? Alternatively, point me to the correct person to ask. Email is [email protected].

henryrossiter avatar May 04 '20 19:05 henryrossiter

Hi Henry,

I will add you to Slack now, but please also fill out the volunteer intake form (where we normally add you from) so you get all of the other onboarding information as well. https://docs.google.com/forms/d/e/1FAIpQLSdzqAxrlrxb_HqLh1KXnfPu1rse4aByS2krL1OYlN3qKChyqA/viewform

summetj avatar May 04 '20 20:05 summetj

After examining the data that my phone logs, I think this might be futile. The intervals are not uniform, so even if two phones start geolocation logging at the exact same time, they likely won't be synchronized after just a few intervals.

I set this.locationInterval = 60000 * 3, meaning we want geolocation every 3 minutes. However, the actual data points recorded show there's anywhere from ~3 to ~6 minutes between each data point.
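
For anyone who wants to repeat the check on their own log, here is a small sketch that computes the gaps between consecutive fixes. The function name is hypothetical and the sample timestamps are made up for illustration; they are not my actual log.

```javascript
// Gaps in minutes between consecutive timestamps (Unix epoch ms),
// plus the smallest and largest gap.
function intervalStats(timestampsMs) {
  const gaps = [];
  for (let i = 1; i < timestampsMs.length; i++) {
    gaps.push((timestampsMs[i] - timestampsMs[i - 1]) / 60000);
  }
  return { gaps, min: Math.min(...gaps), max: Math.max(...gaps) };
}

// Made-up timestamps with a nominal 3 minute interval.
const sampleLog = [0, 180000, 450000, 810000];
const stats = intervalStats(sampleLog);
console.log(stats.gaps, stats.min, stats.max);
```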

henryrossiter avatar May 11 '20 13:05 henryrossiter

@diarmidmackenzie - Unless the cadence issue can be fixed, it looks like setting things to an absolute time won't work. I'm considering closing this issue, have you seen any potential improvements on the cadence front?

summetj avatar May 11 '20 14:05 summetj

@diarmidmackenzie can we close this out?

Patrick-Erichsen avatar May 22 '20 17:05 Patrick-Erichsen