radiocells-scanner-android
Privacy enhancements
I'm opening this thread to further discuss the privacy issues @gdt brought up in #127. I agree with @gdt and @wish7code that aggregating the cells from a session into one group should improve privacy. Let's see what we can do on the WLAN side.
Some other thoughts:
- Why do we need logins to upload data? Let's make the login optional (like in MozStumbler). That way everyone could decide whether they want to appear in the rankings or stay relatively anonymous.
- Connect over HTTPS from the Radiobeacon app. I think the client should be updated ASAP. This would prevent snooping on the data during the upload and protect our passwords. I bet some people chose the same password as for their e-mail accounts...
- Is the data provided to everyone exactly the same as what we send to the project's servers? Or is it somehow anonymised? I mean, could someone create a movement profile of me as well? To protect myself, I have begun to upload data only when I am at home.
Like @gdt I do care very much about my privacy and would love to see some fast and beneficial improvements here.
Why do we need logins to upload
There's already an anonymous upload endpoint server-side, which we set up some time ago for testing purposes. Let's integrate that client-side!
Connect over HTTPS from the Radiobeacon app.
Should be feasible too, although we might need a fallback mechanism based on the Android version. I remembered that Let's Encrypt certificates work with Android >4.0, but I might be completely wrong, as according to https://community.letsencrypt.org/t/which-browsers-and-operating-systems-support-lets-encrypt/4394 it should work with >2.3.6. Will test...
Or is it somehow anonymised?
It is anonymised in the sense that we never expose user names in the downloads. You might find some user names in data from before 2014, but in those cases the users willingly decided to expose their user names to improve their upload stats. With the anonymous upload option, even a bad server wouldn't have a chance to create movement profiles.
Nevertheless, since Greg and you are privacy-aware individuals, I'd like to leave the final (personal) risk assessment up to you. So have a look at the actual data at http://radiocells.org/static/wifis2016_raw.tar.gz and cells2016_raw.tar.gz. Personally, I don't consider this kind of data a risk to my individual privacy.
Any (state) attacker would use cheaper IMSI catchers, wifi and bluetooth trackers and the like :-)
Starting with HTTPS support @ ea55679ecc
I think the biggest privacy issue is that the scan reports are basically a tracklog. There are various ways to mitigate this:
- Only upload the strongest observation for each cell.
- Delay upload (or let the user delay it by choosing when to upload), and for each session, set the start time to the most recent Sunday 0000Z, to hide the time of actual travel. This hides information about when the user travels, which is half the sensitive data.
- For each session, pick a new random value X in [1, 1.5). Have a configured threshold D (1-5 km). Allow multiple "don't scan" locations. For each "don't scan" location, for each location that is the start or end of a session, and for each location where the user stops for more than 120 s, omit all points that are within X*D. That will get rid of a lot of data, but it will hide the places that the user actually goes, which for me is the other half of the interesting/sensitive data.
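
A rough sketch of the third mitigation, to make the X*D rule concrete. This is only one possible reading of it: the function names, the tuple layout `(timestamp_s, lat, lon)` and the 50 m radius used to decide that the user "stopped" are assumptions made for illustration, not anything taken from Radiobeacon.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def find_stops(points, min_stop_s=120, stop_radius_m=50.0):
    """Return (lat, lon) of places where the user lingered longer than min_stop_s.

    points: list of (timestamp_s, lat, lon), sorted by time.
    stop_radius_m is an assumed tolerance for "not moving".
    """
    stops = []
    i = 0
    while i < len(points):
        t0, lat0, lon0 = points[i]
        j = i
        # Extend the cluster while the following points stay close to its first point.
        while j + 1 < len(points) and haversine_m(
            lat0, lon0, points[j + 1][1], points[j + 1][2]
        ) <= stop_radius_m:
            j += 1
        if points[j][0] - t0 > min_stop_s:
            stops.append((lat0, lon0))
        i = j + 1
    return stops


def scrub_session(points, dont_scan, threshold_m=1000.0, rng=random):
    """Drop every point within X*D of a sensitive location.

    dont_scan: list of (lat, lon) configured by the user.
    threshold_m: the configured distance D (1-5 km in the proposal).
    """
    if not points:
        return []
    x = 1.0 + 0.5 * rng.random()  # fresh X for each session, roughly uniform in [1, 1.5)
    radius_m = x * threshold_m
    centres = list(dont_scan)
    centres.append(points[0][1:3])   # session start
    centres.append(points[-1][1:3])  # session end
    centres.extend(find_stops(points))
    return [
        p for p in points
        if all(haversine_m(p[1], p[2], c[0], c[1]) > radius_m for c in centres)
    ]
```

Because X is drawn once per session, the hole around a frequently visited place changes size from session to session instead of having one fixed edge.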
Arguably, if the latter two are done, the cell max-only scheme isn't needed, and the max-only scheme doesn't really work. The point here is to enable even the paranoid to feel comfortable scanning all the time.
Another point is that "anonymous uploads" aren't good enough. Once you can get a track and match it to a user at one end, you can match up the rest. So while I support anonymous uploads, I don't think they are close to enough.
I realize all of this can be done in the client, so there's nothing stopping me but coding itself...
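
For what it's worth, the first two mitigations are small on the client side. Below is a minimal sketch, assuming observations are plain dicts with hypothetical keys `cell_id` and `dbm`, and assuming "start time at most recent Sunday 0000Z" means shifting the whole session so that it begins at that instant while keeping the spacing between points. None of these names come from the actual app.

```python
from datetime import datetime, timedelta, timezone


def most_recent_sunday_utc(ts):
    """Floor a timezone-aware datetime to the most recent Sunday 00:00 UTC."""
    ts = ts.astimezone(timezone.utc)
    days_since_sunday = (ts.weekday() + 1) % 7  # weekday(): Monday=0 ... Sunday=6
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=days_since_sunday)


def shift_session(timestamps):
    """Shift all timestamps so the session starts at the most recent Sunday 0000Z."""
    if not timestamps:
        return []
    start = min(timestamps)
    delta = start - most_recent_sunday_utc(start)
    return [t - delta for t in timestamps]


def strongest_per_cell(observations):
    """Keep only the strongest observation (highest dBm) for each cell."""
    best = {}
    for obs in observations:
        current = best.get(obs["cell_id"])
        if current is None or obs["dbm"] > current["dbm"]:
            best[obs["cell_id"]] = obs
    return list(best.values())
```

Shifting keeps the relative timing inside a session, so it only hides when the trip happened, not how long it took.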
A few remarks from my side: I do understand that privacy is an issue, as matching a log to a user leads to a detailed profile of where that person went while they were logging. But I also see the other side – that of producing a good cell/wifi coverage map.
- First of all, as @wish7code mentioned to me a while back, the link between a user account and log files is cut the very moment the uploaded file is imported into the database and the user stats have been updated. Once the upload is finished, the logs are basically anonymous.
- Nonetheless, the logs do contain some information that might allow conclusions regarding the user who uploaded them: they contain the make and model of cell phone, Android version and the Radiobeacon versions used to record and upload the log. (Android version is just a number – you can't tell from the logs whether I'm running the stock ROM or a custom one.) Your carrier can also be inferred based on the assumption that in your home country, your phone will only pick up your carrier's cells.
- Besides that, every log contains accurately timestamped position information. Each session is broken down into multiple files, but each file is a piece of a track log.
Not having access to raw measurements would also prevent a couple of things which are needed to make a good map:
- Having raw measurements available allows me to set up my own processing and try out different ways of putting individual measurements together. Think about improving position estimates, reliably detecting moving wifis or wifis that have moved permanently and more. You can't fix these errors in a processed wifi catalog.
- Knowing the hardware and software configuration with which a log was collected is needed to weed out or fix bad data in case of a known bug. For example, very early builds of Radiobeacon had latitude and longitude swapped – thanks to the version information in the log, it is easy to fix those logs.
- Fuzzing timestamps makes it harder to determine when a wifi changed its position – instead of measurements jumping from A to B, they will jump back and forth between these two points for the duration of the tolerance period.
- The start and destination of my trips tend to be the areas in which I get the best measurements, as I get closer to the antennas than a mapper who just passes by on the street.
In a nutshell, there will always be trade-offs between data quality and privacy, and different mappers will have different views on where the line should be drawn, therefore mappers should be given the option to choose how much to share. As for myself, I am well aware of what I'm sharing and have knowingly agreed for this raw data to be released under an open license. If I don't want anyone to know I've been at a particular place at a particular time, I don't log it.
OpenStreetMap has a similar issue when dealing with GPS tracks, and they have resolved it by introducing multiple privacy levels, described here: http://wiki.openstreetmap.org/wiki/Visibility_of_GPS_traces. While the general privacy situation on OSM is similar, some things are different:
- On OSM, there's always the human element between raw GPS tracks and the finished map. One can map using GPS tracks that never get uploaded, or without any GPS tracks at all. Our cell/wifi catalog is the immediate result of our raw measurements.
- OSM doesn't identify the device configuration with which the traces were captured.
- GPS traces can be captured and edited with a much wider range of devices. A savvy user can do some processing prior to uploading, such as eliminating tracks around their home or destinations visited.
My suggestion would therefore be:
- Make all of the following privacy choices user-selectable.
- Suppress device information in the logs.
- Maintain a dynamic location blacklist (when the device remains within a certain range over a certain period of time, also user-selectable).
- Add a random offset to each blacklisted location, proportional to distance. We might want to keep each location's offset fixed: if a location is visited frequently and the offset were determined dynamically each time, the "hole" around it would slowly get narrower, with the target right in the center.
- Randomize timestamps by adding a random offset. The maximum offset needs to be discussed – not too short (to protect privacy) and not too long (to preserve data quality), or it could be user-configurable. Timestamps in file names should be randomized in the same manner, to be consistent with file contents. In any case, files with fuzzed timestamps should be marked as such (with a value that is guaranteed to be greater than the offset), so that any time-based processing (e.g. detection of moving wifis) can consider the inaccuracy of these timestamps.
- Break the chronological order of locations: either add a different offset to each timestamp, or set all timestamps to a fixed value determined by the start and end of the session (or the chunk of the session in that particular log file). When breaking a log into multiple log files, use location (rectangles or polygons) rather than timestamps for grouping locations.
- Never record the time of export or upload – they serve no purpose for wifi catalog processing.
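
To illustrate two of the suggestions above, here is a small sketch. The fixed per-location offset is derived from an HMAC over the location with a device-local secret; that is just one way to get an offset that stays the same on every visit without storing it, not something the proposal prescribes. All names (`fixed_offset_m`, `fuzz_timestamps`, `salt`, the 300 m default) are made up for the example.

```python
import hashlib
import hmac
import math
import random


def fixed_offset_m(salt, lat, lon, max_offset_m=300.0):
    """Deterministic pseudo-random (east, north) offset in metres for one location.

    salt: secret bytes that never leave the device (assumption).
    The same salt and location always yield the same offset, so repeated
    visits do not gradually narrow the "hole" around the real position.
    """
    key = f"{lat:.5f},{lon:.5f}".encode()
    digest = hmac.new(salt, key, hashlib.sha256).digest()
    angle = int.from_bytes(digest[:8], "big") / 2**64 * 2 * math.pi
    dist = int.from_bytes(digest[8:16], "big") / 2**64 * max_offset_m
    return dist * math.cos(angle), dist * math.sin(angle)


def fuzz_timestamps(timestamps_s, max_offset_s, rng=random):
    """Shift one log chunk by a single random offset.

    Returns (fuzzed_timestamps, tolerance_s). tolerance_s is an upper bound
    on the absolute offset and is meant to be stored with the file, so that
    time-based processing knows how inaccurate the timestamps may be.
    """
    offset = rng.uniform(-max_offset_s, max_offset_s)
    return [t + offset for t in timestamps_s], max_offset_s


def flatten_timestamps(timestamps_s):
    """Alternative: replace every timestamp with the midpoint of the chunk."""
    if not timestamps_s:
        return []
    mid = (min(timestamps_s) + max(timestamps_s)) / 2
    return [mid] * len(timestamps_s)
```

The same offset and tolerance would also have to be applied to the timestamp in the file name, so that name and contents stay consistent.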
I say if you have the ability to send a pull request, stop talking. There's been a lot of bikeshedding recently.
I'd welcome pull requests from EVERYONE. The problem on my side is that I currently lack the time to move this issue forward.
@agilob Really, be fair! In fact, @mvglasow is the only one who has ever sent pull requests, and I highly appreciate all his detailed and well-thought-out improvement ideas, which served as blueprints for many of our previous implementations.