mozilla-vpn-client
VPN-5860: Improve health metrics
Description
The first round of health metrics from Super Dooper dashboards needed improvement:
- We were spamming Glean with events when connections were down, which was too noisy for useful engineering and data insights.
- It was still hard to answer questions like "how long are connections stable for?"
This PR attempts to address those issues. The current solution was designed in collaboration with Yeonjoo (data) and Santiago (product). We're adding 9 new metrics and removing a bunch as well. We think some subset of these 9 will be most useful, but aren't quite sure which. Thus, these metrics will all live for about 6 months; we'll then evaluate the data and decide which ones to keep in perpetuity.
These metrics are reworked both in the app and in the Android daemon.
The metrics (one of each type below for the stable, unstable, and no-signal states, 9 in total):
- An event for when the connectivity changes to this state.
- A counter for when a health check returns this state, even if the connection was already in that state.
- A timing distribution for how long the connection stays in a given state. (This exists only on the Android daemon and desktop, as the mobile apps are frequently relaunched during a VPN session.)
The counters can be compared in relative terms, but should not be treated as markers of time. There is a race condition in the counter, so it cannot be relied upon to produce exactly one count per second. In many cases, we run a health check because the timer timed out; then, milliseconds later, a ping returns and we run another health check in response. However, if the ping had returned before the timeout, the timer would have been cancelled and we would have recorded only one count. Fixing this is outside the scope of this PR, and the resulting data is still valuable.
I also renamed a couple of Android booleans, to make it clear whether the silent server switch (or "reconnect") is coming from the app or from the daemon.
I've tested this on macOS, Android, and iOS.
Reference
https://mozilla-hub.atlassian.net/browse/VPN-5860?focusedCommentId=825842
Checklist
- [x] My code follows the style guidelines for this project
- [x] I have not added any packages that contain high risk or unknown licenses (GPL, LGPL, MPL, etc. consult with DevOps if in question)
- [x] I have performed a self review of my own code
- [x] I have commented my code PARTICULARLY in hard to understand areas
- [x] I have added thorough tests where needed
Request for data collection review form
All questions are mandatory. You must receive review from a data steward peer on your responses to these questions before shipping new data collection.
(2 of these metrics were previously approved, so this form is for the one new one.)
- What questions will you answer with this data?
These metrics will help us learn how often VPN sessions have connectivity issues. How many times per VPN session do users experience connectivity issues? How long do those issues last?
- Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements?
We do not have great insight into some basic questions about our users: How often are they having issues? What sort of issues are they? How long do users keep the VPN active?
- What alternative methods did you consider to answer these questions? Why were they not sufficient?
We carefully considered which metrics were needed. We also had some prior metrics (which are removed in this PR), but they were not robust enough to answer the questions above.
- Can current instrumentation answer these questions?
No, unfortunately. See question 3 for more details.
- List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories found on the Mozilla wiki.
Note that the data steward reviewing your request will characterize your data collection based on the highest (and most sensitive) category.
| Measurement Name | Measurement Description | Data Collection Category | Tracking Bug # |
| --- | --- | --- | --- |
| changed_to_no_signal | Event that is recorded when connection status changes to no signal | technical | VPN-5860 |
| changed_to_unstable | Event that is recorded when connection status changes to unstable | technical | VPN-5860 |
| changed_to_stable | Event that is recorded when connection status changes to stable | technical | VPN-5860 |
| no_signal_count | Count of health checks which result in status no signal | technical | VPN-5860 |
| unstable_count | Count of health checks which result in status unstable | technical | VPN-5860 |
| stable_count | Count of health checks which result in status stable | technical | VPN-5860 |
| no_signal_time | Time distribution of how long connectivity status stays in no signal | technical | VPN-5860 |
| unstable_time | Time distribution of how long connectivity status stays in unstable | technical | VPN-5860 |
| stable_time | Time distribution of how long connectivity status stays in stable | technical | VPN-5860 |
- Please provide a link to the documentation for this data collection which describes the ultimate data set in a public, complete, and accurate way.
This collection is documented in the Glean Dictionary at https://dictionary.telemetry.mozilla.org/
- How long will this data be collected? Choose one of the following:
I want this data to be collected for 6 months initially (potentially renewable). (It's likely that after reviewing 6 months of data, a subset of these metrics will be collected in perpetuity, and some will be removed.)
- What populations will you measure?
No filters - all channels, countries, and locales, unless the user has opted out of data collection on that device.
- If this data collection is default on, what is the opt-out mechanism for users?
When launching the app for the first time, a user is given the option of whether to allow data collection. After this initial setup screen, a user can always toggle data collection permissions in the System Preferences screen.
- Please provide a general description of how you will analyze this data.
Dashboards that product managers and others will consult on a regular basis.
- Where do you intend to share the results of your analysis?
Within the Mozilla VPN team.
- Is there a third-party tool (i.e. not Glean or Telemetry) that you are proposing to use for this data collection?
No
Tagging @travis79 for data-review
Please add unit tests for all of the new collections or add to the description of this ticket why that is not possible at the moment with a follow up ticket for implementation of tests in the future.
For Android daemon, quoting a prior PR: "We cannot test classes that load ndk-based native code. So neither vpnservice (wireguard.so) and glean are good candidates for this type of tests." There is an existing ticket for this work.
For the app (and the daemon, as well), we don't have any unit tests that activate/deactivate the VPN, which is needed to test these. There is an existing ticket for better testing of this part of the codebase.
For the Android part, sorry I forgot I asked the exact same thing on your previous PR 😅
For the app (and the daemon, as well), we don't have any unit tests that activate/deactivate the VPN, which is needed to test these.
Given that the new code here is not in controller.cpp specifically, but in connectionhealth.cpp and telemetry.cpp, I think it is still worth attempting to unit test it. ConnectionHealth is a testable class with examples: https://github.com/mozilla-mobile/mozilla-vpn-client/blob/main/tests/unit/testconnectionhealth.cpp. I believe telemetry.cpp would be testable as well, although there are no tests for it at the moment. I'd be happy to pair on this if you'd like.
@brizental ready for another review.
Given the non-standard calls in those tests, we're getting some weird data back for the timespans. I've confirmed this looks clean in the debug pings viewer, and I'm inclined to not spend more time working on these tests (especially with the broken .count).
@brizental I believe I've addressed all your comments.
Testing the number of timespans fails about 10% of the time. When it fails, it's because the test setup has created an extra stable period. I'm not quite sure why this is happening. You'll see I've removed a silent server switch when testing, but I can't figure out what else might be causing this. Using the Glean debug viewer, I've yet to see any instances of this happening in the actual app, so my hunch is that this is a peculiarity of the testing setup. Do you see anything that might be causing this?
Working on redoing these tests, but need to close the laptop. Will finish this up tomorrow.
@brizental I believe I've addressed all comments, and this is ready for another round of review.
Uh oh! Looks like an error!
Client ID static/taskcluster/github does not have sufficient scopes and is missing the following scopes:
{
"AnyOf": [
"queue:rerun-task:mozillavpn-level-1/dzvSh6xGTTKx50AOADeraw/UIyjld8gQESu2C5OOo5vHA",
"queue:rerun-task-in-project:none",
{
"AllOf": [
"queue:rerun-task",
"assume:scheduler-id:mozillavpn-level-1/dzvSh6xGTTKx50AOADeraw"
]
}
]
}
This request requires the client to satisfy the following scope expression:
{
"AnyOf": [
"queue:rerun-task:mozillavpn-level-1/dzvSh6xGTTKx50AOADeraw/UIyjld8gQESu2C5OOo5vHA",
"queue:rerun-task-in-project:none",
{
"AllOf": [
"queue:rerun-task",
"assume:scheduler-id:mozillavpn-level-1/dzvSh6xGTTKx50AOADeraw"
]
}
]
}
- method: rerunTask
- errorCode: InsufficientScopes
- statusCode: 403
- time: 2024-03-08T16:38:02.017Z