govuk-prototype-kit
Measure how many users have updated to V11
What
We want to know how many users (%) have updated their prototype kit to V11.
We think searching GitHub repositories could give us the answer. We may want to repurpose https://github.com/x-govuk/govuk-frontend-component-stats
Why
So we have confidence that the work we are doing is effective and that our users are less at risk from the recent security incident.
Who needs to work on this
PA, Dev
Done when
- [x] https://github.com/alphagov/design-system-team-internal/issues/554
- [x] https://github.com/alphagov/design-system-team-internal/issues/555
- [x] Complete DPIA and send to privacy team
- [ ] Get permission from privacy team to do the work
- [ ] Create script to get anonymous data from GitHub
- [ ] Complete privacy office assessment of script
@SaraC19 we just added this to this coming sprint and will need your help on this. I think it might be similar to the work on which components users are using.
I have done some initial work to scrape GitHub for a list of prototypes (found around 1,100) and get the version numbers for each. However I need some help improving the quality of the data and analysing the results.
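For anyone picking this up, here is a minimal sketch of the "what % are on V11" calculation, not the actual script. It assumes each scraped prototype gives us the text of its `package.json`, and that the kit version is recorded in the top-level `version` field; both are assumptions that need checking against the real data. The function names `kit_major_version` and `share_on_or_above` are made up for illustration.

```python
import json
from collections import Counter


def kit_major_version(package_json_text):
    """Return the major version found in a package.json string, or None.

    Assumption: the kit version lives in the top-level "version" field.
    Anything unparseable is treated as unknown rather than guessed.
    """
    try:
        data = json.loads(package_json_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    version = data.get("version")
    if not isinstance(version, str):
        return None
    major = version.lstrip("vV").split(".")[0]
    return int(major) if major.isdecimal() else None


def share_on_or_above(package_json_texts, target_major=11):
    """Return (percentage at or above target_major, counts per major version).

    Prototypes with an unknown version are left out of the percentage.
    """
    majors = [kit_major_version(text) for text in package_json_texts]
    known = [m for m in majors if m is not None]
    if not known:
        return 0.0, Counter()
    updated = sum(1 for m in known if m >= target_major)
    return 100.0 * updated / len(known), Counter(known)


if __name__ == "__main__":
    # Made-up sample data, not real scrape results.
    samples = [
        '{"name": "govuk-prototype-kit", "version": "11.0.0"}',
        '{"version": "9.15.0"}',
        'not valid json',
    ]
    percentage, counts = share_on_or_above(samples)
    print(f"{percentage:.1f}% on V11 or later", dict(counts))
```

Leaving unknown versions out of the denominator is one of the caveats worth writing down, since it could flatter or understate the real figure depending on which prototypes fail to parse.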
@lfdebrux has done some seriously very cool stuff with the data. I'm going to play around with it to see if I can clean it up a bit. I'll also make some notes of all the inevitable caveats of the data, but Laurence has done a huge amount of work on it.
We shared some preliminary notes on this work with the team in a meeting yesterday (some meeting notes on Padlet).
One thing we discussed in a little detail was the data privacy aspects, especially the following questions:
- Is it okay to store data scraped from GitHub like this?
- Is there any PII we should avoid storing? For instance, commit email addresses. What about usernames?
- We'd like to be able to create issues/pull requests on outdated repos asking/helping them to update. Is this okay?
- Are there any things we shouldn't do with this data? For instance, GitHub creates an email address for each user that can be used to email them directly, but we think we probably shouldn't use this. Are there other similar things we should avoid, like combining this data with other data?
- Can we share this data beyond the team? What about analytics derived from this data?
We probably need to talk to the Privacy Office or Information Assurance about this, @SaraC19 @trang-erskine would you be interested in joining a conversation/meeting?
I had a chat with a data privacy officer yesterday, they're going to go away and think about it.
I've written up the problem, solution and what we want in a document. I've also outlined the suggested next steps. What we need is currently sat in the database that you've put together, so I'll clean up the data now and then we're pretty much good to go in showing this to IA.
I've put together this spreadsheet containing the metrics that I listed in the write-up as the ones we need. I've also linked it up to Data Studio to play with the data and see what shape it has. The filters will need work, as we'll need to decide what we actually want them to be and what we want to look at, but that can come later.
@lfdebrux has started to write up a DPIA and has booked a meeting with the data privacy officer for Monday 13 Dec to follow up with some questions. I'll join so I can catch up and see how I can help.
We (Fadzai, Sara, and I) just had another chat about this, summary follows.
- We debated the meaning of the GitHub Acceptable Use Policy section 6 "Information Usage Restrictions", which states that you may use information from the service for research or archival purposes. I made the case that this does not restrict User Generated Content, which is treated separately in the GitHub Terms of Service section D.3 "Ownership of Content, Right to Post, and License Grants". (Note: I realised just as I was writing this that the AUP section 6 doesn't say "only", so maybe it isn't restrictive?) Laurence agreed to email GitHub for clarification on this.
- Fadzai is generally concerned about whether the GitHub policies allow for this sort of data collection, and has reached out to colleagues from Information Assurance and CDIO asking for other opinions.
- There was quite a lot of back and forth on whether usernames are personal data (I think they are, based on reading the GitHub Privacy Statement, "What information GitHub collects"), and whether we can pseudonymise or anonymise them. I explained that we need an owner name (which is a username) and a repository name to identify each project uniquely. Fadzai pointed out that even if we removed the owner name from the data, you could still identify people from the repository name if the repository had a unique name on GitHub.
- We are unsure about the legal basis. Fadzai didn't think it would fit under public task, in which case we would need to argue legitimate interest, but Fadzai will check this again.
- We had a look through question 1.2 on the DPIA form. We agreed that the project was NOT using new technology and was NOT using systematic monitoring, but we were unsure whether it would be considered large scale data processing. Fadzai suggested that if we were not using personal data then the DPIA would be very light touch.
Next steps are for Sara and I to look at how we can limit scope to reduce impact, Fadzai to get responses from other teams, and then we'll have a meeting.
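To make the pseudonymisation discussion a bit more concrete, here is a rough sketch of one option: replacing the `owner/repo` pair with a keyed hash (HMAC-SHA256) so each project still has a stable unique identifier. This is an illustration under assumptions, not something we have agreed to do. The function name `pseudonymise_repo` and the idea of a team-held secret key are hypothetical.

```python
import hashlib
import hmac


def pseudonymise_repo(owner, repo, secret_key):
    """Return a stable hex identifier for owner/repo without storing either name.

    secret_key is bytes held only by the team (hypothetical). The names are
    lower-cased first because GitHub treats owner and repository names as
    case-insensitive.
    """
    identifier = f"{owner}/{repo}".lower().encode("utf-8")
    return hmac.new(secret_key, identifier, hashlib.sha256).hexdigest()


if __name__ == "__main__":
    key = b"example-key-do-not-use"  # placeholder, not a real secret
    print(pseudonymise_repo("example-owner", "example-prototype", key))
```

Note that this is pseudonymisation rather than anonymisation: anyone holding the key can recompute the identifier for a known repository, so the output would probably still count as personal data. It also does nothing about Fadzai's point that other stored fields could identify someone.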
I've sent a support request to GitHub asking about the Acceptable Use Policy:
https://support.github.com/ticket/personal/0/1428277
@SaraC19 and I just had a chat with @zilnhoj about getting access to BigQuery, where GDS (including GOV.UK) stores a lot of its analytics data. It would be good if we could also store the data we collect there, so we know it's secure and held centrally. I've added a few tickets about this to the project board.
This ticket itself is turning into a bit of an epic :/ We're thinking we should keep the scope down to making a script that collects anonymous data, although we'll still need to do an assessment with the privacy office for that. Maybe we should create another ticket to capture thoughts about any future GitHub analytics work.
Trang, Laurence and I gave an unofficial deadline of end of this week (Friday 19 Feb) to finish the DPIA for privacy to get this moving. I've finished up the DPIA (if you'd like access, let me know and I'll grant it). Laurence to add some parts to explain more about the script and then he and Trang to review the whole document.
We need to clarify whether this particular DPIA should also cover getting permission to contact users via GitHub.
The DPIA has been sent to Fadzai and we are waiting for feedback.
Followed up with Fadzai who today said:
> Information Assurance has been asked to review the processing as well and we are waiting for them to get back to us. They are generally a bit busy so I'm not sure how long they will be.

So now we are waiting to hear back from IA and Fadzai.
Fadzai got back to us.
> I think I mentioned this before, that I don't see how this can be classified as legitimate interest - having the contacts for reporting security breaches AS WELL AS to undergo analytics to improve performance are not the same thing and do not align. ... Can you complete the legitimate interest test and along with the DPIA, I will pass it to Murat for review. ... The following will need to be done if this goes forward.
> - privacy notice would need to be updated as a necessity (for new users)
> - change of terms notice would need to be created as a necessity (for existing users)

@trang-erskine is going to look at completing the legitimate interest test. Until that's done we'll have to put this in blocked.