govuk-prototype-kit Measure how many users have updated to V11

What

We want to know how many users (%) have updated their prototype kit to V11.

We think searching GitHub repositories could give us the answer. We may want to repurpose https://github.com/x-govuk/govuk-frontend-component-stats

Why

So we have confidence that the work we are doing is effective and our users are less at risk of the recent security incident.

Who needs to work on this

PA, Dev

Done when

[x] https://github.com/alphagov/design-system-team-internal/issues/554
[x] https://github.com/alphagov/design-system-team-internal/issues/555
[x] Complete DPIA and send to privacy team
[ ] Get permission from privacy team to do the work
[ ] Create script to get anonymous data from GitHub
[ ] Complete privacy office assessment of script

Nov 23 '21 12:11 trang-erskine

@SaraC19 we just added this to the coming sprint and will need your help on it. I think it might be similar to the work on finding out which components users are using.

Nov 23 '21 12:11 trang-erskine

I have done some initial work to scrape GitHub for a list of prototypes (found around 1,100) and get the version numbers for each. However I need some help improving the quality of the data and analysing the results.
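As a rough illustration of the analysis step, here is a minimal sketch that takes a list of version strings and works out what share are on v11 or later. The function names and the sample values are assumptions made up for illustration; this is not the actual scraping script and the numbers are not real data.

```python
from typing import Iterable, Optional


def major_version(version: str) -> Optional[int]:
    """Return the major version from a string like '11.0.0' or 'v9.14.2'.

    Returns None when the string cannot be parsed, so that bad data can be
    excluded explicitly rather than being miscounted.
    """
    cleaned = version.strip().lstrip("vV")
    head = cleaned.split(".", 1)[0]
    return int(head) if head.isdecimal() else None


def percent_on_or_above(versions: Iterable[str], target_major: int = 11) -> float:
    """Percentage of parseable versions whose major version is >= target_major."""
    majors = [m for m in map(major_version, versions) if m is not None]
    if not majors:
        return 0.0
    updated = sum(1 for m in majors if m >= target_major)
    return 100.0 * updated / len(majors)


# Hypothetical sample values, not real data
sample = ["11.0.0", "v9.14.2", "10.0.0", "11.0.0", "unknown"]
print(f"{percent_on_or_above(sample):.1f}% on v11 or later")
```

Unparseable versions are left out of the denominator here; whether that is the right choice depends on how many there are, which is one of the data quality questions.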

Nov 26 '21 12:11 lfdebrux

@lfdebrux has done some seriously very cool stuff with the data. I'm going to play around with it to see if I can clean it up a bit. I'll also make some notes of all the inevitable caveats of the data, but Laurence has done a huge amount of work on it.

Nov 29 '21 13:11 SaraC19

We shared some preliminary notes on this work with the team in a meeting yesterday (some meeting notes on Padlet).

One thing we discussed in a little detail was the data privacy aspects, especially the following questions:

Is it okay to store data scraped from GitHub like this?
Is there any PII we should avoid storing? For instance, commit email addresses. What about usernames?
We'd like to be able to create issues/pull requests on outdated repos asking/helping them to update. Is this okay?
Are there any things we shouldn't do with this data? For instance, GitHub creates an email address for each user that can be used to email them directly, but we think we probably shouldn't use this. Are there other similar things we should avoid, like combining this data with other data?
Can we share this data beyond the team? What about analytics derived from this data?

We probably need to talk to the Privacy Office or Information Assurance about this, @SaraC19 @trang-erskine would you be interested in joining a conversation/meeting?

Dec 02 '21 09:12 lfdebrux

I had a chat with a data privacy officer yesterday, they're going to go away and think about it.

Dec 03 '21 12:12 lfdebrux

I've written up the problem, solution and what we want in a document. I've also outlined the suggested next steps. What we need is currently sat in the database that you've put together, so I'll clean up the data now and then we're pretty much good to go in showing this to IA.

Dec 08 '21 14:12 SaraC19

I've popped this spreadsheet together with the metrics on it that I've listed that we need in the write up. I've also linked it up to Data Studio to play with the data and see what the shape of the data is looking like. The filters will need work as we'll need to decide what we actually want the filters to be and what we want to look at, but that can come later.

Dec 10 '21 16:12 SaraC19

@lfdebrux has started to write up a DPIA and has booked a meeting with the data privacy officer for Monday 13 Dec to follow up with some questions. I'll join so I can catch up and see how I can help.

Dec 10 '21 17:12 SaraC19

We (Fadzai, Sara, and I) just had another chat about this, summary follows.

We debated the meaning of the GitHub Acceptable Use Policy section 6 "Information Usage Restrictions", which states that you may use information from the service for research or archival purposes. I made the case that this does not restrict User Generated Content, which is treated separately in the GitHub Terms of Service section D.3 "Ownership of Content, Right to Post, and License Grants". (Note: I realised just as I was writing this that the AUP section 6 doesn't say "only", so maybe it isn't restrictive?) Laurence agreed to email GitHub for clarification on this.
Fadzai is generally concerned about whether the GitHub policies allow for this sort of data collection, and has reached out to colleagues from information assurance and CDIO asking for other opinions.
There was quite a lot of back and forth on whether usernames are personal data (I think they are, based on reading the GitHub Privacy Statement "What information GitHub collects"), and whether we can pseudonymise or anonymise them. I explained that we need an owner name (which is a username) and repository name to identify each project uniquely. Fadzai pointed out that even if we removed the owner name from the data you could still identify people from the repository name if the repository had a unique name on GitHub.
We are unsure about the legal basis. Fadzai didn't think that it would fit into public task, so we would need to argue legitimate interest, but would check this again.
We had a look through question 1.2 on the DPIA form. We agreed that the project was NOT using new technology and was NOT using systematic monitoring, but were unsure about whether it would be considered large scale data processing or not. Fadzai suggested that if we were not using personal data then the DPIA would be very light touch.

Next steps are for Sara and I to look at how we can limit scope to reduce impact, Fadzai to get responses from other teams, and then we'll have a meeting.
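To make the pseudonymisation option from the discussion above a bit more concrete, here is a minimal sketch using a keyed hash (HMAC-SHA256) of the owner and repository name. The function name, the key handling and the example values are all assumptions for illustration, not something we have agreed or built.

```python
import hashlib
import hmac


def pseudonymise_repo(owner: str, repo: str, secret_key: bytes) -> str:
    """Return a stable pseudonymous identifier for 'owner/repo'.

    The same owner, repo and key always give the same identifier, so each
    project can still be counted once without storing the username.
    """
    # GitHub owner and repository names are case-insensitive, so normalise first
    full_name = f"{owner}/{repo}".lower().encode("utf-8")
    return hmac.new(secret_key, full_name, hashlib.sha256).hexdigest()


# Hypothetical values; a real key would need to be generated and stored securely
key = b"example-key-not-for-real-use"
print(pseudonymise_repo("example-owner", "example-prototype", key))
```

This is pseudonymisation rather than anonymisation: anyone holding the key and a list of candidate repositories could recompute the identifiers, so the output would probably still count as personal data. It also only helps if the plain repository name is not stored alongside it, given Fadzai's point about unique repository names.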

Dec 13 '21 15:12 SaraC19

I've sent a support request to GitHub asking about the Acceptable Use Policy:

https://support.github.com/ticket/personal/0/1428277

Dec 15 '21 09:12 lfdebrux

@SaraC19 and I just had a chat with @zilnhoj about getting access to BigQuery, where GDS (including GOV.UK) stores a lot of its analytics data. It would be good if we could also store the data we collect there, so we know it's secure and held centrally. I've added a few tickets about this to the project board.

This ticket itself is turning into a bit of an epic :/ We're thinking we should keep the scope down to making a script that collects anonymous data, although we'll still need to do an assessment with the privacy office for that. Maybe we should create another ticket to capture thoughts about any future GitHub analytics work.
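One way the reduced scope could look is sketched below: the script keeps only a count per major version and discards the owner and repository names straight away. The record shape, field names and sample values are assumptions for illustration, and the privacy office would still need to assess whatever we actually build.

```python
from collections import Counter
from typing import Dict, Iterable


def aggregate_versions(records: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Count repositories per major version, keeping no identifiers.

    Only the 'version' field of each record is read; owner and repository
    names never make it into the output.
    """
    counts: Counter = Counter()
    for record in records:
        version = record.get("version", "").strip().lstrip("vV")
        major = version.split(".", 1)[0]
        counts[major if major.isdecimal() else "unknown"] += 1
    return dict(counts)


# Hypothetical records, not real data
records = [
    {"owner": "example-owner-a", "repo": "prototype-one", "version": "11.0.0"},
    {"owner": "example-owner-b", "repo": "prototype-two", "version": "v9.14.2"},
    {"owner": "example-owner-c", "repo": "prototype-three", "version": "11.0.0"},
    {"owner": "example-owner-d", "repo": "prototype-four", "version": ""},
]
print(aggregate_versions(records))
```

The trade-off is that aggregated counts can't be used later to find or contact the owners of outdated prototypes, which is part of why that idea probably belongs in a separate ticket.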

Dec 17 '21 11:12 lfdebrux

Trang, Laurence and I gave an unofficial deadline of end of this week (Friday 19 Feb) to finish the DPIA for privacy to get this moving. I've finished up the DPIA (if you'd like access, let me know and I'll grant it). Laurence to add some parts to explain more about the script and then he and Trang to review the whole document.

We need to clarify whether this particular DPIA should also cover getting permission to contact users from GitHub.

Feb 17 '22 09:02 SaraC19

The DPIA has been sent to Fadzai and we are waiting for feedback.

Feb 21 '22 12:02 SaraC19

Followed up with Fadzai who today said:

Information Assurance has been asked to review the processing as well and we are waiting for them to get back to us. They are generally a bit busy so I'm not sure how long they will be.

So now we are waiting to hear back from IA and Fadzai.

Mar 04 '22 09:03 SaraC19

Fadzai got back to us.

I think I mentioned this before, that I don't see how this can be classified as legitimate interest - having the contacts for reporting security breaches AS WELL AS to undergo analytics to improve performance are not the same thing and do not align. ... Can you complete the legitimate interest test and along with the DPIA, I will pass it to Murat for review. ... The following will need to be done if this goes forward.

privacy notice would need to be updated as a necessity (for new users)

change of terms notice would need to be created as a necessity (for existing users)

@trang-erskine is going to look at completing the legitimate interest test. Until that's done we'll have to put this in blocked.

May 18 '22 10:05 lfdebrux
