GDPR: Make stats export more functional while keeping users safe
Recently there was a discussion about the stats export and the GDPR (among others, @davidpanderson and I took part in it).
We identified that the initial implementation of GDPR compliance was too strict and prevents data aggregators from showing correct statistics for the BOINC network. In particular, new users will never appear in any statistics export unless they enable stats export manually. However, the statistics exported by BOINC projects contain no personal information (with two exceptions explained below) that might identify a BOINC user in one way or another.
Thus I propose the following:
- [ ] Enable statistics export of all users and all hosts by default
- [ ] Rename the current statistics export option to 'Do not include personal information in the exported statistics' (better wording here is welcome)
- [ ] For users who don't want their personal information exported, omit the 'name' and 'url' fields (or leave them empty) in the user stats export (see the sketch below)
- [ ] For users who have their hosts hidden, omit the <userid> tag (or leave it empty) in the host stats export (mostly a duplicate of ticket #3766, which is closed)
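To make the intent concrete, here is a minimal sketch of how such a post-filter over the exported files could look. It assumes the user.xml / host.xml layout documented on the XmlStats wiki page and treats the opt-out sets as already computed elsewhere; it is an illustration only, not the actual db_dump implementation.

```python
# Minimal sketch of the proposed filtering, applied as a post-processing
# step over the exported stats files. Assumes the user.xml / host.xml layout
# documented on the XmlStats wiki page; the opt-out sets are assumed to be
# computed elsewhere (e.g. from the users' preference flags).
import xml.etree.ElementTree as ET


def filter_user_export(path_in, path_out, users_hiding_personal_info):
    """Drop 'name' and 'url' for users who opted out of exporting them."""
    tree = ET.parse(path_in)
    for user in tree.getroot().iter("user"):
        uid = int(user.findtext("id", "0"))
        if uid in users_hiding_personal_info:
            for tag in ("name", "url"):
                elem = user.find(tag)
                if elem is not None:
                    user.remove(elem)
    tree.write(path_out, encoding="utf-8")


def filter_host_export(path_in, path_out, users_hiding_hosts):
    """Drop the <userid> tag for hosts whose owners hide their hosts."""
    tree = ET.parse(path_in)
    for host in tree.getroot().iter("host"):
        uid = int(host.findtext("userid", "0"))
        if uid in users_hiding_hosts:
            elem = host.find("userid")
            if elem is not None:
                host.remove(elem)
    tree.write(path_out, encoding="utf-8")
```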
Have we decided what constitutes private information? For example: user name. When we ask for it, we call it 'screen name', which implies that it's public. It's shown on the project web site, and can't be hidden. Similar with URL, country, and user ID. We show these on the project web site - why not show them in stats export too?
We identified that the initial implementation of GDPR compliance was too strict and prevents data aggregators from showing correct statistics for the BOINC network.
Devil's advocate speaking now: "so what?". The aggregators aren't a necessary component that a BOINC project needs to fulfill its service. It's a totally independent entity that can't expect anything from a BOINC project or its registered users by default. Just to make sure: I'm not arguing about what might be nicer for the BOINC ecosystem or not; I'm just discussing this from a GDPR perspective, since that's the legal basis, whether we like it or not. More on the issues related to data transfers and defaults below.
Rename the current statistics export option to 'Do not include personal information in the exported statistics' (better wording here is welcome)
This at least needs to be accompanied by a privacy policy that details it the other way round: you have to say what you do export, not what you don't. As is, such an opt-out violates GDPR's "transparency" and "data protection by design and by default" principles.
We show these on the project web site - why not show them in stats export too?
There's a difference between agreeing to publish certain details on a single project one willingly signed up for and having those details transferred to various third parties - without consent. Opt-out doesn't constitute an informed consent. Also, from the data controller's (i.e. project's) point of view, what is the legal basis for this data transfer, if not consent? Legitimate interest? Very unlikely (see first statement above).
Furthermore, exactly what kind of entity is a stats site (or any stats export recipient) in the GDPR framework? Since a project ("controller") transfers data for a kind of service augmentation (or does it really?), it's probably a "processor", potentially in a third country or even outside the EU. Just have a look at the can of worms that already opens.
Have we decided what constitutes private information?
This isn't really up to us to decide; there's a legal definition for it. Screen name, URL, country and user ID can potentially be used to identify a natural person, indirectly or even directly. So we need to tread carefully here since that definition is, as with many legal definitions, not crystal clear or tailored to every single use case. How could it be? That's why we have courts after all...
As a project I don't want to put myself into a tenuous position just by hoping that my definition of personal data is correct. If you think that's overly cautious, let me tell you that there's an industry out there whose sole business it is to find data controllers with loopholes in their data processing, privacy policy, etc. - and sue them for profit, which works because of GDPR's large fines. This is the reason why we have such an elaborate, yet transparent and comprehensible privacy policy.
Honestly, coming back to the beginning, why should I take that risk? For the data aggregators only? I mean, if my users want to export their data, they can and will do it, by informed consent. If not, then they don't want to, or they simply don't care. How does that constitute a problem? In other words: what's the actual problem you're trying to solve that's worth wandering into such treacherous territory?
@brevilo
This at least needs to be accompanied by a privacy policy that details it the other way round: you have to say what you do export, not what you don't
That is a very good point. We have a document that describes the data we are exporting: https://github.com/BOINC/boinc/wiki/XmlStats. In this particular ticket I highlighted the exact parts that we're going to change.
This isn't really up to us to decide; there's a legal definition for it.
This is true; however, the data we're exporting is non-personal data, and it can't be used to identify the person it belongs to. You need users' consent to export personal data.
If you can point out which data, from your perspective, is personal data besides the ones I already mentioned, please do so.
In other words: what's the actual problem you're trying to solve that's worth wandering into such treacherous territory?
Currently, the way GDPR compliance was implemented prevents us from seeing the full picture of our users. The way BOINC is designed, each project is a distributed network, and we can't see who our users are without asking them to export data that is not sensitive personal data: we don't see how many users have a particular OS, etc. You can't identify a user by their stats, and you can't do that using the CPID either; there is no connection between the data we have and any personal data of our users.
@davidpanderson
Have we decided what constitutes private information?
The data I highlighted is the data that could contain personal information about our users. That doesn't mean it does, but it might, and that's why it's important not to export it by default.
In my understanding, the GDPR doesn't allow an "opt-out": any personal data which is publicly visible, and in particular shared externally, needs to be hidden by default. "Personal data" here means anything that could possibly be traced back to or help identify a person.
I don't think it is a "valid interest" of "aggregators" to gather information about each and every host or user that doesn't want to share it, not even for the time between signing up for a project and finding out how to hide his information. Aggregators will not delete any such information once they have it (unless explicitly asked for, which is another hurdle).
What exactly is the goal here?
If the goal is just to gather statistics on, e.g., the number of hosts, users, total credit and RAC of a project or the whole of BOINC, projects could publish those aggregated statistics and stats sites could show them without violating the GDPR, as these allow no tracing back to individuals (well, for projects with a reasonable number of participants).
If you can point out which data, from your perspective, is personal data besides the ones I already mentioned, please do so.
What I'm trying to get across is that this isn't only about those data that I already deem to be personal data, but also those that might become personal data when combined with any other data out there. For example, for the user stats I would exclude not just name and url but also id, country, cpid, teamid and has_profile. I'm not saying those data clearly are personal data but I can't confidently deny they could ever be used to help identify a person. There have been enough examples of such cases, even without LLMs and security/data breaches.
Currently, the way GDPR compliance was implemented prevents us from seeing the full picture of our users. The way BOINC is designed, each project is a distributed network, and we can't see who our users are
Understood, but in terms of GDPR this is irrelevant. I understand "us" and "we" in your statement as referring to the BOINC community as a whole or the BOINC (software) project itself. Please understand that those aren't the data controllers. It's the individual projects who are the data controllers and who thus carry all duties, responsibilities and legal consequences. Strictly speaking, I could even argue that the current proposal might loosen important legal obligations for projects downstream, potentially without them being fully aware.
I don't think it is a "valid interest" of "aggregators" to gather information about each and every host or user
The correct GDPR term would be "legitimate interest" but, as I said above, that still misses the point. They could gather their own data, on their own legal basis. But what we're discussing here is that the projects transfer those data to them. That's an entirely different scenario.
projects could publish those aggregated statistics and stats sites could show them without violating the GDPR
Exactly. And that's what we do. And we add those individual users who gave their informed consent. I still think that this (the current situation) is most in line with the principles of the GDPR.
projects could publish those aggregated statistics and stats sites could show them without violating the GDPR
Exactly. And that's what we do.
I don't think so, at least not in the stats export. Currently this includes only hosts and users who have given their explicit consent, but no overall project statistics.
Einstein@Home publishes some statistics on the server status page, but this requires additional knowledge (e.g. how the computing power is actually derived from the project's RAC) and is not standard among projects.
@brevilo,
id, cpid, team_id, etc. can't be used to get the personal data of any user. These are not an SSN or anything similar; they are just identifiers that identify a set of data but not the person behind it
Regardless of the GDPR I find it nowadays questionable to require anyone to share any information that he might not want to share (for whatever reason), here in BOINC not only with the single project he is in direct contact with, but also beyond that. At least for him this is linked to him personally, and if e.g. it's a pretty unusual host, it can still be traced back to him.
Instead of opening all information that we consider not to be personal to virtually everyone, at the risk of violating the GDPR or scaring people away, I'd rather like to know what exact information (OS? CPU?) on what level (project, BOINC) is lacking, and find a way to collect it in a way that is certainly compliant with the GDPR and individual people's preferences.
So, what information do you think is necessary and missing? (@brevilo: What's the official equivalent for "Datensparsamkeit"?)
I think my position boils down to: BOINC is volunteer computing, and if we want to retain volunteers, we should try to satisfy their wishes and needs before ours (here: do what we might be allowed to).
computing power is actually derived from the project's RAC
Which is reverse engineering at its worst, because RAC (and credit as a whole) is neither controlled nor normalised.
just identifiers that identify a set of data but not the person behind it
User ID and Team ID together could be used, in most cases, to identify the user name and team name. If the team allows open joining (and many of the big ones do), a bad actor could join the same team and see that member's postings in the team message board - where, in my experience, people may feel more relaxed in disclosing personal information. My team has certainly organised "in real life" meet-ups for drinks in a pub, or weekends in the hills.
Recently there was a discussion about the stats export and the GDPR (among others, @davidpanderson and I took part in it).
I was not party to those discussions. I am confused about both the context and the details. I would be grateful if proponents could address the following.
To get global stats, exporting aggregate data is sufficient, and removes GDPR risk. Who is proposing to change this, and why?
If we go beyond exporting aggregate data, then GDPR adds complications:
- Which part of stats data (example: CPID) is personal data? (The only way to know for sure is via a court decision.)
- Relevant for defining "personal data": how might it be combined with other data (not necessarily from our projects) to identify individuals?
Since we cannot say for sure which data is "personal", the sensible approach is to stay on the safe side of the GDPR, which argues for opt-in rather than opt-out. This is also consistent with GDPR core principles such as transparency and data minimization. Another motivation to stay on the safe side: the GDPR imposes additional requirements on data transfers to third parties (such as stats sites), especially if they are outside the EU.
Cheers, Bruce
I don't think so, at least not in the stats export. Currently this includes only hosts and users who have given their explicit consent, but no overall project statistics.
@bema-aei Just to make sure, I'm talking about tables.xml which we do publish and which contains aggregate figures. Does that not include all users and hosts? For instance, nusers_total appears to match our "participants with credit" (on the SSP), so that would almost certainly include users who haven't opted in to the stats export, I think.
What's the official equivalent for "Datensparsamkeit"?
@bema-aei It's "data minimisation" in conjunction with "purpose limitation" and "storage limitation".
For instance, nusers_total appears to match our "participants with credit" (on the SSP), so that would almost certainly include users who haven't opted in to the stats export, I think.
Oh, you're right. I thought this was just the number of entries in user.xml (which it probably was before user.xml was filtered by consent). My bad. So there are already some aggregated statistics that we publish.
So again: what's missing and desired? And what for?
id, cpid, team_id, etc. can't be used to get the personal data of any user. These are not an SSN or anything similar; they are just identifiers that identify a set of data but not the person behind it
@AenBleidd Sorry, but I beg to differ. They probably aren't readily personal data on their own, but they might be used with other data, as others said before as well. With regard to the GDPR, unless I can prove that the cpid can't be used to help identify a person, I really want to treat it as (potentially) personal data and thus as confidential. Since I obviously can't prove that, that's what I'll do.
Also, like @bema-aei just said, I think we ought to ask those questions the other way round: why should I publish id and team_id anywhere? Those are internal identifiers that have no meaning on their own. Yet they will gain meaning when combined with other data in some way - including ways I haven't yet thought of.
Bottom line: we should only ever store, process and transfer data that serve a purpose (for the data controller's services) and that we (as projects, a.k.a. data controllers) have a legal basis for. That's the spirit and purpose of the GDPR, and adhering to it will make the difference if someone sues you. We as the ones bearing the actual responsibilities, together with experts in the field, have spent a considerable amount of time on this topic over the years.
I hope BOINC does not weaken the current implementation for the sake of (smaller) projects who can't afford dealing with this matter on their own on such level of detail. Please help them to reduce their attack surface as much as possible by keeping safe defaults.
OK, let's think about this: if we take an id, what kind of personal information could you get from it?
- Email address? No.
- Real name? No.
- Phone number? No.
- SSN? No.
- IP address? No.
These five are personal and sensitive data, but none of the other information is. You have a very unique host? You can hide it, and it won't be exposed. You need a unique anonymous id to avoid duplicates, and you can't use aggregated statistics alone, because 100 users of Project A and 100 users of Project B don't give you the real number of unique users.
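To make that last point concrete, here is a minimal sketch of how an aggregator could count unique users across two projects via the CPID without touching any personal data (assuming the per-project user.xml exports contain a cpid tag per user, as documented on the XmlStats wiki page; the file names are placeholders):

```python
# Minimal sketch: counting unique users across two projects via the CPID.
# Assumes each project's user.xml export contains one <cpid> per <user>
# record (per the XmlStats wiki); the file names are placeholders.
import xml.etree.ElementTree as ET


def cpids(path):
    """Return the set of cross-project IDs found in one project's user export."""
    return {
        user.findtext("cpid")
        for user in ET.parse(path).getroot().iter("user")
        if user.findtext("cpid")
    }


project_a = cpids("project_a_user.xml")
project_b = cpids("project_b_user.xml")

# 100 users in A plus 100 users in B may be far fewer than 200 real people.
print("Project A users:", len(project_a))
print("Project B users:", len(project_b))
print("Unique users across both:", len(project_a | project_b))
```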
@AenBleidd could you please elaborate on what statistical information you are missing and for what purpose you need that? You will have to specify that anyway for the data policy declaration. Remember that under the GDPR you are required not to collect and process (let alone publish) information that doesn't serve a legitimate and documented purpose.
After that we may discuss how to collect that without violating the GDPR or the wishes and standards of our volunteers.
Stretching and bending the regulations of the GDPR or its interpretation to suit (currently undisclosed and possibly even future) desires of some of us (which also aren't disclosed to me yet) is something I consider the wrong approach and that I don't really feel comfortable with.
@bema-aei, @brevilo
could you please elaborate on what statistical information you are missing and for what purpose you need that?
The answer is quite clear, from my point of view: we need to know:
- number of unique users
- number of unique devices/hosts
- number of users in a team/without a team
- world distribution (number of users in every country)
- age of the account (when the account was created)
- total amount of credits of the user/host
- average amount of credits of the user/host
- OS type of the host
- OS version of the host
- CPU type of the host
- Number of CPUs of the host
- GPU types of the host (if any)
- Number of GPUs on the host (if any)
- BOINC version of the host
- VirtualBox version (if installed) of the host
- RAM available on the host
- Hard Disk space available on the host
- User ID (index in the database, not unique across the projects)
- User CPID (needed to avoid data duplication, MD5, no personal data can be retrieved from it)
- Host ID (index in the database, not unique across the projects)
- Host CPID (needed to avoid data duplication, MD5, no personal data can be retrieved from it)
As you can see from the list above, none of this information can be used to identify a person. More importantly, all of this information (excluding User CPID and Host CPID) is already shown publicly on every project, and none of it really contains any personal data. The only field that is not in the list above but was mentioned in the original message is 'name', which is actually not the real name of the user (unless they put it there) but a screen name that could be literally anything. You can go to your profile and enter my name there, but that will not make the account mine, nor will it impersonate me in any way.
Remember that under the GDPR you are required not to collect and process (let alone publish) information that doesn't serve a legitimate and documented purpose.
We're not going to collect any new information, only what is already there and has been collected for years. And still, there is no personal and/or sensitive information in it.
unless I can prove that the cpid can't be used to help identify a person
BOINC doesn't collect any personal information, so you can't use the CPID to get any of it. The CPID is a unique identifier that only has meaning within BOINC, and since there is no personal information in BOINC, you can't identify a user by their CPID.
I think we ought to ask those questions the other way round: why should I publish id and team_id anywhere?
ID and TEAM_ID are just indexes in the database, and they are not even unique across projects. You can simply enumerate them and load the project's pages to get all the user profiles of that particular project. Exporting this data will not disclose anything that is not publicly available already.
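To illustrate that this is already reachable today, here is a hypothetical sketch of such an enumeration against the standard show_user.php page (the project URL and the ID range are placeholders, and any real crawl should of course respect the project's terms of use and rate limits):

```python
# Hypothetical sketch: the same profile data is already reachable by
# enumerating sequential user IDs on a project's public web pages.
# The project URL and the ID range are placeholders; a real crawl should
# respect the project's terms of use and rate limits.
import urllib.request

PROJECT_URL = "https://example-boinc-project.org"  # placeholder


def fetch_user_page(user_id):
    """Fetch the public show_user.php page for a given sequential user ID."""
    url = f"{PROJECT_URL}/show_user.php?userid={user_id}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Because IDs are just database indexes, walking them is trivial:
# for user_id in range(1, 100000):
#     html = fetch_user_page(user_id)
#     ... parse name, country, team, credit out of the HTML ...
```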
we should only ever store, process and transfer data that serve a purpose
That's a very good point! We have data that is anonymous and has a purpose. This information could be used by BOINC to get a clear picture of our user base and provide a better service, and it could also be used by third-party data aggregators to show valuable but still anonymous statistics (and possibly do some other useful things). Exposing this information doesn't open any vulnerabilities and can't be used to target any BOINC user in any way.
We as the ones bearing the actual responsibilities, together with experts in the field, have spent a considerable amount of time on this topic over the years.
At the time of the GDPR implementation, the regulation was read incorrectly and all information was treated as personal (including posts on the forums), but eventually it was clearly defined which information is personal and can't be exported and which is non-personal. All the data I listed above is non-personal and completely anonymous.
Which part of stats data (example: CPID) is personal data?
None of it: https://europa.eu/youreurope/business/dealing-with-customers/data-protection/data-protection-gdpr/index_en.htm#shortcut-2
User ID and Team ID together could be used, in most cases, to identify the user name and team name.
You don't need the exported data to get this information; you can just go to the project's web pages and scrape the data by iterating the ID and/or TEAM_ID from '0' to 'infinity'.
a bad actor could join the same team and see that member's postings in the team message board
A bad actor can do that without using the exported data, so the export gives them no additional personal information. E.g. you will know that one of the users has a host running Windows 10. So what? Will you target them with ads to buy macOS, or what?
What I have seen from the initial discussions here on the first implementation of the GDPR is that people had no understanding of what the GDPR is about. Now, years later, this topic has become clearer and more obvious. Even now I see that some of you are very scared about this, but if you dig a little deeper into the topic, you will clearly see that there is nothing scary at all and that the GDPR is not as strict as you think.
And please keep in mind one very important point: BOINC provides an open-source software solution, so you should treat this proposal as an optional recommendation. Yes, we plan to implement this change, but if any of you think it's too dangerous for you and you're too worried about the GDPR, you don't have to follow it and can patch it to behave exactly as before.
The answer is quite clear, from my point of view: we need to know:
While I can guess the reason for collecting some of that data across projects (RAC, number of hosts and users), the purpose of most of these is not clear to me. Why, e.g., does the internal ID of a host or user, which has absolutely no meaning outside the project, have to be exported elsewhere? Most of these data items make sense to me in the context of the project, mostly for assigning "work" to a host, but what is, e.g., the RAM or disk space available at a host's last scheduler contact needed for outside the project?
@bema-aei,
Why, e.g., does the internal ID of a host or user, which has absolutely no meaning outside the project, have to be exported elsewhere?
I might agree on this, but on the other hand this is completely anonymous information and can't do any harm. You could find a good use for it if needed (even if I currently don't see a good example of how it could be used).
but what is, e.g., the RAM or disk space available at a host's last scheduler contact needed for outside the project?
You can use the 'last scheduler contact' to get a list of the users that were active between two data exports, and thus see the dynamics (e.g. if you see that 100 people were active yesterday and 100 people are active today, that doesn't mean they are the same 100 people; maybe 80 are the same, 20 of them are gone, and 20 are completely new), as in the sketch below.
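A minimal sketch of that kind of churn calculation between two consecutive host exports (the rpc_time and host_cpid field names, and the file names, are assumptions here and should be checked against the actual export):

```python
# Minimal sketch: estimating churn between two consecutive host exports.
# Assumptions: each <host> record carries the owning user's cross-project ID
# (<host_cpid> is used as the stable key here) and a last-contact timestamp;
# the field name "rpc_time" and the file names are assumptions to verify.
import time
import xml.etree.ElementTree as ET

ONE_DAY = 24 * 3600


def active_keys(path, since):
    """CPIDs of hosts whose last scheduler contact is newer than `since`."""
    keys = set()
    for host in ET.parse(path).getroot().iter("host"):
        if float(host.findtext("rpc_time", "0")) >= since:
            keys.add(host.findtext("host_cpid"))
    return keys


yesterday = active_keys("host_day1.xml", since=time.time() - 2 * ONE_DAY)
today = active_keys("host_day2.xml", since=time.time() - ONE_DAY)

print("still active:", len(yesterday & today))  # e.g. the 80 who stayed
print("gone:        ", len(yesterday - today))  # the 20 who left
print("new:         ", len(today - yesterday))  # the 20 newcomers
```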
Speaking of RAM and hard disk space, imagine you want to run a completely new project and your application uses 10 GB of RAM: will there be enough hosts that could run it? Or if you need to store 100 GB of data, is there a sufficient number of users who could ever run this project's application?
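For example, such a capacity check over the host export could look like this (a minimal sketch; the m_nbytes and d_free field names, in bytes, are taken from BOINC host records, but whether a given project's export includes them is an assumption to verify):

```python
# Minimal sketch: checking how many exported hosts could run a hypothetical
# new application that needs 10 GB of RAM and 100 GB of free disk space.
# Assumption: the host export exposes RAM and free disk space in bytes via
# the m_nbytes and d_free fields (names taken from BOINC host records;
# whether a given project's export includes them should be verified).
import xml.etree.ElementTree as ET

GIB = 1024 ** 3
NEED_RAM = 10 * GIB
NEED_DISK = 100 * GIB

capable = 0
total = 0
for host in ET.parse("host.xml").getroot().iter("host"):
    total += 1
    ram = float(host.findtext("m_nbytes", "0"))
    disk_free = float(host.findtext("d_free", "0"))
    if ram >= NEED_RAM and disk_free >= NEED_DISK:
        capable += 1

print(f"{capable} of {total} hosts could run the planned application")
```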
The GDPR requires you to collect, process and in particular export only information that is needed for documented legitimate reasons and purpose. The possibility of a purpose you may think about in the future isn't a valid justification IMHO, and the fact that you can't think of any harm that it may do certainly isn't either.
The currently available RAM, and even more so disk space, doesn't help you much when deciding on a new application or project; there are, e.g., user preference settings that influence what is actually usable, etc., and a lot of information that I consider even more important (CPU features, GPU properties like CUDA compute capability or OpenCL level) isn't even stored in the DB.
First a question: Has anyone involved in this discussion spoken to the Information Commissioner in an EU country? Background to the question: I spent some time talking to a couple of them about the use and storage of videos and found them to be very clear and helpful in establishing the local "policies" which set out the what, how and when of the recordings; these discussions also showed how many people have wrong ideas about the GDPR's scope and intent.
@robsmith1952, we had two people in the discussion who have a good and quite recent experience with the GDPR.
The GDPR requires you to collect, process and in particular export only information that is needed for documented legitimate reasons and purpose.
You are talking here about personal information. The information we are collecting in BOINC is not personal.
we had two people in the discussion who have a good and quite recent experience with the GDPR.
Would it be too much to ask to disclose who the people were that decided that "the initial implementation of GDPR compliance was too strict", apparently for the whole of BOINC?
@bema-aei,
- David P. Anderson
- Warren T. B. Lucas
- James C. Owens
- Vitalii Koshura (me)
Let's focus on the question of what to include in the XML stats export. The web code currently shows, for each user,
- user name
- URL
- avg and total credit
- user ID
- signup date
- country

See, for example, https://boinc.berkeley.edu/test/show_user.php?userid=1321
This data contributes to community functions. No users have ever complained about it.
The web site shows all this regardless of 'consent' settings, which are therefore meaningless. Someone could scrape this data for all users if they wanted.
Also: the web site shows a user's hosts unless the "don't show hosts" flag is set in their project preferences. Host data could also be scraped. So the consent setting is meaningless; only the prefs setting matters.
One item not shown on the web is the user's cross-project ID. I don't view this as private data. All it does is match accounts on different projects.
So it seems to me that what we should do is:
- export the above user data, including CPID, in XML
- export all hosts unless user has "don't show hosts" in their project prefs.
from my point of view: we need to know
Who is "we" (as in entity/body, not individual person)? I still understand this to be mainly the BOINC software project, not the projects using BOINC. Only the latter are the relevant data controllers in this discussion. As you correctly said, "BOINC doesn't collect any personal information", so it effectively (legally) doesn't matter here. It's the data controllers, i.e. the downstream projects that "carry all duties, responsibilities and legal consequences", so it's entirely up to them how they interpret the GDPR and how they act accordingly.
- What fraction of all active BOINC projects do you think you need to comply with your proposal such that the figures exported become meaningful for what you're trying to achieve?
- Why do you need these details (from ideally all projects)?
- What happens if you don't get them?
Has anyone involved in this discussion spoken to the Information Commissioner in an EU country?
Not sure what you mean exactly by "Information Commissioner" but, yes, we as an EU project (which is part of one of the largest scientific organizations) have a "data protection officer" since we're legally obliged to, and she was involved in the way we enact the GDPR as well as in defining our privacy policy.
Thus it's a pretty bold statement (not by you) to say "people had no understanding of what the GDPR is about". As if there is a universal truth on how to interpret any given law, like the GDPR. There isn't. That's why courts exist.
Someone could scrape this data for all users if they wanted
Sure! But that's not what this proposal is about. This isn't about the processing of data on your own site. This proposal is about replacing the current opt-in for data transfers with an opt-out. Data transfers come with a lot of legal baggage for the data controllers (us) and we are already taking that extra load to provide the stats sites with data. But we still need a legal basis to do so and the direct user's consent is the best way to establish just that. Any of the other options are arguably hard to justify.
Yes, we can agree to disagree on what might constitute personal data but that's only for the data controllers to decide. Should they decide that personal data might be involved they almost certainly will need the user's consent, and opt-out doesn't fulfill the conditions for consent.
Not sure what you mean exactly by "Information Commissioner" but, yes, we as an EU project (which is part of one of the largest scientific organizations) have a "data protection officer" since we're legally obliged to, and she was involved in the way we enact the GDPR as well as in defining our privacy policy.
The "Information Commissioner" is the part of a national government that is charged (by the national government & EU) to administer the GDPR regulations within that country. Your "Data Protection Officer" will have set up your organisation's policies and procedures in consultation with, and under the guidance of, the IC. (For some pan-European organisations this will be the EU's IC). As you say, there's a lot of legal stuff that has to be considered, but the underlying basis is that the individual's privacy and property is protected from "unwelcome attention of those who would harm the person's privacy or property".
Your "Data Protection Officer" will have set up your organisation's policies and procedures in consultation with, and under the guidance of, the IC.
Thanks for the clarification. You're correct. Which means my answer to your original question is "Yes, we have".