mne-bids

When anonymizing, allow removing participant age too

Open hoechenberger opened this issue 3 years ago • 27 comments

Converting a small-N study from @SophieHerbst to BIDS and passing anonymize to write_raw_bids(), we found that participants.tsv still contained the participants' ages. This is probably intentionally so, but it can cause trouble in studies with small numbers of participants, where age could be used to allow for a post-hoc association with a particular person.

Therefore, anonymization should optionally drop age from participants.tsv as well. I'm not sure about the API though, as currently, write_raw_bids()'s anonymize parameter accepts a dictionary that is then directly passed to Raw.anonymize(). I wonder if we could add an additional dictionary key, keep_age=False, in alignment with keep_his=False that we currently have.

hoechenberger, May 11 '21 14:05

if we remove age from fif files we should remove it from participants.tsv

we should be consistent. See keep_his parameter

agramfort, May 11 '21 15:05

The FIF only stores the date of birth, if I'm not mistaken

hoechenberger, May 11 '21 17:05

Yes but to me if we remove date of birth we should remove age from participants.tsv

Does it make sense?

agramfort, May 11 '21 19:05

@agramfort

Yes but to me if we remove date of birth we should remove age from participants.tsv

Yes and no.

I believe it was a conscious decision not to remove the age, because age is sometimes required / extremely useful for certain analyses – even if all other personal identifying information (PII) has been dropped from the data.

Imagine the research question of "brain age" vs "calendar age" that @dengemann is working on. Here, it could be imperative to retain participants' age, even when sharing otherwise anonymized data.

Therefore I believe we should allow users to anonymize the data while retaining age, even though the date of birth gets removed.

Thoughts?

ping @sappelhoff

hoechenberger, May 12 '21 06:05

I think it should be optional. The question is whether someone can easily match the participant with an external description file. If the goal is to anonymize the data, age must of course be removed.

dengemann, May 12 '21 06:05

How would you imagine the workflow to be?

Say you've measured 50 participants, and calendar age plays a central role in the analysis you published in Nature. Of course you want to make the data publicly available. So you anonymize it – including removal of age. But now no-one will be able to replicate your published analysis anymore, because the important variable "age" is now missing. Are they supposed to get in touch with you and ask for the age associated with each participant ID?

hoechenberger, May 12 '21 06:05

maybe add just a new function in mne_bids to a posteriori remove age?

or as anonymize can be a dict we can add a valid key to say if age should be written

anonymize=dict(write_age=False)
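
A minimal sketch of what an a-posteriori helper might look like, using only the standard library. The function name `drop_columns_from_participants` is hypothetical and not part of MNE-BIDS; the sketch works on the text content of a `participants.tsv` and simply removes the requested columns.

```python
import csv
import io


def drop_columns_from_participants(tsv_text, columns=("age",)):
    """Return participants.tsv content with the given columns removed.

    Hypothetical helper for illustration only; not an MNE-BIDS function.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Keep every header entry that was not requested for removal.
    kept = [name for name in reader.fieldnames if name not in columns]
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=kept,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # silently skip the dropped columns
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


example = (
    "participant_id\tage\tsex\thand\n"
    "sub-01\t34\tF\tR\n"
    "sub-02\t29\tM\tL\n"
)
cleaned = drop_columns_from_participants(example)
print(cleaned)
```

A real implementation would presumably also have to update the `participants.json` sidecar so that it no longer describes the removed column.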

agramfort, May 12 '21 07:05

or as anonymize can be a dict we can add a valid key to say if age should be written

that sounds good to me, re-using our existing API, adding one more option with a sensible default (remove age)

sappelhoff, May 12 '21 07:05

Ok. Current API is:

    anonymize : dict | None
        If `None` (default), no anonymization is performed.
        If a dictionary, data will be anonymized depending on the dictionary
        keys: ``daysback`` is a required key, ``keep_his`` is optional.

        ``daysback`` : int
            Number of days by which to move back the recording date in time.
            In studies with multiple subjects the relative recording date
            differences between subjects can be kept by using the same number
            of ``daysback`` for all subject anonymizations. ``daysback`` should
            be great enough to shift the date prior to 1925 to conform with
            BIDS anonymization rules.

        ``keep_his`` : bool
            If ``False`` (default), all subject information next to the
            recording date will be overwritten as well. If True, keep subject
            information apart from the recording date.
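
To make the ``daysback`` description concrete, here is a rough sketch of how one might pick a single value that moves every recording date before 1925 while preserving the relative differences between recordings. The helper `min_daysback` and the example dates are made up for illustration; any upper limit a file format may impose on the shift is ignored here.

```python
from datetime import date, timedelta


def min_daysback(recording_date, cutoff=date(1925, 1, 1)):
    """Smallest number of days that moves recording_date before cutoff.

    Hypothetical helper for illustration only.
    """
    if recording_date < cutoff:
        return 0
    return (recording_date - cutoff).days + 1


# Made-up recording dates for two subjects.
recordings = [date(2021, 5, 11), date(2020, 11, 3)]

# Use the same daysback for all subjects so relative differences survive.
daysback = max(min_daysback(d) for d in recordings)
shifted = [d - timedelta(days=daysback) for d in recordings]

for original, moved in zip(recordings, shifted):
    print(original, "->", moved)
```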

My proposal:

    anonymize : dict | None
        If `None` (default), no anonymization is performed.
        If a dictionary, data will be anonymized depending on the dictionary
        keys: ``daysback`` is a required key, ``keep_his`` and ``keep_age`` are optional.

        ...

        ``keep_age`` : bool
            Whether to retain age information even when ``keep_his=False``. This can be used
            to remove the date of birth and all other personal identifying information from the data,
            while still keeping the age in ``participants.tsv``. If ``False`` (default), remove age when
            ``keep_his=False``. If ``True``, retain age.
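
One way the proposed key could be handled internally is sketched below, as an assumption rather than actual MNE-BIDS code: ``daysback`` and ``keep_his`` are forwarded to ``Raw.anonymize()``, while ``keep_age`` is consumed separately and only decides what ends up in ``participants.tsv``. The function name `split_anonymize_options` is hypothetical, and treating ``keep_his=True`` as implying that age is kept is one reading of the docstring above.

```python
def split_anonymize_options(anonymize):
    """Split the ``anonymize`` dict into Raw.anonymize() kwargs and keep_age.

    Hypothetical helper for illustration; not part of MNE-BIDS.
    Returns (raw_kwargs, keep_age). raw_kwargs is None if no
    anonymization was requested.
    """
    if anonymize is None:
        # No anonymization at all: nothing is removed, including age.
        return None, True

    allowed = {"daysback", "keep_his", "keep_age"}
    unknown = set(anonymize) - allowed
    if unknown:
        raise ValueError(f"Unknown anonymize keys: {sorted(unknown)}")
    if "daysback" not in anonymize:
        raise ValueError("`daysback` is a required key")

    keep_his = anonymize.get("keep_his", False)
    # Only these two keys are understood by Raw.anonymize().
    raw_kwargs = {"daysback": anonymize["daysback"], "keep_his": keep_his}
    # Assumption: age is only ever dropped when keep_his is False.
    keep_age = keep_his or anonymize.get("keep_age", False)
    return raw_kwargs, keep_age


raw_kwargs, keep_age = split_anonymize_options(dict(daysback=123, keep_age=True))
print(raw_kwargs, keep_age)
```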

hoechenberger, May 12 '21 07:05

LGTM. just a small voice in my head whether we'll have a discussion about "remove everything EXCEPT <insert some other HIS aspect here>", so whether keep_age=True should be a keep=["age", "...."].

Or does that fall under YAGNI? :)

sappelhoff, May 12 '21 08:05

both would work for me.

+0.5 on keep_age key

agramfort, May 12 '21 10:05

so whether keep_age=True should be a keep=["age", "...."].

I imagine this being a little annoying for users:

write_raw_bids(..., anonymize=dict(daysback=123, keep=['age']))

seems a little complex for what we're trying to do here

hoechenberger, May 12 '21 10:05

Just one fly-by comment. I think it's more common to keep the age rather than not keep it. So, I would name the argument drop instead of keep so most users don't have to specify it

jasmainak, May 12 '21 14:05

So, I would name the argument drop instead of keep so most users don't have to specify it

But MNE has this keep_his thing, so I'd like to call the param keep_* for consistency. We can default it to True, though!

hoechenberger, May 12 '21 19:05

But MNE has this keep_his thing, so I'd like to call the param keep_* for consistency. We can default it to True, though!

Unless we change the MNE-BIDS API:

anonymize = dict(daysback=123, drop_pii=True, drop_age=False)

(no-one knows what his means, do they???)

hoechenberger, May 12 '21 19:05

In fact, we wouldn't even need drop_pii or keep_his, because if a user requests to anonymize, of course they want to remove personal identifying info too, no? scratches head

hoechenberger, May 12 '21 19:05

Okay yes indeed, we can make the default True so the user doesn't have to specify it! The anonymize dict was made so that you could pass it to mne.anonymize_info.

Regarding his, see here: https://github.com/mne-tools/mne-matlab/blob/master/matlab/fiff_define_constants.m#L241

jasmainak, May 13 '21 23:05

also see this: https://mne-cpp.github.io/pages/documentation/anonymize.html

jasmainak, May 13 '21 23:05

I've been thinking about this and I'd like to change our anonymization-related API.

Currently, we have this anonymize parameter in write_raw_bids(), which is supposed to be a dictionary whose key-value pairs will be passed to Raw.anonymize().

I don't think this is very intuitive for several reasons:

  • If I read an imperative verb like anonymize, I'd expect a boolean – True to anonymize, False to not anonymize
  • If I want to anonymize the data, I want the data to be … anonymized. I don't want to and shouldn't need to think about this keep_his thing – it should always be False, as I see no reason for it not to be False if a user wants to anonymize their data
  • Considering that keep_his is superfluous, sticking with the current write_raw_bids() signature would leave us with anonymize=dict(daysback=123) – not great. I'd prefer to have a separate parameter to specify the "days back", e.g., anonymize_daysback

Now that I have typed this out, I'm wondering whether we could simply drop anonymize and add anonymize_daysback: None | int. If None, don't anonymize. And if we do anonymize, also erase the his and the participant's age.

WDYT?

hoechenberger, Jun 07 '21 20:06

I thought about this some more and I think I've changed my mind and would like to keep anonymize as a dict, but it should accept the following keys:

  • daysback: int (like it does already)
  • age: bool = True (control whether age should be dropped or not)

WDYT?

hoechenberger, Jun 07 '21 20:06

yes for camcan you anon but you keep age. you think keep_his is too cryptic? how about gender?

agramfort, Jun 07 '21 20:06

yes for camcan you anon but you keep age. you think keep_his is too cryptic? how about gender?

Not sure I understand your question – keep_his refers to an ID from the hospital information system (if I'm not mistaken), and MNE can remove the his_id from the info, which I assume we should always do when anonymizing.

Gender/sex is a good point, and reminded me that even handedness might be an issue for small-N studies. So my proposal:

anonymize = dict(daysback: int, age: bool = True, sex: bool = True, hand: bool = True)

Thoughts?
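
As a rough illustration of how such a dict could act on one row of ``participants.tsv``: the sketch below assumes that ``True`` means "keep the value" and that dropped values are written as ``n/a`` (the BIDS marker for missing data). The function name `anonymize_participant_row` and the `DEFAULTS` mapping are hypothetical.

```python
# Assumed defaults: keep age, sex and handedness unless told otherwise.
DEFAULTS = {"age": True, "sex": True, "hand": True}


def anonymize_participant_row(row, anonymize):
    """Return a copy of ``row`` with unwanted fields replaced by "n/a".

    Hypothetical helper for illustration; not part of MNE-BIDS.
    ``daysback`` is ignored here because it only affects recording dates.
    """
    flags = dict(DEFAULTS)
    flags.update({key: value for key, value in anonymize.items() if key in DEFAULTS})
    return {
        column: ("n/a" if flags.get(column, True) is False else value)
        for column, value in row.items()
    }


row = {"participant_id": "sub-01", "age": "34", "sex": "F", "hand": "L"}
print(anonymize_participant_row(row, dict(daysback=123, hand=False)))
```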

hoechenberger, Jun 07 '21 21:06

I am just a bit worried about deviating from the mne-python API

agramfort, Jun 08 '21 08:06

I am just a bit worried about deviating from the mne-python API

Yes, but we'll have to do that anyway, as we want to optionally allow keeping age, handedness, sex, …

hoechenberger, Jun 09 '21 09:06

Just chiming in here with a random question I had related to mne-bids anonymizing: is it common to require age, handedness and sex to be scrubbed to be "anonymized"? I know in the USA that's not considered PHI (vs birthdate, recording date).

Is the reason to just provide an extra degree of anonymity?

adam2392, Jun 09 '21 15:06

maybe add just a new function in mne_bids to a posteriori remove age?

I like the a-posteriori idea too FWIW because sometimes you write stuff to BIDS and then want to add this extra layer of anonymization, and end up either having to do it manually, or rewrite.

adam2392, Jun 09 '21 15:06

Is the reason to just provide an extra degree of anonymity?

Imagine a small-N study you conduct among your colleagues, and maybe only one of them is left-handed or one is much younger or older than the others... then those data could easily be used to deanonymize things.
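
The risk can be illustrated with a small sketch that lists participants whose combination of age, sex and handedness is unique within the sample. The data and the function name `unique_participants` are made up for illustration.

```python
from collections import Counter


def unique_participants(rows, quasi_identifiers=("age", "sex", "hand")):
    """Return IDs of participants whose quasi-identifier combination is unique."""
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    return [
        row["participant_id"]
        for row, key in zip(rows, keys)
        if counts[key] == 1
    ]


# Made-up small-N sample.
participants = [
    {"participant_id": "sub-01", "age": 31, "sex": "F", "hand": "R"},
    {"participant_id": "sub-02", "age": 31, "sex": "F", "hand": "R"},
    {"participant_id": "sub-03", "age": 31, "sex": "F", "hand": "L"},
    {"participant_id": "sub-04", "age": 58, "sex": "M", "hand": "R"},
]

at_risk = unique_participants(participants)
at_risk_without_hand = unique_participants(participants, ("age", "sex"))
print(at_risk, at_risk_without_hand)
```

In this toy sample, dropping handedness hides the single left-handed participant among the others, while the one much older participant remains identifiable by age alone.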

hoechenberger, Jun 09 '21 15:06