Add option to anonymize acquisition datetimes in scans.tsv

Open: tsalo opened this issue 1 year ago • 1 comment

Following a discussion in today's informatics scrum, I was thinking that it would be nice to be able to anonymize acquisition datetimes in the scans.tsv files (and potentially in sidecar JSON files). @mattcieslak thought this could be made part of the purge-metadata command.

Desiderata:

  1. Set the first scan's acquisition date to 1800-01-01.
  2. Give users the option to either anonymize the full datetime or just anonymize the date (i.e., retain the time of day).
  3. Preserve relative timing between scans in each session.
  4. Preserve relative timing between sessions.
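
A minimal sketch of how the four points above could fit together, assuming all of a subject's `acq_time` values (pooled across sessions) are available as ISO 8601 strings in one consistent format. The names `shift_acq_times`, `keep_time_of_day`, and `BASELINE` are hypothetical and are not part of CuBIDS. Applying a single offset per subject is what keeps the spacing within and between sessions intact.

```python
import pandas as pd

# Hypothetical baseline taken from the first desideratum.
BASELINE = pd.Timestamp("1800-01-01")


def shift_acq_times(acq_times, keep_time_of_day=True, baseline=BASELINE):
    """Shift one subject's acquisition datetimes so the earliest lands on `baseline`.

    keep_time_of_day=True anonymizes only the date (the clock time survives);
    keep_time_of_day=False anonymizes the full datetime (first scan at midnight).
    """
    times = pd.to_datetime(acq_times)
    first = times.min()
    if keep_time_of_day:
        # Drop the clock time from the reference so it is not subtracted out.
        first = first.normalize()
    # One offset for every scan preserves all relative timing.
    return times - (first - baseline)


acq = pd.Series(
    [
        "2023-08-17T14:08:00",  # session 1, scan 1
        "2023-08-17T14:30:00",  # session 1, scan 2
        "2023-09-07T12:09:00",  # session 2, scan 1
    ]
)
date_only = shift_acq_times(acq)
full = shift_acq_times(acq, keep_time_of_day=False)
```

With the date-only option, the first scan in this example becomes 1800-01-01 at 14:08 and the second session starts 21 days later; with the full option, the first scan becomes 1800-01-01 at midnight.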

tsalo, Aug 17 '23 14:08

Here's some code I've used to do this in another project:

"""Anonymize acquisition datetimes for a dataset.

Works for both longitudinal and cross-sectional studies, provided each
subject's scans files live in ses-* directories. The time of day is
preserved, but the date of each subject's first scan is set to January 1st,
1800. In a longitudinal study, every session is shifted by the same offset
as the first session, so that the time between sessions is preserved.

Overwrites scan tsv files in dataset. Only run this *after* data collection
is complete for the study, especially if it's longitudinal.
"""
import os
from glob import glob

import pandas as pd
from dateutil import parser

if __name__ == "__main__":
    dset_dir = "/path/to/dset"

    bl_dt = parser.parse("1800-01-01")

    subject_dirs = sorted(glob(os.path.join(dset_dir, "sub-*")))
    for subject_dir in subject_dirs:
        sub_id = os.path.basename(subject_dir)
        print(f"Processing {sub_id}")

        scans_files = sorted(glob(os.path.join(subject_dir, "ses-*/*_scans.tsv")))

        for i_ses, scans_file in enumerate(scans_files):
            ses_dir = os.path.dirname(scans_file)
            ses_name = os.path.basename(ses_dir)
            print(f"\t{ses_name}")

            df = pd.read_table(scans_file)
            if i_ses == 0:
                # Anonymize in terms of first scan for subject.
                first_scan = df["acq_time"].min()
                first_dt = parser.parse(first_scan.split("T")[0])
                diff = first_dt - bl_dt

            # Shift every scan in this session by the subject-level offset.
            acq_times = df["acq_time"].apply(parser.parse)
            acq_times = (acq_times - diff).astype(str)
            # Restore the ISO 8601 "T" separator between date and time.
            df["acq_time"] = acq_times.str.replace(" ", "T")

            # Delete the original file instead of just overwriting it, for Datalad.
            os.remove(scans_file)

            df.to_csv(
                scans_file,
                sep="\t",
                lineterminator="\n",
                na_rep="n/a",
                index=False,
            )
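
One case the script above does not cover is a scans file with missing acquisition times: `pd.read_table` reads `n/a` as NaN by default, and `parser.parse` cannot handle a NaN. The following is a rough sketch of one way to shift only the rows that have a value, assuming ISO 8601 strings in a consistent format. The helper name `shift_scans_table` is hypothetical, and the output format used here drops fractional seconds.

```python
import io

import pandas as pd


def shift_scans_table(df, offset):
    """Return a copy of a scans table with non-missing acq_time values shifted."""
    out = df.copy()
    # Missing entries become NaT instead of raising.
    times = pd.to_datetime(out["acq_time"], errors="coerce")
    # strftime leaves NaT rows as NaN, so na_rep="n/a" still applies on write.
    out["acq_time"] = (times - offset).dt.strftime("%Y-%m-%dT%H:%M:%S")
    return out


# In-memory stand-in for a *_scans.tsv file.
tsv = (
    "filename\tacq_time\n"
    "anat/sub-01_T1w.nii.gz\t2023-08-17T14:08:00\n"
    "func/sub-01_bold.nii.gz\tn/a\n"
)
scans = pd.read_table(io.StringIO(tsv))
offset = pd.Timestamp("2023-08-17") - pd.Timestamp("1800-01-01")
shifted = shift_scans_table(scans, offset)

buf = io.StringIO()
shifted.to_csv(buf, sep="\t", na_rep="n/a", index=False)
```

Rows without an acquisition time are written back as `n/a`, and they do not take part in computing the offset.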

tsalo, Sep 07 '23 12:09