ds000113 nifti file content differs from original openfmri upload

Open mih opened this issue 3 years ago • 5 comments

We are attempting to reconvert the original data release of ds000113 (from the old openfmri days) from DICOMs in order to get maximum metadata and, importantly, more valid metadata.

In doing so we investigated with @m-wierzba how the original upload to openfmri differs from the dataset that is presently downloadable from openneuro. We focused on the file content (checksum-based match), because the filenames and dataset layout have obviously changed.
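A minimal sketch of such a checksum-based match, assuming both releases are available as local directory trees. The function names and the choice of SHA-256 are illustrative, not the exact procedure we used. Files are matched purely by content, so renamed or moved files still pair up. Note that compressed files are compared byte for byte here, so a .nii.gz can also differ through gzip metadata alone.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_by_checksum(root):
    """Map each checksum to the relative paths of the files that have it."""
    index = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            index.setdefault(sha256_of(path), []).append(relative)
    return index


def unmatched_files(old_root, new_root):
    """List files under old_root whose content appears nowhere under new_root."""
    old_index = index_by_checksum(old_root)
    new_index = index_by_checksum(new_root)
    return sorted(
        name
        for checksum, names in old_index.items()
        if checksum not in new_index
        for name in names
    )
```

Calling `unmatched_files("anondata", "ds000113")` (both paths hypothetical) would list the original files that have no byte-identical counterpart in the download.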

It seems that most NIfTI images have been altered. We suspect that some kind of header normalization procedure was applied. It would be instrumental for us to understand what exactly was done.

Below is the diff of the output of fslhd on one and the same file, comparing original upload to download offer.

Specifically, it seems that qform and sform have been equalized, an FSL tool replaced the (arguably more informative) description, and some numerical uncertainty has slightly altered the image affine.

fslhd diff
--- oldhd.txt   2021-04-27 14:24:04.254049923 +0200
+++ newhd.txt   2021-04-27 14:24:19.638214798 +0200
@@ -1,4 +1,4 @@
-filename       ../anondata/sub001/BOLD/task001_run004/bold.nii.gz
+filename       ds000113/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-raw_run-04_bold.nii.gz
 
 sizeof_hdr     348
 data_type      INT16
@@ -18,7 +18,7 @@
 pixdim0        0.000000
 pixdim1        1.400000
 pixdim2        1.400000
-pixdim3        1.539948
+pixdim3        1.539950
 pixdim4        2.000000
 pixdim5        0.000000
 pixdim6        0.000000
@@ -45,23 +45,23 @@
 intent_p3      0.000000
 qform_name     Scanner Anat
 qform_code     1
-qto_xyz:1      -1.399780  0.009211  0.025353  105.693428
-qto_xyz:2      0.003949  1.366110  -0.336755  -66.909798
-qto_xyz:3      0.024505  0.306038  1.502462  -57.281765
+qto_xyz:1      -1.399780  0.009212  0.025360  105.693001
+qto_xyz:2      0.003948  1.366110  -0.336755  -66.909798
+qto_xyz:3      0.024512  0.306037  1.502464  -57.281799
 qto_xyz:4      0.000000  0.000000  0.000000  1.000000
 qform_xorient  Right-to-Left
 qform_yorient  Posterior-to-Anterior
 qform_zorient  Inferior-to-Superior
-sform_name     Unknown
-sform_code     0
-sto_xyz:1      0.000000  0.000000  0.000000  0.000000
-sto_xyz:2      0.000000  0.000000  0.000000  0.000000
-sto_xyz:3      0.000000  0.000000  0.000000  0.000000
-sto_xyz:4      0.000000  0.000000  0.000000  0.000000
-sform_xorient  Unknown
-sform_yorient  Unknown
-sform_zorient  Unknown
+sform_name     Scanner Anat
+sform_code     1
+sto_xyz:1      -1.399780  0.009212  0.025360  105.693001
+sto_xyz:2      0.003948  1.366110  -0.336755  -66.909798
+sto_xyz:3      0.024512  0.306037  1.502464  -57.281799
+sto_xyz:4      0.000000  0.000000  0.000000  1.000000
+sform_xorient  Right-to-Left
+sform_yorient  Posterior-to-Anterior
+sform_zorient  Inferior-to-Superior
 file_type      NIFTI-1+
 file_code      1
-descrip        mi_ep2d_flashref_bold_160_iPat3_1.4mm_36sl_R4
+descrip        FSL5.0
 aux_file       
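To put a number on the affine change, here is a small sketch that uses only the qto_xyz values printed in the diff above (pure Python, no imaging library). The helper names are illustrative. It reports the largest element-wise difference between the two matrices, and the length of the third column of each, which should correspond to pixdim3 if the qform encodes voxel size as column scaling.

```python
import math

# qto_xyz rows copied from the fslhd diff above.
OLD_QFORM = [
    [-1.399780, 0.009211, 0.025353, 105.693428],
    [0.003949, 1.366110, -0.336755, -66.909798],
    [0.024505, 0.306038, 1.502462, -57.281765],
    [0.0, 0.0, 0.0, 1.0],
]
NEW_QFORM = [
    [-1.399780, 0.009212, 0.025360, 105.693001],
    [0.003948, 1.366110, -0.336755, -66.909798],
    [0.024512, 0.306037, 1.502464, -57.281799],
    [0.0, 0.0, 0.0, 1.0],
]


def max_abs_difference(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(
        abs(x - y)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


def column_norm(affine, column):
    """Euclidean length of one column of the upper-left 3x3 block."""
    return math.sqrt(sum(affine[row][column] ** 2 for row in range(3)))


print("max |old - new|:", max_abs_difference(OLD_QFORM, NEW_QFORM))
print("old third-column length:", column_norm(OLD_QFORM, 2))
print("new third-column length:", column_norm(NEW_QFORM, 2))
```

The largest difference is in the x translation (105.693428 vs 105.693001); the rotation and scaling terms differ only in the last printed digits.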

If we understand what was done, and if we are successful in creating a modern day BIDS dataset from the original DICOMs and other raw material, we would like to propose that as an update of the present ds000113 -- as an attempt to reunite the currently disjoint histories of the datasets that we maintain.

Thx in advance!

mih • Apr 27 '21 13:04

This was before my time. I think @chrisgorgo and @suyashdb (possibly @jbwexler?) are most likely to remember what happened here. If I had to guess, it might just be that fslreorient2std was applied to all data.

I don't see anything in your old header that demands normalizing, such as leaked PHI, so would be +1 to reverting to the original files unless someone who remembers the event can provide a good argument.

effigies • Apr 27 '21 13:04

Thanks for the swift reply!

re fslreorient2std: Great, we will try that out and see if it can explain the diff.
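A rough sketch of how that check could be scripted, assuming FSL is installed and `fslhd` is on the PATH. The helper names are made up for illustration. The idea is to run fslreorient2std on an original file, dump both headers with fslhd, and list the fields that still differ, ignoring the filename.

```python
import subprocess


def fslhd_text(path):
    """Return the fslhd output for an image (requires FSL on the PATH)."""
    result = subprocess.run(
        ["fslhd", str(path)], check=True, capture_output=True, text=True
    )
    return result.stdout


def parse_header(text):
    """Turn 'field   value' lines of fslhd output into a dict."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if parts:
            fields[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return fields


def changed_fields(old_text, new_text, ignore=("filename",)):
    """Map each differing field to its (old, new) pair of values."""
    old, new = parse_header(old_text), parse_header(new_text)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(set(old) | set(new))
        if key not in ignore and old.get(key) != new.get(key)
    }
```

If `changed_fields(fslhd_text(reoriented), fslhd_text(downloaded))` came back empty for a reoriented original and its openneuro counterpart (both paths hypothetical), that would support the fslreorient2std guess. Any remaining fields would point to some other processing step.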

re update: I know that you only accept fast-forward changes. So here is what the change would look like -- if all goes well.

  • added subdataset: referencing our project superdataset that binds together all studyforrest components
  • datalad-run record, capturing a datalad-copyfile run that updates the files from a defined state of our canonical source superdataset

IOW two additional commits (per update, if we think long-term). This way the provenance link is precise, without forcing the repository history to become one. The linked subdataset (our superdataset) will live in some public place like github, so consumers can obtain it, and do further inspection, if desired.

Do you see issues caused by such an approach?

mih • Apr 27 '21 13:04

The main issues I see are:

  1. Where is the superdataset going to be placed? sourcedata/? Something .bidsignored?
  2. I'm not sure if we smoothly handle git submodules yet. That's definitely on the roadmap if it's not yet complete. @nellh would probably know best.

Thinking about it, --ff-only can incorporate merge commits, so I think that we can accept a merged history (again, @nellh is the authority), if you can get a clean merge and the resulting dataset validates. Force pushes are definitely disallowed, and I'm not positive about grafts but would recommend against trying anyway.

Would this be cleaner, or is the copyfile approach cleanest?

effigies • Apr 27 '21 13:04

> 1. Where is the superdataset going to be placed? sourcedata/? Something .bidsignored?

Yes, likely.

> 2. I'm not sure if we smoothly handle git submodules yet. That's definitely on the roadmap if it's not yet complete. @nellh would probably know best.

Will keep that in mind. Thx!

> Thinking about it, --ff-only can incorporate merge commits, so I think that we can accept a merged history (again, @nellh is the authority), if you can get a clean merge and the resulting dataset validates. Force pushes are definitely disallowed, and I'm not positive about grafts but would recommend against trying anyway.
>
> Would this be cleaner, or is the copyfile approach cleanest?

The rationale for not merging history is that we do not have a single local repository that contains all the pieces. We have the four individual ones from which the original four openfmri datasets were created. The present openneuro ds000113 is a later amalgamation of those, not done by us (upstream). So we aim to provide a continuation for these two different types of entities, maintain an accurate dataset on openneuro, but also stay linked to the data descriptors that we can no longer change.

mih • Apr 27 '21 16:04

I searched my Slack and email history but unfortunately didn't find anything related to the alteration of NIfTI files in this particular dataset. I do recall there was a period when we commonly used fslreorient2std before rerunning pydeface if the initial pydeface failed. There was also a brief period when we would use FSL to change pixdim4 if it didn't match the paper, until we decided it was better to leave the header as is. I recall that changing pixdim4 would also alter other fields slightly.

jbwexler • Apr 29 '21 18:04