datalad-neuroimaging
ENH: run-procedure for BIDS dataset configuration
I'm wondering if it would be useful to add a run-procedure to this extension to configure BIDS+datalad datasets such that all files in the root BIDS directory are committed to git while all the rest of the files, irrespective of type, go to the annex?
I'm thinking of use-cases related to distributed dataset-level metadata extraction and catalog generation. Data in the annex (typically all subfolders of the root BIDS directory) would need to be protected because of data privacy concerns, while data in the root directory (participants.tsv, dataset_description.json, any JSON sidecar files defined at the root level, any additional dataset-level metadata added at root level) are typically considered non-sensitive or have specifically been edited to be so, and can therefore be considered safe to commit to git.
Configuring a dataset like that (as opposed to annexing all files in the dataset) would allow sufficient metadata extraction on any clone without requiring access to the annex.
The run procedure would add something like this to .gitattributes:
* annex.largefiles=anything
/* annex.largefiles=nothing
The procedure (let's call it rootfiles2git) would be available in this extension because it seems (to me) like it could be generally applicable to BIDS datasets collected in the EU (because of GDPR).
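A minimal sketch of what such a procedure could do, assuming it is a plain Python script that is handed the dataset path and only appends the two rules above to .gitattributes. The function name add_rootfiles2git_rules is hypothetical, and saving the change with datalad afterwards is left out; this is an illustration, not the implementation in this extension.

```python
from pathlib import Path

# Rules from the proposal, in order: annex everything by default,
# then exempt entries that sit directly in the dataset root.
RULES = [
    "* annex.largefiles=anything",
    "/* annex.largefiles=nothing",
]


def add_rootfiles2git_rules(dataset_path):
    """Append RULES to <dataset>/.gitattributes unless both are already there.

    Returns True if the file was written, False if nothing had to change.
    """
    attrs = Path(dataset_path) / ".gitattributes"
    existing = attrs.read_text().splitlines() if attrs.exists() else []
    if all(rule in existing for rule in RULES):
        return False
    # Later lines take precedence in .gitattributes, so the root-level
    # exemption has to come after the catch-all rule.
    attrs.write_text("\n".join(existing + RULES) + "\n")
    return True
```

When run as a procedure, the script would presumably call this function with the dataset path it receives as its first command-line argument.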
WDYT @yarikoptic @bpoldrack @mslw @cpernet @loj
That's the 'standard' way to approach a BIDS dataset. It makes sense to see root directory info (=git) while the rest goes into the annex (it also makes for an easy catalog :-)) 👍🏻
Generally, I think it does make sense, but the problem lies in
"or have specifically been edited to be so"
Editing something to be so implies that there was a state before that, which must never have been datalad save'd. Such a setup doesn't really allow for mistakes, since you can't easily get things out of git again. Kinda the point of version control.
That's why I'd hesitate recommending a specific config from the start. It really depends when in your workflow you'd want to apply that.
Fair point, although that problem/challenge exists whether one applies a run-procedure or not. It is something that the people managing the data would need to consider in any case when they turn it into a datalad dataset.
Yes, but a default that annexes everything doesn't lead you in a trap.
Public and restricted content can still be separated in terms of storage. May be a little less convenient, but you don't get in a situation that is really hard to fix.
To be fair: The existence of a procedure isn't exactly a default. I'm a bit worried, though, that it goes the way of text2git: pointed out as a convenience in a toy example in documentation, and then everybody starts using it without realizing its disadvantages.
I think this is a sane approach, with two caveats (though keep in mind that my knowledge of the BIDS spec might not be up to date):
- With the inheritance principle for BIDS metadata, there is no guarantee that a metadata file in the top-level directory describes all matching data, as values defined at the top level can be overridden by files deeper in the file tree. E.g. fMRI task information (TaskName, RepetitionTime, SliceTiming, etc.) in ...task-xyz_bold.json can be defined on any level (either top level or just next to the specific _bold.nii file). It seems to me that it has become a fairly common practice to promote these to top level (and for good reason), but technically there is no guarantee of dataset scope.
- Speaking of participants.tsv, this is a recommended file, and commonly used optional columns in participants.tsv files are age, sex, handedness, strain, and strain_rrid. I wonder what the status of these is.
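To illustrate the first caveat, here is a small sketch of how the inheritance principle plays out, assuming each level's sidecar has already been loaded into a dict. The function name resolve_sidecar and the example values are hypothetical.

```python
def resolve_sidecar(levels):
    """Merge sidecar metadata from the dataset root down to the data file.

    `levels` is ordered from the top level to the most specific level;
    values defined deeper in the tree override those defined above them.
    """
    merged = {}
    for metadata in levels:
        merged.update(metadata)
    return merged


# Hypothetical values for illustration only.
root_level = {"TaskName": "xyz", "RepetitionTime": 2.0}
subject_level = {"RepetitionTime": 1.5}

# The root-level file alone would report RepetitionTime 2.0, but the value
# that actually applies to this subject's data is the overriding one.
effective = resolve_sidecar([root_level, subject_level])
```

A clone without access to the deeper sidecar would only see the root-level values, which is why the top-level file cannot be assumed to describe the whole dataset.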
My biggest concern with this approach is when participants need to be removed. If the participants.tsv file or any other top-level file that contains participant data is saved to git, this becomes problematic.
I agree, I would be hesitant to put anything other than a README and a LICENSE into git by default.
Code is another candidate, but only if the file identifiers are at minimum pseudonymized.
"I agree, I would be hesitant to put anything other than a README and a LICENSE into git by default."
and CHANGE(S|LOG), with all sensible/supported extensions, is indeed the "safest"! Worth something like cfg_minimal2git or alike (it probably isn't really BIDS specific).
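As a rough sketch of what such a minimal configuration could amount to, the snippet below builds .gitattributes rules for the file names mentioned above. No cfg_minimal2git procedure is assumed to exist; the function name minimal2git_rules and the trailing * used to cover extensions are assumptions made for illustration (the wildcard would also match names such as README_old).

```python
# Base names suggested in the discussion; extensions are covered by a wildcard.
MINIMAL_BASENAMES = ("README", "LICENSE", "CHANGES", "CHANGELOG")


def minimal2git_rules(basenames=MINIMAL_BASENAMES):
    """Return .gitattributes lines that annex everything except the given
    root-level files."""
    rules = ["* annex.largefiles=anything"]
    for name in basenames:
        # The leading slash anchors the pattern to the dataset root.
        rules.append(f"/{name}* annex.largefiles=nothing")
    return rules
```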
There is always a "hard to strike" balance in what to put into git and what into git-annex. For heudiconv, all .json and .tsv go into git besides the _scans.tsv, since those are to contain full dates. The minimal above would be "safest", but then forget about lovely git grep etc., which I do like to use quite often in BIDS etc. datasets.
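The rule described for heudiconv can be paraphrased as a small predicate. This is only a sketch of the rule as stated in this comment, not heudiconv's actual code, and the name goes_to_git is hypothetical.

```python
from pathlib import PurePosixPath


def goes_to_git(path):
    """Return True if a file would be committed to git under the rule above:
    .json and .tsv files go to git, except *_scans.tsv, which may hold full dates."""
    name = PurePosixPath(path).name
    if name.endswith("_scans.tsv"):
        return False
    return name.endswith((".json", ".tsv"))
```

Under this rule, sidecars and participants.tsv stay searchable with git grep, while imaging data and the scans tables go to the annex.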
Thanks for everyone's input!
FYI @CPernet there is already a standard BIDS config that does the above process to an extent. See here for an update: https://github.com/datalad/datalad-neuroimaging/pull/115.