[WIP] Suggested modifications to directory layout of the bids-study `DatasetType`
This is a continuation of discussion on BIDS DatasetType started in #1972 and #2185. The proposed DatasetTypes include: bids-study, raw, and derivative.
I think this is a fantastic idea with the capacity to curate modular and nested standardized datasets.
However, the directory layout for bids-study proposed in #2185 is suboptimal from a data-visibility perspective. Currently, a bids-study dataset has no root-level subdirectory for the bids-raw dataset. Previous conversations suggested that the bids-raw dataset could be stowed inside the sourcedata directory. Hiding a valid, and probably the most common, DatasetType in this way is not ideal and will likely confuse new users.
So, I would like to make a case for treating sourcedata, bids-raw, and bids-derivative with equal importance by putting them at the same root level inside the bids-study directory tree. In my experience, this is more intuitive and helpful for data management in neuroimaging studies, where different people may handle each of these three DatasetTypes.
The suggested bids- prefix for these directories is mostly meant to avoid the confusion between source and raw data seen in past discussions, and to indicate that each of them can be a stand-alone BIDS dataset.
Happy to hear your thoughts @michellewang, @jbpoline, @yarikoptic, @effigies, @mathdugre, @julia-pfarr, @nburgos, @AliceJoubert, @Adam-Ismaili-92, @surchs
This makes a lot of sense to me and would help everyone differentiate source-data and raw-bids much more easily.
This solution feels much clearer; I think it's a good idea!
@effigies - Thanks for the feedback and for patiently dealing with these multiple intertwined open issues and the hurried PR from my side, a result of mixed-up online and offline conversations. I will respond individually to the inline comments.
Re-posting my comment from bids-specification/bids-website#688 here, since it was misplaced in that PR:
I kind of don't understand why we need to serve all these different cases of where to store what?
Why can't we just say
- root/sourcedata --> out-of-scanner data
- root/raw --> raw data in BIDS
- root/derivatives --> anything starting with preprocessing, in BIDS
- If you have only one of those three, adding a second-level folder is not necessary.
There is a lot to discuss and unpack. And I do like the clarity of sourcedata (non-BIDS), raw (BIDS), derivatives (BIDS). But then questions immediately come up, in particular when taking into account that a "study" BIDS dataset is just the "base" case of any BIDS dataset, only without direct data (sub-*) in it. Questions:
- where do we place non-BIDS derivatives?
- where, for a BIDS derivative dataset, do we place BIDS input datasets used to produce it
  - when they are "raw BIDS"
  - when they are "derivative BIDS"
  - when they are a mix of both
My point is that, ATM, the sourcedata/ and derivatives/ separation, and the lack of further formalization under them, is meant to foster flexibility. Any BIDS dataset can place relevant source data (from the scanner or from other BIDS datasets, raw BIDS or not) under sourcedata/, and then store the derivatives it produces (some of which might be BIDS) under derivatives/. Those derivatives may be produced from the sourcedata/ contents (in the case of "study") and/or from the current BIDS dataset itself (if not "study").
sourcedata/raw within study is just a suggestion since derivatives/ are for "derived from BIDS datasets" hence "raw BIDS" cannot be there by definition.
Going back to items
- 1: it does not need to be all or nothing: the "study" DatasetType just follows what BIDS already defines for the "raw" and "derivative" types -- basic folder structure, files and metadata at the root. As soon as we formalize sourcedata/ or derivatives/ more -- it would benefit from that too. Similarly, the BEP on stimuli tries to further formalize the stimuli/ folder.
- 2: sorry -- not following there, maybe partially due to my weak know-how on phenotype/
- 3: what is StudyType -- we do not have that in BIDS yet, I think?
For the sake of moving forward with this PR, let’s only review and approve/disapprove of the changes made by @nikhil153 and move this bigger discussion to a dedicated meeting (maybe at distribits?). Wdyt @yarikoptic and @effigies, could you do a final review on Nikhil's changes?
Still, I'll post my answers here, to use this PR as the starting point for the following discussions (and so that I don't forget what I wanted to say...):
- 1: I actually did not mean an "all or nothing" approach; my thoughts were very much like yours. At the very base, I assume bids-study means I should follow BIDS wherever possible. Meaning that if I have data in the derivatives folder, it should follow a BIDS spec IF there is one. If there is none, then it's not in BIDS, but that's fine because we don't have a spec for this (yet). So derivatives can hold both non-BIDS and BIDS derivatives. My point here was more about whether we need a bids- prefix for the folders or not. So "it is BIDS" is the umbrella assumption for everything that is in the study, but not a hard requirement.
- 2: This was just me voting for having a /phenotype folder at root level. Since the current phenotype BEP does not make any layout suggestions for /phenotype within bids-study, I feel we have some freedom here to decide where the /phenotype folder should/could be. I did not see anything in the current version of the phenotype BEP that would be violated by having /phenotype at root level in the bids-study layout.
- 3: No, we don't have that yet. We can talk about this another time and keep it separate from this discussion!
Your other questions:
> where do we place non-BIDS derivatives?

This I addressed in my reply to 1.

> where, for a BIDS derivative dataset, do we place BIDS input datasets used to produce it
> - when they are "raw BIDS"
> - when they are "derivative BIDS"
> - when they are a mix of both
This is a point where you and I (and probably BIDS and I) don’t use the same meaning for sourcedata.
- Sourcedata for me is any data in its barest form, like right after its "birth". And only this.
- Raw for me is non-processed data, but data that is in a standard and interoperable format.
- Derivative for me is everything that was processed, no matter if it was processed directly from sourcedata or raw or other derivatives.

For me it does not make a difference whether a dataset was used as an input: it is NOT sourcedata if it is in raw or derivatives form. You, on the other hand, use sourcedata for anything that was an input for something, regardless of its form.
So, for me, the answers would be:
- "where do we place BIDS input datasets used to produce it when they are "raw BIDS"" --> in /raw
- "where do we place BIDS input datasets used to produce it when they are "derivative BIDS"" --> in /derivatives
- "where do we place BIDS input datasets used to produce it when they are a mix of both" --> separated in /raw and /derivatives.
We do have the Sources metadata for the derivatives, so it shouldn’t be an issue if inputs are in different places.
2 cents: the very flexible specification comes at the cost of losing useful information in the standard. I would certainly make "raw"/"rawbids" the only (first?) place to search for a BIDS-validated dataset; that would be useful for tools and for mental clarity: source data == not BIDS, raw/rawbids == BIDS.
I like the idea, obviously, but I wonder how this conflicts with the existing structure, i.e., how the validator behaves and whether 'we' want to differentiate cases. Concretely,
MyData/
|- source
|- derivatives
|- sub-01
vs
MyStudy/
|- bids-source
|- bids-derivatives
|- bids-raw
|- sub-01
- source and bids-source are ignored
- derivatives and bids-derivatives are validated for whatever parts follow BIDS
- MyData (a BIDS dataset) is validated the same as bids-raw, except for modality-agnostic files
- MyStudy has an extra validation (?), and modality-agnostic files are at this level
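The per-directory behavior listed above could be written down as a small lookup. This is only a sketch under assumptions: the directory names come from the MyStudy example, while the action labels and the function name `plan_validation` are made up for illustration and do not correspond to any existing validator API.

```python
from pathlib import Path

# Hypothetical mapping from top-level directory name to validator action,
# following the bullet list above. Not an existing validator API.
ACTIONS = {
    "bids-source": "ignore",
    "bids-derivatives": "validate-bids-parts",
    "bids-raw": "validate-dataset",
}


def plan_validation(study_root):
    """Return {directory name: action} for the subdirectories of study_root."""
    plan = {}
    for entry in sorted(Path(study_root).iterdir()):
        if entry.is_dir():
            # Directories the mapping does not know about are left undecided.
            plan[entry.name] = ACTIONS.get(entry.name, "unspecified")
    return plan
```

Whether MyStudy itself needs an extra validation step is left open here, as it is in the comment.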
@yarikoptic @effigies how is it going with this PR?
Regarding phenotype, pulling it out of the primary dataset seems to run counter to BEP36, which aims to put additional constraints on phenotype. Because many of those rules hinge on relationships between phenotype files and root-level participants.tsv/sessions.tsv, I think this would relax those new rules almost entirely.
I also have a pair of questions:
- 1: In each of sourcedata/, rawdata/ and derivatives/, are the contents of the directory supposed to be a dataset, a collection of datasets, or potentially an arbitrary directory structure with datasets appearing somewhere within? Just to illustrate my meaning:

      study/
        sourcedata/
          dicoms/                         <- second level
          physio/                         <- second level
        rawdata/
          dataset_description.json        <- first level
        derivatives/
          subdirA/
            subdirB/
              dataset_description.json    <- third level
          subdirC/
            subdirD/
              subdirE/
                dataset_description.json  <- fourth level

- 2: Is a validator expected to find BIDS datasets in any/all of these subdirectories and validate them, or do these all remain opaque, and it's the responsibility of the curator/archive to validate subdatasets?
The simple answers: 1) The contents are arbitrary (at least as far as this PR is concerned); 2) No (at least as far as this PR is concerned).
You could go a bit further and say that rawdata/ is expected to be a BIDS dataset, which is a very straightforward directive for a toolwriter to follow (see study, look in rawdata/), but that then rules out the possibility of multiple raw datasets.
I realize these aren't problems Nikhil is introducing, but up to now, there has been a validator behavior for sourcedata/ and derivatives/: ignore their contents. When re-introducing rawdata/, it's an opportunity to think through whether BIDS has anything to say inside these directories.
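To make the "levels" in the illustration concrete, here is a rough sketch of how a tool might locate candidate datasets if it chose to look inside these directories. It assumes that a directory counts as a dataset when it directly contains dataset_description.json; the function name `find_datasets` and the default directory names are hypothetical, and nothing here reflects current validator behavior.

```python
from pathlib import Path


def find_datasets(study_root, top_dirs=("sourcedata", "rawdata", "derivatives")):
    """Return sorted (relative dataset path, level) pairs.

    Assumption: a directory is treated as a dataset if it directly contains
    dataset_description.json. Level 1 means the top-level directory itself.
    """
    study_root = Path(study_root)
    found = []
    for top in top_dirs:
        base = study_root / top
        if not base.is_dir():
            continue
        for desc in base.rglob("dataset_description.json"):
            dataset_dir = desc.parent.relative_to(study_root)
            found.append((dataset_dir.as_posix(), len(dataset_dir.parts)))
    return sorted(found)
```

If rawdata/ were required to be a single dataset, a tool would only need to check for level 1 there; the search over arbitrary depth is what the open question about derivatives/ would require.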
@effigies,
Re: phenotype, if that's the case, then wouldn't it be an issue for the derivative DatasetType as well? Happy to discuss the implication in the BEP36 thread, as I am not sure if it currently considers the newer DatasetTypes (i.e. derivative and study).
Re: questions, my view would be:
- sourcedata: arbitrary
- rawdata: a single BIDS raw DatasetType (i.e. no collection of datasets)
- derivatives: a non-nested (i.e. no derivatives within derivatives) collection of BIDS derivative DatasetTypes.
I think allowing multiple raw datasets would create several complications that are better discussed at a later time - possibly under BIDS-MEGA BEP / BIDS2.0?
For the derivatives directory, allowing a nested sub-directory structure as in your example is okay, as long as there is no nesting of datasets themselves, i.e. no nesting of dataset_description.json files. Hopefully, this would allow enough flexibility for pipeline outputs while avoiding recursion and keeping the validator logic relatively simple.
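As a rough illustration of the "no nesting of dataset_description.json files" rule, the check could look something like the sketch below. The function name `nested_datasets` is hypothetical, and the sketch assumes that the presence of dataset_description.json is what marks a directory as a dataset.

```python
from pathlib import Path


def nested_datasets(derivatives_dir):
    """Return (outer, inner) pairs where dataset `inner` lies inside dataset `outer`.

    Paths are given relative to derivatives_dir. An empty list means the
    derivatives tree satisfies the no-nesting rule described above.
    """
    derivatives_dir = Path(derivatives_dir)
    # Every directory that directly holds a dataset_description.json file.
    roots = sorted(p.parent for p in derivatives_dir.rglob("dataset_description.json"))
    pairs = []
    for outer in roots:
        for inner in roots:
            if inner != outer and outer in inner.parents:
                pairs.append(
                    (
                        outer.relative_to(derivatives_dir).as_posix(),
                        inner.relative_to(derivatives_dir).as_posix(),
                    )
                )
    return pairs
```

Plain sub-directories between derivatives/ and a dataset do not trigger the check, so grouping pipeline outputs in folders stays allowed.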