[WIP] Suggested modifications to directory layout of the bids-study `DatasetType`

Open nikhil153 opened this issue 3 months ago • 12 comments

This is a continuation of discussion on BIDS DatasetType started in #1972 and #2185. The proposed DatasetTypes include: bids-study, raw, and derivative.

I think this is a fantastic idea with the capacity to curate modular and nested standardized datasets.

However, the directory layout for bids-study proposed in #2185 is suboptimal from the data visibility perspective. Currently, the bids-study does not have a root-level subdirectory for bids-raw dataset. Based on previous conversations it was suggested that bids-raw dataset can be stowed inside sourcedata dir. This hidden location for a valid and probably the most common DatasetType is probably not ideal and will confuse new users.

So, I would like to make a case for treating sourcedata, bids-raw, and bids-derivative with equal importance by putting them on the same root-level inside the bids-study directory tree. In my experience this is more intuitive and helpful for data management of neuroimaging studies, where potentially different people will handle these three DatasetTypes.

The suggested bids- prefix for these directories is mostly to avoid possible confusion between source vs raw data connotations from past discussions and to indicate each of them can be stand-alone BIDS datasets.

Happy to hear your thoughts @michellewang, @jbpoline, @yarikoptic, @effigies, @mathdugre, @julia-pfarr, @nburgos, @AliceJoubert, @Adam-Ismaili-92, @surchs

nikhil153 avatar Sep 02 '25 02:09 nikhil153

This makes a lot of sense to me, and would help everyone differentiate source-data from raw-bids much more easily.

jbpoline avatar Sep 03 '25 00:09 jbpoline

This solution feels much clearer, I think it's a good idea !

AliceJoubert avatar Sep 04 '25 15:09 AliceJoubert

@effigies - Thanks for the feedback and patiently dealing with these multiple intertwined open issues and hurried PR from my side - a result of mixed-up online and offline conversations. Will respond individually to the inline comments.

nikhil153 avatar Sep 09 '25 19:09 nikhil153

re-posting my comment from bids-specification/bids-website#688 here again since it was misplaced in that PR:

I kind of don't understand why we need to serve all these different cases of where to store what?

Why can't we just say

  • root/sourcedata --> out of scanner data
  • root/raw --> raw data in BIDS
  • root/derivatives --> anything starting with preprocessing, in BIDS
  • If you have only one of those three, adding a second level folder is not necessary.

julia-pfarr avatar Oct 08 '25 17:10 julia-pfarr

There is a lot to discuss and unpack. And I do like the clarity of sourcedata (non-BIDS), raw (BIDS), derivatives (BIDS). But then questions immediately come up, in particular taking into account that a "study" BIDS dataset is just a "base" case of any BIDS dataset, just without direct data (sub-*) in it. Questions:

  • where do we place non-BIDS derivatives?
  • where, for a BIDS derivative dataset, do we place BIDS input datasets used to produce it
    • when they are "raw BIDS"
    • when they are "derivative BIDS"
    • when they are a mix of both

My point is that, at the moment, the sourcedata/ and derivatives/ separation, and the lack of further formalization under them, is meant to foster flexibility. That is, any BIDS dataset can place relevant source data (from the scanner or from other BIDS datasets, BIDS raw or not) under sourcedata/, and then store the derivatives it produces (some of which might be BIDS) under derivatives/, whether they were produced from the sourcedata/ contents (in the case of "study") and/or from the current BIDS dataset itself (if not "study").
sourcedata/raw within study is just a suggestion, since derivatives/ is for "derived from BIDS datasets", hence "raw BIDS" cannot be there by definition.

Going back to items

  • 1: it does not need to be all or nothing: "study" DatasetType just follows what BIDS defines already for "raw" and "derivative" types -- basic folder structure, files and metadata at the root. As soon as we formalize sourcedata/ or derivatives/ more -- it would benefit from that too. Similarly BEP on stimuli tries to further formalize stimuli/ folder.
  • 2: sorry -- not following there, maybe partly due to my weak know-how on phenotype/
  • 3: what is StudyType -- we do not have that in bids yet I think?

yarikoptic avatar Oct 08 '25 21:10 yarikoptic

For the sake of moving forward with this PR, let’s only review and approve/disapprove of the changes made by @nikhil153 and move this bigger discussion to a dedicated meeting (maybe at distribits?). Wdyt @yarikoptic and @effigies, could you do a final review on Nikhil's changes?


Still, I'll post my answers here, to use this PR as the starting point for following discussions (and that I don't forget what I wanted to say...):

  1. I actually did not mean an "all or nothing" approach; my thoughts were very much like yours. At the very base, I assume bids-study means I should follow BIDS wherever possible. Meaning that if I have data in the derivatives folder, it should follow a BIDS spec IF there is one. If there is none, then it’s not in BIDS, but that’s fine because we don’t have a spec for this (yet). So derivatives can hold both non-BIDS and BIDS derivatives. My point here was more about whether we need a bids- prefix for the folders or not. So "it is BIDS" is the umbrella assumption for everything that is in the study, but not a hard requirement.

  2. This was just me voting for having a /phenotype folder at root level. Since the current phenotype BEP does not make any layout suggestions for /phenotype within bids-study, I feel we have some freedom here to decide where the /phenotype folder should/could be. I did not see anything in the current version of the phenotype BEP that would be violated by having /phenotype at root level in the bids-study layout.

  3. No, we don’t have that yet. We can talk about this another time and separate this discussion from this one!

Your other questions:

where do we place non-BIDS derivatives?

This I addressed in my reply to 1.

where, for a BIDS derivative dataset, do we place BIDS input datasets used to produce it
  • when they are "raw BIDS"
  • when they are "derivative BIDS"
  • when they are a mix of both

This is a point where you and I (and probably BIDS and I) don’t use the same meaning for sourcedata.

  • Sourcedata for me is any data in its barest form, like right after its "birth". And only this.
  • Raw for me is non-processed data but data that is in a standard and interoperable format.
  • Derivative for me is everything that was processed, no matter if it was processed directly from sourcedata or raw or other derivatives.

For me it does not make a difference whether a dataset was used as an input: it is NOT sourcedata if it is in the raw or derivatives form. Whereas you use sourcedata for anything that was an input for something, regardless of its form.

So, for me, the answers would be:

  • "where do we place BIDS input datasets used to produce it when they are 'raw BIDS'" --> in /raw
  • "where do we place BIDS input datasets used to produce it when they are 'derivative BIDS'" --> in /derivatives
  • "where do we place BIDS input datasets used to produce it when they are a mix of both" --> separated in /raw and /derivatives.

We do have the Sources metadata for the derivatives, so it shouldn’t be an issue if inputs are in different places.

julia-pfarr avatar Oct 15 '25 19:10 julia-pfarr

My 2 cents: the very flexible specification comes at the cost of losing useful information in the standard. I would certainly make "raw"/"rawbids" the only (first?) place to search for a BIDS-validated dataset; that would be useful for the tools and for mental clarity: source data == not BIDS, raw/rawbids == BIDS

jbpoline avatar Oct 21 '25 09:10 jbpoline

I like the idea, obviously, but I wonder how this conflicts with the existing structure, i.e., how the validator behaves and whether 'we' want to differentiate cases. Concretely,

MyData/
      |- source
      |- derivatives
      |- sub-01

vs

MyStudy/
       |- bids-source
       |- bids-derivatives
       |- bids-raw
                  |- sub-01

  • source and bids-source are ignored
  • derivatives and bids-derivatives are validated for whatever parts follow BIDS
  • MyData (a BIDS dataset) is validated the same as bids-raw, except for modality-agnostic files
  • MyStudy has an extra validation(?), and modality-agnostic files are at this level
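
To make the distinction concrete, here is a minimal sketch of how a tool might tell the two layouts apart. It assumes the `bids-raw` directory name used in the tree above (a proposal, not released BIDS), and `classify_layout` is a hypothetical helper, not part of any existing validator.

```python
from pathlib import Path


def classify_layout(root: Path) -> str:
    """Guess whether `root` looks like a plain dataset or a study container.

    Assumption: a plain dataset has sub-* directories at its root, while a
    study has none and instead holds a `bids-raw` directory (proposed name).
    """
    has_subjects = any(
        p.is_dir() and p.name.startswith("sub-") for p in root.iterdir()
    )
    if has_subjects:
        return "dataset"
    if (root / "bids-raw").is_dir():
        return "study"
    return "unknown"
```

A validator could branch on the returned label to decide whether modality-agnostic files are expected at `root` or one level down, which is the open question raised above.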

CPernet avatar Oct 24 '25 08:10 CPernet

@yarikoptic @effigies how is it going with this PR?

julia-pfarr avatar Nov 05 '25 16:11 julia-pfarr

Regarding phenotype, pulling it out of the primary dataset seems to run counter to BEP36, which aims to put additional constraints on phenotype. Because many of those rules hinge on relationships between phenotype files and root-level participants.tsv/sessions.tsv, I think this would relax those new rules almost entirely.

I also have a pair of questions:

  1. In each of sourcedata/, rawdata/ and derivatives/, are the contents of each of these directories supposed to be a dataset, or are they a collection of datasets, or are they potentially an arbitrary directory structure with datasets appearing somewhere within? Just to illustrate my meaning:

    study/
      sourcedata/
        dicoms/ <- second level
        physio/  <- second level
      rawdata/
        dataset_description.json <- first level
      derivatives/
        subdirA/
          subdirB/
            dataset_description.json <- third level
        subdirC/
          subdirD/
            subdirE/
              dataset_description.json <- fourth level
    
  2. Is a validator expected to find BIDS datasets in any/all of these subdirectories and validate them, or do these all remain opaque, and it's the responsibility of the curator/archive to validate subdatasets?

The simple answers: 1) The contents are arbitrary (at least as far as this PR is concerned); 2) No (at least as far as this PR is concerned).

You could go a bit further and say that rawdata/ is expected to be a BIDS dataset, which is a very straightforward directive for a toolwriter to follow (see study, look in rawdata/), but that then rules out the possibility of multiple raw datasets.

I realize these aren't problems Nikhil is introducing, but up to now, there has been a validator behavior for sourcedata/ and derivatives/: ignore their contents. When re-introducing rawdata/, it's an opportunity to think through whether BIDS has anything to say inside these directories.
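
For illustration, a rough sketch of what "finding datasets at any level" could look like if a tool chose to do it. It assumes a dataset is identified purely by the presence of `dataset_description.json`, and counts "level" as in the tree above (number of directories below the study root); `find_datasets` is a hypothetical name.

```python
from pathlib import Path


def find_datasets(study_root: Path) -> list[tuple[str, int]]:
    """Return (relative directory, level) for each dataset below study_root.

    Assumption: any directory holding a dataset_description.json counts as
    a dataset. The study's own description at the root is skipped.
    """
    found = []
    for desc in study_root.rglob("dataset_description.json"):
        rel = desc.parent.relative_to(study_root)
        if rel.parts:  # empty parts means the study root itself
            found.append((rel.as_posix(), len(rel.parts)))
    return sorted(found)
```

Run against the example tree, this would report `rawdata` at level 1 and the two derivative datasets at levels 3 and 4. Whether a validator should descend like this, or treat these directories as opaque, is exactly the question above.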

effigies avatar Nov 06 '25 21:11 effigies

@effigies, Re: phenotype, if that's the case, then wouldn't it be an issue for the derivative DatasetType as well? Happy to discuss the implication in the BEP36 thread, as I am not sure if it currently considers the newer DatasetTypes (i.e. derivative and study).

Re: questions, my view would be:

  • sourcedata: arbitrary
  • rawdata: a single BIDS raw DatasetType (i.e. no collection of datasets)
  • derivatives: a non-nested (i.e. no derivatives within derivatives) collection of BIDS derivative DatasetTypes

I think allowing multiple raw datasets would create several complications that are better discussed at a later time, possibly under BIDS-MEGA BEP / BIDS2.0? For the derivatives directory, allowing a nested sub-directory structure as in your example is okay, as long as there is no nesting of datasets themselves, i.e. no nesting of dataset_description.json files. Hopefully, this would allow enough flexibility for pipeline outputs while avoiding recursion and keeping the validator logic relatively simple.
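
As a sketch of how simple that check could stay: the snippet below flags any dataset directory under derivatives/ that contains another dataset directory. It assumes a dataset is marked only by `dataset_description.json`; `nested_datasets` is a hypothetical helper, not existing validator code.

```python
from pathlib import Path


def nested_datasets(derivatives_dir: Path) -> list[tuple[str, str]]:
    """Return (outer, inner) pairs where one dataset directory contains another.

    Assumption: a directory is a dataset if it holds dataset_description.json.
    Plain sub-directories without that file do not count as nesting.
    """
    roots = sorted(
        p.parent for p in derivatives_dir.rglob("dataset_description.json")
    )
    pairs = []
    for outer in roots:
        for inner in roots:
            if inner != outer and outer in inner.parents:
                pairs.append((
                    outer.relative_to(derivatives_dir).as_posix(),
                    inner.relative_to(derivatives_dir).as_posix(),
                ))
    return pairs
```

An empty result means the layout satisfies the "no derivatives within derivatives" constraint, while arbitrary intermediate folders such as subdirA/ remain allowed.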

nikhil153 avatar Nov 27 '25 16:11 nikhil153