bids-specification
[ENH] Add "study" DatasetType to organize a collection of source and derivative datasets
edit: this was formerly "project" but was renamed to "study" for better alignment with existing terminology.
This PR was initially submitted as #1861, but I made the mistake of combining it with a discussion of how to transform existing projects' layouts into such a BIDS project dataset. Please refer to that PR for examples, but otherwise let's concentrate here on the discussion of this specific proposed change.
- Rationale 1 (major): the BIDS standard already provides a reasonable structure for formalizing the organization of the various components of a neuroscientific data project: where to place code, original (source) data, derivative data, README, CHANGES. Many projects (e.g. nipoppy, YODA, etc.) propose similar, and often likely "inspired", templates. If we explicitly allow the BIDS standard to prescribe study-level organization, IMHO it would help many people and projects decide how to organize their studies/projects.
- See #1861 for some examples and initial discussions of possible "transformations". But this PR has merit regardless of whether individual existing projects adopt it partially or fully.
- Rationale 2: IMHO the BIDS standard should describe only what the standard prescribes, and not recommend some potential "non-standardized" layouts. That is why I "reworked" that example into a legitimate BIDS dataset merely by adding a dataset_description.json.
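For concreteness, a minimal such dataset_description.json could be produced as in this sketch (the "Name" and "BIDSVersion" values are illustrative placeholders, and the "study" DatasetType is what this PR proposes rather than something already in the released specification):

```python
import json

# Minimal, hypothetical dataset_description.json for a top-level "study"
# dataset as proposed in this PR. "Name" and "BIDSVersion" are placeholder
# values, not taken from the PR itself.
description = {
    "Name": "Example study",
    "BIDSVersion": "1.10.0",
    "DatasetType": "study",
}

print(json.dumps(description, indent=2))
```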
TODOs:
- [x] provided more relation to existing approaches (mostly by @snastase in #1861)
- [x] crafted example for bids-examples : https://github.com/bids-standard/bids-examples/pull/451
- [x] ensure bids-validator with modified schema passes its validation
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 82.71%. Comparing base (5a357dd) to head (1fe00d9).
:warning: Report is 143 commits behind head on master.
Additional details and impacted files
```
@@           Coverage Diff           @@
##           master    #1972   +/-   ##
=======================================
  Coverage   82.71%   82.71%
=======================================
  Files          20       20
  Lines        1608     1608
=======================================
  Hits         1330     1330
  Misses        278      278
```
:umbrella: View full report in Codecov by Sentry.
FWIW, @effigies and I conversed, and he brought up an interesting argument, although IMHO one not contradicting this proposal per se: ATM any BIDS dataset (raw or derivative) which already contains some subdatasets under derivatives/ could be considered a "project" BIDS dataset. That is indeed true. But IMHO it is just a legacy side-effect of the BIDS recommendation to put derivatives under derivatives/, which somewhat invalidates the claim that the BIDS dataset as a whole is a "raw" dataset.
edit: related, linked below, is #2103 highlighting the same situation with a "raw" dataset containing derivatives/
@effigies I wonder if we should extend DatasetType to become a list, or replace or complement it with a DatasetTypes field, which could then indeed be just "project" when containing other derivative BIDS datasets or sourcedata, or ["raw", "project"] when also a raw dataset and a project, or ["derivative", "project"] when derivative but pointing to the original "raw" dataset in sourcedata/ ...?
I'm skeptical of that need. I would expect your ["raw", "project"] to look like:
```
project/
    dataset_description.json  # "DatasetType": "project"
    rawdata/
        dataset_description.json  # "DatasetType": "raw"
    derivatives/
        preproc/
            dataset_description.json  # "DatasetType": "derivative"
```
And ["derivative", "project"]:
```
project/
    dataset_description.json  # "DatasetType": "project"
    sourcedata/
        dataset_description.json  # "DatasetType": "raw"
    rawdata/
        dataset_description.json  # "DatasetType": "derivative"
    deriv/
        analysis/
            dataset_description.json  # "DatasetType": "derivative"
```
Subdatasets should be validatable BIDS datasets in their own right, avoiding the need for a top-level dataset_description.json to modify how they are intended to be validated.
I think this overall needs more specification. What are valid directories in a project-type dataset? We should add them to https://github.com/bids-standard/bids-specification/blob/master/src/schema/rules/directories.yaml.
I think a project dataset is barely worth specifying if we don't validate at least the raw data subdataset. Possibly we should have rules for indicating where validators should look for subdatasets.
In OpenNeuroDerivatives, we use sourcedata/* as a mixture of BIDS and non-BIDS (FreeSurfer) inputs. Nipoppy uses derivatives/<pipeline>/<version>/<output> as derivative datasets; presumably this could also be a mixture of BIDS and non-BIDS.
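A discovery helper following the locations mentioned above could look like this sketch (the `find_subdatasets` name and the search depths are assumptions for illustration, not anything the specification or the validator defines):

```python
from pathlib import Path

def find_subdatasets(root):
    """Find nested BIDS subdatasets under a hypothetical study/project
    dataset by scanning the locations discussed above. The search depths
    are assumptions, not rules from the specification."""
    root = Path(root)
    candidates = []
    # sourcedata/* may mix BIDS and non-BIDS (e.g. FreeSurfer) inputs
    candidates.extend(root.glob("sourcedata/*"))
    # plain derivatives/* or Nipoppy-style derivatives/<pipeline>/<version>/<output>
    candidates.extend(root.glob("derivatives/*"))
    candidates.extend(root.glob("derivatives/*/*/*"))
    # only directories carrying a dataset_description.json count as BIDS subdatasets
    return sorted(
        p for p in candidates
        if p.is_dir() and (p / "dataset_description.json").is_file()
    )
```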
Yet to "process", but a quick side idea inspired by #1928: I wonder if there is a hierarchy here: project (everything common) -> raw (current default; requires having sub- folder(s)) -> derivative (more stuff could be added), where every next level adds capabilities but includes all of the prior one, since a derivative could include raw data within it? Or do we already have something which invalidates that?
> I think this overall needs more specification. What are valid directories in a project-type dataset? We should add them to https://github.com/bids-standard/bids-specification/blob/master/src/schema/rules/directories.yaml.
They are already there; that's somewhat the point here -- we are already defining the structure of all those folders, so there is nothing new to add.
> I think a project dataset is barely worth specifying if we don't validate at least the raw data subdataset. Possibly we should have rules for indicating where validators should look for subdatasets.
> In OpenNeuroDerivatives, we use sourcedata/* as a mixture of BIDS and non-BIDS (FreeSurfer) inputs. Nipoppy uses derivatives/<pipeline>/<version>/<output> as derivative datasets; presumably this could also be a mixture of BIDS and non-BIDS.
I think the presence of subdatasets is not really the differentiator here, and formalizing rules for their validation is orthogonal to this issue. Keeping in mind my prior observation that "raw" is pretty much "a project with data in sub-* folders", we might be circling back to the issue of requiring sub- folders in "raw" BIDS datasets? ;-) Otherwise they are just "BIDS project" datasets, which could contain sourcedata/ and derivatives/ subdatasets, logs/, code/, etc., but no sub-* subfolders with actual data.
It could even become quite nice: we would make it easy for people to start their "raw BIDS datasets" as "project BIDS datasets", where they do their planning (README, code/, etc.), until they start populating them with data, at which point they become raw.
@effigies Following your idea, I have now added a "warning" (to match the level of the analogous SubjectFolders check in "raw" BIDS). I guess, in principle, we could take this as an opportunity to revert the SubjectFolders check back to "error" and demand making such datasets "project" type, and then make it an "error" in the project type as well. WDYT?
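For illustration, a severity-dependent check along these lines might look like the sketch below. The `check_subject_folders` helper is hypothetical; the real bids-validator expresses such rules declaratively in its schema, and the set of accepted DatasetType values here is an assumption:

```python
from pathlib import Path

def check_subject_folders(dataset_root, dataset_type):
    """Hypothetical sketch of a SubjectFolders-style check whose severity
    depends on DatasetType. The actual bids-validator encodes this in its
    schema rules, not in ad-hoc Python like this."""
    root = Path(dataset_root)
    has_subjects = any(
        p.is_dir() and p.name.startswith("sub-") for p in root.iterdir()
    )
    if has_subjects or dataset_type in ("study", "project"):
        return None  # no issue: data present, or a study dataset needs none
    # As discussed above, this could be escalated back to "error" once a
    # "study"/"project" DatasetType exists as an escape hatch.
    return ("warning", "No subject (sub-*) folders found in dataset")
```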
While discussing with @jbpoline, we wondered whether "study" might be a better descriptor to use here than "project". One rationale is that, for example, BEP035 (attn @bids-standard/bep035) on mega-analysis introduces a study- entity as a grouping element. It would then match natively.
We also mention "study" in various places in BIDS, which seems to align nicely here:
```
❯ git grep study
src/CHANGES.md:- \[FIX] update physio bids name in longitudinal study page examples [#863](https://github.com/bids-standard/bids-specification/pull/863) ([Remi-Gau](https://github.com/Remi-Gau))
src/appendices/coordinate-systems.md:The following template identifiers are RECOMMENDED for individual- and study-specific reference
src/appendices/coordinate-systems.md:In the case of multiple study templates, additional names may need to be defined.
src/appendices/coordinate-systems.md:| study | Custom space defined using a group/study-specific template. This coordinate system requires specifying an additional file to be fully defined. |
src/appendices/hed.md:numerical values that are similar across the recordings in the study.
src/appendices/hed.md:repository on GitHub should be used to validate the study event annotations.
src/common-principles.md: unless when appropriate given the study goals, for example, when scanning babies.
src/introduction.md:> The data used in the study were organized using the
src/modality-specific-files/genetic-descriptor.md: "Dataset": "https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001364.v1.p1",
src/modality-specific-files/intracranial-electroencephalography.md:Note that the date and time information SHOULD be stored in the study key file
src/modality-specific-files/magnetic-resonance-spectroscopy.md:acquisition parameters in filenames is helpful or necessary to distinguish datasets in a given study.
src/modality-specific-files/motion.md:Note that the onsets of the recordings SHOULD be stored in the study key file [(`scans.tsv`)](../modality-agnostic-files.md#scans-file).
src/modality-specific-files/positron-emission-tomography.md:This entity is OPTIONAL if only one tracer is used in the study,
src/modality-specific-files/task-events.md:Please mind that this does not imply that only so called "event related" study designs
src/schema/objects/common_principles.yaml: A set of neuroimaging and behavioral data acquired for a purpose of a particular study.
src/schema/objects/common_principles.yaml: Session can (but doesn't have to) be synonymous to a visit in a longitudinal study.
src/schema/objects/common_principles.yaml: A person or animal participating in the study.
src/schema/objects/entities.yaml: For example, this should be used when a study includes two T1w images -
src/schema/objects/entities.yaml: Session can (but doesn't have to) be synonymous to a visit in a longitudinal study.
src/schema/objects/entities.yaml: A person or animal participating in the study.
src/schema/objects/enums.yaml:study:
src/schema/objects/enums.yaml: value: study
src/schema/objects/enums.yaml: display_name: study
src/schema/objects/enums.yaml: Custom space defined using a group/study-specific template.
src/schema/objects/metadata.yaml: Reference to the study/studies on which the implementation is based.
src/schema/objects/metadata.yaml: The version of the HED schema used to validate HED tags for study.
tools/schemacode/src/bidsschematools/tests/data/broken_dataset_description.json:"EthicsApprovals": ["The original study from which this BIDS example dataset was derived was approved by the Ethics committee of Ghent University Hospital with identifier EC 2017/1103."]
```
Mentions of "project", on the other hand, are not particularly aligned. So, I think we should just make it "study" and rename accordingly.
A functional question for the validator: Should we error if there are no valid BIDS datasets in sourcedata/?
cc @rwblair @nellh
Sorry, I do not see how this relates to this PR, since having "valid BIDS datasets in sourcedata/" seems to indicate a "derivative BIDS" dataset by its definition.
edit: the point is that a "study" dataset could even be empty to start with, then start collecting various other sourcedata/, and eventually get a sourcedata/ raw BIDS dataset. I think it should remain valid throughout that life cycle.
@effigies who, among maintainers, do you think might also be interested to review this PR?
But what are you validating if there are no valid subdatasets? Why are you running the validator?
dang... I guess I mixed up this PR with something else since I thought that we have merged it!
> But what are you validating if there are no valid subdatasets? Why are you running the validator?
I did some editing above; maybe it came in while you were writing your question?
> edit: the point is that a "study" dataset could even be empty to start with, then start collecting various other sourcedata/, and eventually get a sourcedata/ raw BIDS dataset. I think it should remain valid throughout that life cycle.
So the point is that you would validate what is available -- that there are no sub- folders, that dataset_description.json is proper with the correct DatasetType, etc. Later I will adjust
- https://github.com/bids-standard/bids-examples/pull/451
to get some examples.
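As a sketch of "validate what is available", a minimal lifecycle-tolerant check could look like this (the `validate_study_skeleton` helper is hypothetical, and the set of accepted DatasetType values is an assumption, not spec text):

```python
import json
from pathlib import Path

def validate_study_skeleton(root):
    """Sketch of what a validator could check for a proposed "study"
    dataset even before any data exist: only that dataset_description.json
    is present, parses, and declares a known DatasetType. Not the actual
    bids-validator logic."""
    root = Path(root)
    desc_path = root / "dataset_description.json"
    if not desc_path.is_file():
        return ["missing dataset_description.json"]
    try:
        desc = json.loads(desc_path.read_text())
    except json.JSONDecodeError:
        return ["dataset_description.json is not valid JSON"]
    issues = []
    # accepted values are an assumption; "study" is what this PR proposes
    if desc.get("DatasetType") not in ("raw", "derivative", "study"):
        issues.append("unknown DatasetType: %r" % desc.get("DatasetType"))
    return issues
```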
@effigies could you invite or nominate a few other PR reviewers who might potentially be interested? ;-)
Hi @yarikoptic @effigies, I am new to the BIDS release process. What are the next steps to get this merged and released as part of the latest BIDS spec? Thanks.
2 maintainer + 1 (associated) contributor approvals so far. How many more to request/wait for? ;)
One trivial click for a maintainer, one giant leap for the BIDSkind!
Thank you @julia-pfarr ! ;-)