Defining Criteria for Data Types and Modalities in BIDS

Open neuromechanist opened this issue 11 months ago • 21 comments

Background

The Brain Imaging Data Structure (BIDS) specification currently distinguishes between data types (represented as subdirectories under each subject) and modalities (represented as file suffixes). However, there appears to be inconsistency in how these distinctions are made across different kinds of data.

Some examples of the current state:

  • anat is a data type for MRI anatomical recordings, with T1w, T2w, etc., as modalities
  • eeg, ieeg, and meg are separate data types and modalities for different neural recording methods
  • motion exists as its own data type and modality
  • physio exists as a modality but not yet as a data type (being addressed in BEP045, #1675)
  • Eye tracking is being added as a recording type under the physio modality (BEP020, #1128), with standalone eye tracking placed under the beh data type
  • An emg data type and modality are being proposed (BEP042, #1998)
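
To make the distinction concrete, here is a minimal Python sketch of how a tool might read the data type (the directory directly containing a file) and the suffix (the last underscore-separated token before the extension) from a BIDS-style path. The helper name `split_bids_path` and the example paths are illustrative assumptions, not part of the specification or of any BIDS library.

```python
from pathlib import PurePosixPath


def split_bids_path(path):
    """Return (datatype, suffix) for a BIDS-style relative path."""
    p = PurePosixPath(path)
    datatype = p.parent.name          # directory directly containing the file
    stem = p.name.split(".", 1)[0]    # drop every extension (.nii.gz, .tsv.gz, ...)
    suffix = stem.rsplit("_", 1)[-1]  # last underscore-separated token
    return datatype, suffix


# Hypothetical example paths, one per case listed above.
examples = [
    "sub-01/anat/sub-01_T1w.nii.gz",
    "sub-01/eeg/sub-01_task-rest_eeg.edf",
    "sub-01/motion/sub-01_task-walk_tracksys-imu_motion.tsv",
    "sub-01/func/sub-01_task-rest_physio.tsv.gz",
]
for path in examples:
    print(path, "->", split_bids_path(path))
```

Read this way, eeg and motion show up as both data type and suffix, while physio shows up only as a suffix inside another data type's folder.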

Current Discussion

There is an ongoing discussion under #1998 about whether certain data should:

  1. Have their own dedicated data type and modality
  2. Be incorporated under an existing umbrella data type
  3. Be embedded within other modalities when appropriate

The discussion began over whether EMG should:

  • Have its own data type and modality (similar to EEG/MEG/iEEG)
  • Be incorporated under the physio modality (similar to eye tracking in BEP020)

However, I believe that the scope of the issue is larger than EMG, and I would appreciate the community's input.

Key Considerations

When determining whether a data category deserves its own data type/modality or should be incorporated under an existing umbrella, the following factors have been raised:

  1. Signal source and nature: Is the signal brain-derived vs. peripheral? Neural vs. non-neural?

  2. Data dimensionality and complexity: Does the data have unique requirements in terms of channel count, sampling rate, or format that make existing structures insufficient?

  3. Research usage patterns: Is the data commonly used as a standalone dataset, or primarily as an auxiliary measurement to other data types?

  4. Technical requirements: Does the data require specific metadata fields, coordinate systems, or other specifications that don't align with existing structures?

  5. Community needs: Is there sufficient research activity and community interest to warrant a dedicated structure?

  6. Fragmentation concerns: Does creating a new data type/modality risk fragmenting the BIDS ecosystem unnecessarily?

  7. Consistency with existing structures: How would the decision align with precedents set by other data types?

The case of EMG-BIDS

For EMG data (BEP042), arguments for a dedicated data type include:

  • EMG data is often high-dimensional (>200 channels with 2+ kHz sampling)
  • EMG can target multiple muscles and requires specific placement information
  • EMG can directly derive/estimate neural discharges
  • There is significant standalone EMG research
  • Multiple EMG devices can record simultaneously
  • EMG closely follows other electrophysiology data types (EEG/iEEG/MEG) and current research closely relates the signals to neural activity.
  • Motion-BIDS is a standalone data type and modality suggesting that not all BIDS datatypes and modalities should be "brain-related."

Arguments for incorporating EMG under physio:

  • Creating new modality suffixes can fragment BIDS
  • Similar physiological signals (e.g., eye-tracking) are managed under physio
  • The BEP020 approach with PhysioType field could accommodate EMG-specific metadata
  • Consistency with how other non-brain physiological recordings are handled (EKG, Eyetracking under physio)

Questions for the Community

  1. What should be the threshold criteria for creating a new data type vs. using an existing one?

  2. Should brain-derived signals be treated differently from other physiological signals? If yes, how does this differentiation apply to the current specifications, including Motion-BIDS and ongoing PRs?

  3. How should we balance the need for specificity against the risk of fragmentation?

  4. Should we establish a formal policy for what constitutes grounds for a new data type/modality?

  5. How can we ensure that similar types of data (e.g., various physiological recordings) are treated consistently across the specification?

  6. What is the threshold or recommendation for using a data-specific modality/recording versus embedding data under other modalities? For example, eye-tracking and EMG can be embedded under EEG as channels.

Next Steps

This discussion has implications beyond just EMG data and could affect how future data types are incorporated into BIDS. We could also consider whether:

  1. A formal policy document should be developed
  2. Existing data types should be reviewed for consistency
  3. A dedicated BEP should address this foundational question

Community input from researchers working with diverse data types, and stakeholders @bids-standard/steering, @bids-standard/maintainers, @bids-standard/raw-eyetracking, @bids-standard/bep042, @smoia, @m-miedema, @arnodelorme is essential to ensure BIDS remains both comprehensive and coherent.

neuromechanist avatar Apr 29 '25 15:04 neuromechanist

Hi Yahya,

This is something I also raised back in 2020, but from the perspective of the instruments used to record the data rather than the features of the data. See this google doc, with good comments from various people. Looking back at this and considering the EMG use-case that triggered it for you, I would say that EMG is recorded with a "biopotential amplifier".

Back then I did not manage to come up with an answer to the issues that were raised then and that you also raise here, but perhaps the arguments provided in the google doc might still be helpful.

best regards, Robert

robertoostenveld avatar Apr 29 '25 19:04 robertoostenveld

Hi @neuromechanist, following up on today's discussion in the maintainers meeting: we are going to draft an email with the most important points, pointing to the existing issue(s)/doc(s), to send out to the BIDS community for discussion. We believe this is an important first step to gather more substance for forming a decision in the near future. Hope this is in your interest!

julia-pfarr avatar May 15 '25 18:05 julia-pfarr

Thanks very much @julia-pfarr. Yes please. I agree that we might not be able to reconcile a years-long discussion in a short time period. But we have EMG-BIDS to help consider the issue with a concrete example, use cases, and consequences.

I tried to generalize the points from the EMG-BIDS discussion above to consider the larger picture. I think both ways (discussing the general case or just this special case) work well.

neuromechanist avatar May 15 '25 19:05 neuromechanist

This is a great start to considering this issue with a broader lens, thank you!

One of the main points we discussed in the case of physio data is that it can be unclear where to look for this data if it is not treated as a data type as well as a modality. For example, we can consider a dataset in which simultaneous EEG-fMRI were recorded, along with concurrent physiological data. Currently, as an auxiliary measure, we understand that the physiological data could be arbitrarily placed in either the eeg or func folder. We think that a good reason to treat physio as a data type is to reduce this ambiguity, but it would be good to understand if similar concerns are being raised in this or other discussions!
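
A small sketch of that ambiguity, assuming two hypothetical curators of the same simultaneous EEG-fMRI session: each layout below places the physio recording in a different folder, so a tool has to search every data type directory to find it. The file names and the `locate_suffix` helper are made up for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def locate_suffix(paths, suffix):
    """Group file names carrying `suffix` by the data type directory they sit in."""
    found = defaultdict(list)
    for path in paths:
        p = PurePosixPath(path)
        stem = p.name.split(".", 1)[0]         # strip all extensions
        if stem.rsplit("_", 1)[-1] == suffix:  # compare the trailing token
            found[p.parent.name].append(p.name)
    return dict(found)


# Hypothetical layouts of the same session from two curators.
curator_a = [
    "sub-01/eeg/sub-01_task-rest_eeg.edf",
    "sub-01/eeg/sub-01_task-rest_physio.tsv.gz",
    "sub-01/func/sub-01_task-rest_bold.nii.gz",
]
curator_b = [
    "sub-01/eeg/sub-01_task-rest_eeg.edf",
    "sub-01/func/sub-01_task-rest_bold.nii.gz",
    "sub-01/func/sub-01_task-rest_physio.tsv.gz",
]

print(locate_suffix(curator_a, "physio"))
print(locate_suffix(curator_b, "physio"))
```

With a dedicated physio data type, the lookup would reduce to checking a single, known directory.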

m-miedema avatar May 21 '25 01:05 m-miedema

We faced a similar issue in BEP024 regarding Computed Tomography (CT). We asked if CT data might fall under the anat data type, as it is anatomical data, or under a specific ct data type (we did not address the possibility of using the pet data type for PET/CT acquisitions, as we want to handle CT data acquired without PET). Since the anat data type is used for MRI data, we concluded that using a new data type would be more consistent, as CT is a different imaging modality. This means MRI is treated differently from other imaging modalities.

Having a policy document on existing data types and what to consider when adding a new data type is important, as has been mentioned.

Hboni avatar Jun 04 '25 08:06 Hboni

I love this discussion and discussion points. Though I am concerned a GitHub issue may not be the best place to explore the answers? I think if we want to get to the bottom of this without a huge GitHub issue discussion post (that makes it hard to read and follow along), we should consider something like a CryptPad survey or Google Form for the 6 major questions above, where all responses are public to catch all the context.

Then I think we (the BIDS maintainers) could host a more organized open meeting for the BIDS community to hear the results of the survey and discuss each point and proposed solutions one by one. Either that or perhaps bring the proposed solutions back to the community here on the https://github.com/bids-standard/bids-specification/discussions page? One discussion for each of the 6 key questions above.

Respond with a thumbs-up emoji here if you would prefer that method to try and resolve these key questions. Otherwise, of course, I am still open to continuing the discussion here if that is the community's preference.

ericearl avatar Jun 04 '25 13:06 ericearl

Dear Yahya,

I'm in charge of BEP020 on eye tracking (I hope we can finish it very soon). However, these answers reflect only my opinion and not necessarily that of the BEP020 group.

What should be the threshold criteria for creating a new data type vs. using an existing one?

That would be nice indeed. When we started BEP020, we first just looked at what had been done before and decided to create a new data type, before realizing that in our case it was neither necessary nor optimal.

Should brain-derived signals be treated differently from other physiological signals? If yes, how does this differentiation apply to the current specifications, including Motion-BIDS and ongoing PRs?

I think this is a tricky point: what is a brain-derived signal? In the case of eye tracking, the signal is the position of the eyes, but researchers believe that eye position and pupil are a proxy for brain states, and the eyes are themselves part of the brain... I'm not sure this is where the debate should go.

How should we balance the need for specificity against the risk of fragmentation?

Such a tricky question. Fragmentation will potentially be a very big burden in the long run; the opinion that motivated our decision not to create the new eyetrack entity was purely about simplicity. Also, eye tracking these days is a modality that is often an additional measure alongside fMRI, EEG, behavior, etc., which was another argument in favor of putting it with the "main" recording modality.

Should we establish a formal policy for what constitutes grounds for a new data type/modality?

Yes.

How can we ensure that similar types of data (e.g., various physiological recordings) are treated consistently across the specification? What is the threshold or recommendation for using a data-specific modality/recording versus embedding data under other modalities? For example, eye-tracking and EMG can be embedded under EEG as channels.

These are specific questions I will not try to answer at this point.

In general, my opinion is that as long as something can be done with a minimum of changes (no new data type), that is the way to go. Note that eye tracking is specific. I feel that the choice we made was the right one. At first I wasn't convinced, as I felt it somewhat reduced the impact of the BEP, but in the end what matters is to obtain a standard for open science.

Best,

mszinte avatar Jun 04 '25 14:06 mszinte

we should consider something like a CryptPad survey or Google Form for the 6 major questions above, where all responses are public to catch all the context.

Either that or perhaps bring the proposed solutions back to the community here on the https://github.com/bids-standard/bids-specification/discussions page?

+1. Reaching out to the community with the questions as a Google Form etc. surely adds to the visibility. The downside is that people can't read each other's opinions, but I think we can link the discussions to this issue, so that everyone can follow up and contribute.

Still, everyone is welcome to share their views here if they don't want to wait for the Google Form to come in.

neuromechanist avatar Jun 04 '25 15:06 neuromechanist

I, alongside @wtclarke and @martin3141, worked on BEP022 for MRS. It was clear to us from the outset that MRS, despite being a nuclear magnetic resonance technique like MRI, should be its own data type in BIDS. Discussions centered more on modalities, specifically whether single-voxel MRS and MRS imaging should have separate suffixes. Ultimately, we decided that they should, given that single-voxel MRS and MRSI are normally utilized as standalone MRS modalities in experiments.

1. What should be the threshold criteria for creating a new data type vs. using an existing one?

If a method is sufficiently used and reported on in the literature to be considered a standalone biophysical data collection technique. For example, if an entire subdiscipline has grown out of this method; there are specialist journals, textbooks, and scientific meetings focused on that method; one could reasonably call oneself an "x" scientist when speaking in lay terms, e.g., "I'm an MRI scientist" as opposed to "I'm a diffusion MRI scientist" (though the latter does make sense if you're in MR research); etc. (These are just examples and not meant to be firm criteria.)

2. Should brain-derived signals be treated differently from other physiological signals? If yes, how does this differentiation apply to the current specifications, including Motion-BIDS and ongoing PRs?

I'm not entirely sure what this question is getting at. I'm assuming they will be because the methods to detect them are technologically distinct?

3. How should we balance the need for specificity against the risk of fragmentation?

This one's tricky. I wonder if user demand will determine the balance. That is, a method's users would make the case that it should be its own data type in BIDS by demonstrating why an affiliated method's BIDS specification does not share important characteristics of their method's data and metadata.

4. Should we establish a formal policy for what constitutes grounds for a new data type/modality?

Yes.

5. How can we ensure that similar types of data (e.g., various physiological recordings) are treated consistently across the specification?

I'd say @effigies and I went back and forth a number of times on just this for MRI and MRS. I think the review of BEPs from those outside the BEP team handles this well enough. But it's essential that an expert user of a related modality, who may not necessarily be a BIDS maintainer, also reviews the BEP.

6. What is the threshold or recommendation for using a data-specific modality/recording versus embedding data under other modalities? For example, eye-tracking and EMG can be embedded under EEG as channels.

Tricky. I think my answer to this follows my answer to question 1.

markmikkelsen avatar Jun 06 '25 12:06 markmikkelsen

I suggest that a data type in BIDS should warrant its own folder if it can be distributed and meaningfully interpreted independently of other modalities. By this criterion, both EMG and eye tracking merit their own folders. Using two mechanisms—sometimes assigning standalone folders, other times embedding as modalities—introduces inconsistency and complicates data reuse.

A key challenge remains synchronized acquisition across modalities. Without a formalized BIDS mechanism to declare temporal alignment between folders, it’s difficult to ensure data integrity when modalities are split. A potential solution is to define a standard for synchronization markers or events shared across modalities, allowing cross-referencing in a principled way as for BIDS motion capture. Formalizing such criteria and synchronization strategies would enhance clarity, consistency, and interoperability within BIDS.

arnodelorme avatar Jun 06 '25 14:06 arnodelorme

@arnodelorme @mszinte @markmikkelsen @m-miedema and others who will contribute to this issue until next week:

Thank you all for sharing your opinions on this; it is incredibly helpful! We will also discuss this issue at the upcoming BIDS maintainer meeting next week. The discussion will happen on Tuesday, June 10th at 15:30h CET.

If you'd like to join, please sign up in this form and I'll send you the meeting link.

julia-pfarr avatar Jun 06 '25 16:06 julia-pfarr

Thanks @neuromechanist for bringing up this discussion. I'll follow your proposed framework to offer my 2ct. That said, I think the framework is well defined for data types (folders under sub- or ses-) and insufficient for modalities (suffixes).

  1. Signal source and nature: Is the signal brain-derived vs. peripheral? Neural vs. non-neural?

IMHO, this is one important factor to consider for the data type: for example, it's confusing to me that motion has its own data type. Within the brain imaging data structure I would expect brain data types. As @mszinte mentioned above, from an ontological point of view this is pretty hard; however, as in other contexts, we just need a definition that works for BIDS. This discussion feels similar to the deep debates we are having in BEP038 about what atlases and templates are or aren't. Ultimately, we must be practical and ensure that these concepts are clear and operational within the BIDS specifications. Unfortunately, this is a repeated pattern across BEPs: they make the humongous effort of establishing authoritative ontologies about the data, when we should be extremely utilitarian and deliver something that works, even if some people would argue, e.g., that those are "brain" data despite BIDS not considering them directly derived from the brain (or vice-versa).

I don't see a reason to differentiate between neural and non-neural, but I understand that could provide a basis for a protocol for when a new suffix should be added.

  2. Data dimensionality and complexity: Does the data have unique requirements in terms of channel count, sampling rate, or format that make existing structures insufficient?

I think this is fundamentally a problem for the format, not the structure. Suffixes can have multiple extensions, which is what ends up hinting at the format. This item should not be considered in this discussion, but rather in a separate discussion about how to decide on formats.

  3. Research usage patterns: Is the data commonly used as a standalone dataset, or primarily as an auxiliary measurement to other data types?

I don't think this applies much. But if the data is non-brain and commonly used as a standalone dataset, my thinking is BIDS should not try to organize that.

  4. Technical requirements: Does the data require specific metadata fields, coordinate systems, or other specifications that don't align with existing structures?

Like formats, this is not a factor for data type. Arguably it is for suffix, yet BEP020 demonstrates that you can keep BIDS' current principles, modalities, and suffixes to encode very metadata-rich data such as eye tracking.

  5. Community needs: Is there sufficient research activity and community interest to warrant a dedicated structure?

Not sure how this is going to operationally be factored in. How do you measure community need / research activity / community interest? And even if you could measure/operationalize this, what is the heuristic: more than 1000 papers a year justify a new data type or suffix or both?

  6. Fragmentation concerns: Does creating a new data type/modality risk fragmenting the BIDS ecosystem unnecessarily?

This is key, and I introduced this concern in the original discussion. With the exception of the proposed data type physio/, most of the BEPs proposing new data types are also proposing new suffixes. To me, that's a symptom that the roles of data types and suffixes in BIDS are unclear (or, in the most optimistic scenario, unclear to many specific BEP leads).

In the specific case of EMG, both are proposed. In the absence of physio/, I can see why, if EMG is very rarely collected simultaneously with a brain imaging modality AND assuming these data do not qualify as beh/, a new modality is proposed. Now, moving on to the suffix: as @robertoostenveld mentions above, EMG could qualify as a "biopotential amplifier" signal, so I would understand a new suffix to represent all of those (however, I have the feeling that would not actually land well with any community). This raises the question of why _physio is not an appropriate suffix for these signals. BEP020 demonstrates that eye-tracking can be completely represented data- and metadata-wise with current BIDS specifications (in a way, BEP020 just makes the validator more aware of eye-tracking specificities so that these datasets can be validated and properly encoded).

One problem is that current specifications have hit fundamental limitations with the compressed TSV format. While BEP020 directly addresses the issue by defining a new generic suffix _physioevents, most of the BEPs I've read that faced this problem are choosing to create new suffixes and application-specific formats.

To me, this is the crux of the problem and a clear consequence of fragmentation: instead of improving BIDS and ridding it of this problem across the board (read: across data type structures), we are using data types and suffixes to make the problem go away for each specific part of the standard. @neuromechanist mentioned Motion-BIDS, which I think represents this problem clearly:

  • Motion-BIDS is a standalone data type and modality suggesting that not all BIDS datatypes and modalities should be "brain-related."

I would read this the other way around. Because this discussion has not been brought forward until now (thanks @neuromechanist!), the motion datatype (sorry, I dislike adding the prefix/suffix BIDS- to everything) has fragmented BIDS (and introduced other idiosyncrasies such as the compressed TSV with headers). Rather than a precedent, this is an example of how sprawling datatypes (and then suffixes) will create divergent specifications that are harder to maintain because one of the reasons to do that is introducing idiosyncratic formats (also discussed in BEP042).

The worst part of this fragmentation is that the validator cannot check these new modalities/suffixes: today, with motion having its specific datatype and suffix, I can encode the same dataset with _physio and make it pass the validator, all without missing any bit of data/metadata (and I'm happy to be challenged with a dataset to demonstrate this).

You may ask whether a similar thing can be said of BEP020. Sure, but with BEP020, eye-tracking data have exactly the same structure under the current BIDS specs and under BEP020. So, if users want to go with an older BIDS version, the problem they face (for the most part; I'm happy to give more details on this, which justified proposing _physioevents) is ensuring that all eye-tracking metadata defined as REQUIRED by BEP020 are correctly encoded for the dataset to be reusable, while the validator will not tell them anything about the problem.

  7. Consistency with existing structures: How would the decision align with precedents set by other data types?

This question is biased towards the following argument: because motion has its modality, now the rest should have a modality. It's clear that a decision limiting new data types and/or suffixes will not align with the precedent. Given the arguments above, IMHO precedent makes a poor basis for the decision. The motion BEP was put forward before the community became aware of the issues involved, and that should not influence the decision today.

Although at this point it should be clear that I think new data types and suffixes should not be created arbitrarily, if a final decision is made to allow this, I think it is critical at minimum that all these BEPs are required to provide a precise mechanism so that the new data type / suffix cannot be encoded (and valid as per the validator) as the former data type/suffix.

oesteban avatar Jun 09 '25 06:06 oesteban

A few points of discussion that fall outside @neuromechanist's proposed framework:

  1. How can we ease the development of BEPs? Sunk within my wall of text there is something that IMHO is stifling and will continue to stifle BIDS' progress. I'm confident that the tension between defining things and delivering an effective specification is stalling BEPs across the board. This is something that was already happening in the early days, but now it has become a central point of contention.
  2. BEP020 is incompatible with BEPs proposing to unfold new datatypes or suffixes from physio. If motion is looked at through the lens of a "precedent", BEP020 should be too. BEP020 has been under development for a long while and is mature (we are finishing up BIDS-examples and addressing a sophistication of the validator). It would be regrettable if this debate were held without everyone involved having soaked in BEP020 to make a truly informed decision.
  3. These decisions are more important than specific BEPs. In the above, please do not understand my view of the motion BEP as dismissive. I think it is the result of the effort and best intentions of all people involved, so I do respect the work. In general, it is my impression that BEPs are all facing frictions derived from the tension between the best intentions of people to push extensions and improvements and the foundations of BIDS, with some parts of the original vision lost with Chris G's departure. On my side, I feel guilty for involving myself in these conversations (e.g., BEP020, BEP038, BEP042, etc.) a tad too late. However things happened, I want to raise awareness that this decision is a defining moment for BIDS.
  4. Data formats (the logic behind choosing/adding/dropping thereof) should also be discussed before any BEP is merged. @neuromechanist's framework does not allow much depth on the related debate about formats, which was also mentioned within EMG. I would recommend that the community discuss that one. We need to try to provide general guidance, and then discuss specifics (perhaps starting with tsv[gz], vs parquet, vs hdf5/zarr/other?). As I mentioned in the EMG BEP, I think data formats should be forward-looking (i.e., optimized for processing and adoption by neuroimaging libraries) as opposed to backward-looking (i.e., trying to resolve the device maker's problem of efficiently storing a stream of data without requiring crazy resources or missing relevant data/metadata for formatting/technical reasons).

oesteban avatar Jun 09 '25 07:06 oesteban

@oesteban can you clarify a couple points? I'm having trouble pinning down exactly what you would recommend / prefer.

first confusion: existing definition of "data type"

I think the framework is well defined for data types (folders under sub- or ses-) and insufficient for modalities (suffixes).

Are you referring to the "common principles" definitions, e.g., data type and modality? To me that data type definition seems extremely permissive, and notably says nothing about "brain" vs "not brain" or what kinds of data are allowed (it only has a list of what exists so far). In contrast, the definition of modality does refer to "brain data" but (as you imply) is somewhat imprecise, and to me seems inconsistent (e.g., _physio is a modality but anything under _physio is by definition not "brain" data right?)

So, in what sense do you feel that "data type" is well-defined, and have I correctly characterized your critique of the "modality" definition?

second confusion: brain vs neural

Within the brain imaging data structure I would expect brain data types.

vs

I don't see a reason to differentiate between neural and non-neural

Are you meaning here to say that measures of the peripheral nervous system (like EMG) are still "not brain" and therefore should not have their own data type? Just want to make sure you're using the terms the same way I would.

drammock avatar Jun 09 '25 21:06 drammock

Thank you all for the thoughtful responses! It's encouraging to see how invested our community is in BIDS' well-being. After reading through the comments, a few live discussions, and some good old reflection, here are my thoughts:

BIDS is primarily for "brain-related" data, which I see as a perspective rather than an inherent property of the data. Motion is a datatype in BIDS because it informs brain activity research; I doubt pure biomechanists would use BIDS for data sharing (they have alternatives like addbiomechanics.org). Similarly, eye-tracking data in BIDS serves a brain research perspective, despite extensive eye-tracking efforts in AR/VR that wouldn't likely use BIDS. IMO, as long as data serves as a window to the brain, it falls within BIDS' purview.

Following @mszinte's point, I wonder if we might consider a simplicity principle when addressing these questions. If we base decisions about datatypes/modalities on how simple it is to curate, manage, and reuse data, the answers to questions 1-3 may become clearer.

Simplicity could be quantifiable:

  • A. Minimize files needed (excluding sidecars) to describe new data
  • B. Keep datatypes, modalities, and entities minimal while reusing them for new data (of course metadata fields change with new data added to the spec)
  • C. Avoid congestion in datatypes. The majority of data files in a host datatype (that is, a directory containing directly-related data as well as additional data files) should be directly related to that datatype
  • D. Balance points B and C
  • E. Keep data with shared characteristics (common clock, recording instrument, etc) in one file unless specific features (such as task, run, data source) warrant separation
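
As a rough illustration of how point C could be measured, here is a minimal Python sketch that computes the fraction of non-sidecar files in a host datatype directory whose suffix is directly related to that datatype. The set of "native" eeg suffixes, the file names, and the helper names are assumptions made for this example only; a real policy would need to define them.

```python
def suffix_of(filename):
    """Last underscore-separated token before the extension(s)."""
    return filename.split(".", 1)[0].rsplit("_", 1)[-1]


def native_fraction(filenames, native_suffixes):
    """Fraction of non-sidecar files whose suffix belongs to the host datatype."""
    data_files = [f for f in filenames if not f.endswith(".json")]
    if not data_files:
        return 0.0
    native = [f for f in data_files if suffix_of(f) in native_suffixes]
    return len(native) / len(data_files)


# Assumed set of suffixes "directly related" to an eeg directory.
EEG_NATIVE = {"eeg", "channels", "electrodes", "events", "coordsystem"}

# Hypothetical eeg directory hosting one eye-tracking recording.
eeg_dir = [
    "sub-01_task-rest_eeg.edf",
    "sub-01_task-rest_eeg.json",
    "sub-01_task-rest_channels.tsv",
    "sub-01_task-rest_events.tsv",
    "sub-01_task-rest_recording-eye1_physio.tsv.gz",
    "sub-01_task-rest_recording-eye1_physioevents.tsv.gz",
]

fraction = native_fraction(eeg_dir, EEG_NATIVE)
print(f"native fraction: {fraction:.2f}, majority native: {fraction > 0.5}")
```

A directory would count as congested under point C once this fraction drops to one half or below; the exact cutoff is open for discussion.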

Consider Motion-BIDS: data from a single motion system uses three files (data, channels, events) plus sidecars, with multiple systems identified by tracksys. It introduces one datatype and essentially one entity while reusing important resources (events and channels) and avoiding congestion in other datatypes, so, probably simpler than using a host datatype.

Conversely, the Eyetrack-BIDS proposal describes data using two files (data and events) with one new suffix (physioevents), which seems simpler than creating a separate datatype. However, the per-eye data file requirement (left, right, cyclopean) could add six files to the host datatype. For datasets like HBN-EEG (9 tasks, 12 runs), adding eye-tracking could triple the file count, which is potentially less simple. (I'll discuss this further in #1128)

This perspective might provide a unified view of BIDS specifications, making questions 4-6 clearer:

  1. Formal policy: Decide early in BEP development whether a datatype/modality is needed, based on file requirements, actual examples of data use, and which approach is simpler.

  2. Consistent treatment: Strive for findable data while avoiding both host datatype congestion and datatypes with minimal files.

  3. Threshold recommendations: Consider EKG as an example; it can be a channel in EEG, a physio modality in the host EEG datatype, or standalone in a physio datatype. The simplicity principle could guide recommendations from both curator and user perspectives: EKG recorded with the same EEG instrument might be simpler as a channel; standalone EKG with different instruments might be best in the host datatype; but multiple physiological recordings (eye-tracking, GSR, metabolics, SpO2, EKG) might warrant using a dedicated physio datatype.
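
The three EKG scenarios above can be sketched as a small decision helper. This is only an illustration of the reasoning, not a proposed rule: the function name, its arguments, and the cutoff of more than one recording for "multiple" are assumptions.

```python
def suggest_placement(same_instrument_as_host, n_physio_recordings):
    """Map the three EKG scenarios above to a placement suggestion."""
    if same_instrument_as_host:
        # Shared amplifier and clock: simplest to keep as a channel.
        return "channel in the host data file"
    if n_physio_recordings > 1:  # assumed cutoff for "multiple" recordings
        return "dedicated physio datatype"
    return "physio file in the host datatype"


print(suggest_placement(True, 1))   # EKG on the EEG amplifier
print(suggest_placement(False, 1))  # standalone EKG on a separate device
print(suggest_placement(False, 5))  # eye-tracking, GSR, metabolics, SpO2, EKG
```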

I look forward to discussing this further on Tuesday, June 10th at 15:30h CET.

neuromechanist avatar Jun 10 '25 01:06 neuromechanist

@oesteban, I did not discuss EMG and its data format here to avoid sidelining the main questions (I used EMG only as a case example to bring up the general point).

But, briefly let's assume EMG uses EEG as a host datatype:

  • EMG (#1998) needs data, events, channels, electrodes, and coordsystem per device to sufficiently describe the data, which is at least as many data files as EEG has. So, EMG will easily overwhelm the host datatype.
  • It is not uncommon to have two EMG systems per person (it is uncommon to have two EEG or two eye-tracking systems per person). Not only does this create congestion in any host datatype, but also using the physio modality and merely using recording-<label> is insufficient, IMO, to indicate that the data is both EMG and that there are multiple EMG devices with different characteristics at play.

In short, I believe that having a separate datatype for EMG makes data curation and reuse simpler, and only adds one major change (one datatype, zero entity) to the spec (not considering the modality-specific changes to the metadata fields).

Discussion about file format might be better served at #2055. I'd be happy to discuss both points further offline or under their respective PR/Issue.

neuromechanist avatar Jun 10 '25 02:06 neuromechanist

@drammock:

Are you referring to the "common principles" definitions, e.g., data type and modality?

By "the framework" I was referring to @neuromechanist's proposed structure to discuss the matter.

Are you meaning here to say that measures of the peripheral nervous system (like EMG) are still "not brain" and therefore should not have their own data type?

No, what I said is that BIDS needs to create the governance mechanisms to make these decisions. A possibility is to define the imaged object as a way of deciding (brain vs. non-brain). Brain/non-brain and/or neural/non-neural could be ways of establishing how to make a decision.

Once the governance is explicit and agreed upon, then we can collectively make a decision about the current EMG proposal.

This contrasts with BEP020, because the proposal, as it stands now, does not propose any new data type or modality (suffix), and therefore, could be accepted straightaway since it just refines the existing standard.

@neuromechanist

Motion is a datatype in BIDS because it informs brain activity research;

This relates to the problem I mentioned about the mistake of establishing ontologies rather than focusing on resolving the technical problem. Motion has its own datatype structure in BIDS because it went through the BEP adoption process and was accepted. That doesn't mean it informs or doesn't inform brain activity research. What I mean is that we should not get into that discussion now: it may have been the argument made back in the day, but motion is today part of the spec.

However, per-eye data file requirement (left, right, cyclopean) could add six files to the host datatype. For datasets like HBN-EEG (9 tasks, 12 runs), adding eye-tracking could triple the file count; potentially less simple. (I'll discuss this further in #1128)

I think linking the number of files with complexity is wrong. It's a poor link for humans (six files are definitely not a deluge, especially when they sit next to the modalities they inform; it is actually more complex not to see them next to those data blobs). It's definitely untrue for machines: indexing directories is more costly than indexing files in a flat structure.

All that said, BEP020 does not prevent you from storing all the data relating to the three channels in a single file (hence one tsv + json, two files total).

6. Consider EKG as an example; it can be a channel in EEG, a physio modality in the host EEG datatype, or standalone in a physio datatype. The simplicity principle could guide recommendations from both curator and user perspectives: EKG recorded with the same EEG instrument might be simpler as a channel; standalone EKG with different instruments might be best in the host datatype; but multiple physiological recordings (eye-tracking, GSR, metabolics, SpO2, EKG) might warrant using a dedicated physio datatype.

This is a problem I have tried to raise awareness about: it is very problematic IMHO that two researchers can encode the same data in two different ways because the standard allows for it. It basically breaks a foundation of repeatability.

The motion BEP enabled just that because you can write motion as physio and it will still be valid. This is why, IMHO, any BEP proposing new datatypes and/or modalities for data that already could be encoded (although with absolute underspecification of metadata and data structures) with BIDS should include language on how to discontinue writing them "the old ways".

  • So, EMG will easily overwhelm the host datatype.

This is biased. What does "overwhelming the host datatype" mean? Also, I think it enforces an interpretation I disagree with (see above).

  • but also using the physio modality and merely using recording-<label> is insufficient, IMO, to indicate that the data is both EMG and that there are multiple EMG devices with different characteristics at play.

This would require a targeted discussion within BEP042. As I suggest, I think first we should define "the rules" under which new modalities/datatypes can be expanded and then make these decisions within the common framework.

I believe that having a separate datatype for EMG makes data curation and reuse simpler

This is bringing the argument home. Let's first establish the framework, and once the framework is there, BEP042 will need to decide whether to justify the new data type under the framework or to consider not expanding data types/modalities.

Please note that expanding both data type and modality is a warning flag to me because it signals that there has not been an analysis of what exactly is required to represent the data.

oesteban avatar Jun 10 '25 06:06 oesteban

Following the discussion today, @yarikoptic suggested making proposals as PRs and keeping proposal-specific comments in each PR. Should we make several draft PRs and present each to the community?

To experiment, PR #2135 is the succinct version of my proposal. It is compatible with the current state of BIDS and makes suggestions on how to decide, not just for future BEPs but whenever we need to decide whether to put data under a channel, modality, or datatype.

Here is also the rendered HTML.

neuromechanist avatar Jun 10 '25 15:06 neuromechanist

Although I tried to describe my position during the meeting, I feel I failed to communicate it effectively.

I have no issues with BEP042 introducing new data types, modality suffixes or formats. The issue, IMHO, is that BIDS currently lacks a structured way of revising these BEPs and explicit agreement thresholds to propose substantial changes to the spec (such as those proposed in BEP042). To fix that, I don't think we need changes to the spec itself, but to our decision making processes. For this reason, my PR with the proposal is done against the BIDS-website repo (bids-standard/bids-website#668 - rendered here).

My position is that, before any new BEP proposing new data types, suffixes, or formats is merged, we need to have a proper screening process and, if the changes are substantial, require some levels of participation and agreement.

In other words, if BEP042 passed the process flow I propose in my proposal (or whatever it ends up being after review and approval by whoever needs to approve my PR, evidently it is a proposal to be evolved by the community), then I'm more than happy to see it merged with anything the BEP involves. Otherwise, my opinion is that we will prioritize introducing changes fast without effective control to prevent mistakes, which may eventually undermine the entire specification effort.

oesteban avatar Jun 10 '25 20:06 oesteban

I also added two proposals related to my comments in BEP042 and my experience in other BEPs (014, 020, 038):

  • About the temptation of making BIDS a textbook (bids-standard/bids-website#669 - rendered here)
  • Making a proposal to update how formats are chosen (bids-standard/bids-website#670 - rendered here)

Like for the main proposal above (bids-standard/bids-website#668 - rendered here), we cannot now revise all those decisions we possibly would make differently today (e.g., the suggestion of not having a data type level in BIDS raw that came up in the meeting). But we can try to ensure we make better decisions moving forward by learning from past mistakes.

oesteban avatar Jun 10 '25 21:06 oesteban

@oesteban @neuromechanist

Thank you so much for your work on this, we highly appreciate it. We already briefly reviewed your PRs but haven't had the time to properly discuss them yet. We'll get back to you within the next week with updates.

tagging @yarikoptic and @effigies for reference.

julia-pfarr avatar Jun 12 '25 14:06 julia-pfarr

Here's a fresh can of worms for everyone (maybe). Hot take: both proposals could be included in the BIDS Website rather than the bids-specification as BEP development guidelines/process. I don't think either counters the other, they seem complementary.

When commenting further, please indicate which of these you are referring to:

  • @ericearl's hot takes; or
  • @oesteban's proposal; or
  • @neuromechanist's proposal

@oesteban's proposal

It took me a while to interpret Oscar's proposal. If anyone else had the same problem I did, the essence of Oscar's proposal (I think) is introducing a new process to communally decide among the BIDS maintainers, steering, and contributors whether or not a BEP is allowed to introduce a new data type subdirectory, filename suffix, or file format with varying levels of involved voters and majority approval:

  • "Require strong rationale and 75% approval from 20 reviewers" for a new data type subdirectory
  • "Require rationale and 60% approval from 15 reviewers" for a new filename suffix
  • "Require rationale and 51% approval from 15 reviewers" for a new file format

I like this approach, but have some first thoughts. Considering how hard it sometimes is to get even 5 reviewers, I would consider altering these to simplify the process further.

I instead propose a 51% approval from 8 or more reviewers for introducing any of the three.

My reasoning:

  1. BIDS seems to me to further develop with the people that show up to support developments. Any 8 or more people that show up and review the new introduction of a data type, suffix, or file format should be able to decide by majority.
  2. I expect BEP leads (experts on their BEP) would not consider introducing a new data type, suffix, or file format unless they felt it was a helpful delineation. This could be made clearer with guidelines too (like Yahya's proposal).
  3. New BIDS developments generally lean into the "80% of use cases is good enough to start" principle. So if a new data type, suffix, or file format proposal can get enough voters to approve out of 8 or more in a set period of time, then that should be enough to determine approval or disapproval.
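The two voting schemes can be compared with a short sketch. It assumes that "X% approval from N reviewers" means at least N reviewers took part and at least X% of them approved. That reading, and all names in the code, are assumptions for illustration and are not taken from either proposal's text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    min_reviewers: int
    min_approval: float  # fraction of participating reviewers who approve


# Tiered thresholds as quoted above from @oesteban's proposal.
TIERED = {
    "datatype": Rule(min_reviewers=20, min_approval=0.75),
    "suffix": Rule(min_reviewers=15, min_approval=0.60),
    "format": Rule(min_reviewers=15, min_approval=0.51),
}

# Flat alternative: 51% approval from 8 or more reviewers for any of the three.
FLAT = {kind: Rule(min_reviewers=8, min_approval=0.51) for kind in TIERED}


def is_approved(rules: dict, kind: str, approvals: int, reviewers: int) -> bool:
    """Return True if the vote meets both the quorum and the approval share."""
    rule = rules[kind]
    if reviewers < rule.min_reviewers:
        return False  # not enough participation to decide
    return approvals / reviewers >= rule.min_approval


# Example: 6 of 9 reviewers approve a new data type subdirectory.
print(is_approved(TIERED, "datatype", approvals=6, reviewers=9))  # quorum of 20 not met
print(is_approved(FLAT, "datatype", approvals=6, reviewers=9))
```

With the same nine reviewers, the flat rule reaches a decision while the tiered rule cannot, which is the practical difference the reasoning above turns on.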

I'm curious to hear others' related thoughts.


@neuromechanist's proposal

Yahya produced a good amount of content there, which took me a while to read (I'm a slow reader). I think the heart of it was the "Simplicity principle" (for BEP development):

  1. Minimize file requirements: Reduce the number of files needed (excluding sidecars) to describe new data
  2. Maximize reuse: Keep data types, modalities, and entities minimal while reusing existing structures for new data (metadata fields may change when new data is added to the specification)
  3. Avoid data type congestion: The majority of data files in a host data type directory should be directly related to that data type's primary purpose
  4. Balance reuse and congestion: Balance the benefits of reusing existing structures against the risk of overcrowding a data type with unrelated files
  5. Maintain data coherence: Keep data with shared characteristics (common acquisition clock, recording instrument, coordinate system) in one file unless specific features (such as task, run, or data source) warrant separation

To me the third point rings truest. The fifth point is the next best, but with a critical edit. The other three are good guidance but not hard and fast rules. My reasoning:

  1. "Avoid data type congestion" seems the most important because if you're introducing a new data type subdirectory, then it needs to be serving a new purpose that prior data type subdirectories were not serving.
  2. Maintain data coherence seems the next most important, but only with a change from "in one file" to "in one subdirectory". I say this because I expect related experiment file formats with any suffix in one data type subdirectory, but I don't expect unrelated experiment data in the same data type subdirectory (even if the data type subdirectory was reused).

After typing this all up, I see it's a lot. Sorry for the fresh wall of text. Thanks all for your comments and hard work folks!

ericearl avatar Aug 08 '25 13:08 ericearl

I have been putting off reading the full details of both proposals, so thanks a lot @ericearl for the birds-eye view summarising key elements. IMO if we integrate those in our BEP development, that should allow @neuromechanist to decide on his BEP - having the voting system in place. (a la eric, as indeed, we won't get that many reviews)

CPernet avatar Aug 08 '25 13:08 CPernet

@ericearl

both proposals could be included in the BIDS Website rather than the bids-specification as BEP development guidelines/process.

I totally agree. Indeed, my proposal is against BIDS Website (bids-standard/bids-website#668) already.

I don't think either counters the other, they seem complementary.

Agreed. However, both originate from the EMG BEP and should be revised/processed before making further steps within any BEP proposing new modalities (otherwise, there's a risk that the BEP needs to backtrack, or that the BEP is actually passed before these proposals and then BIDS includes one more inconsistency).

the essence of Oscar's proposal (I think) is introducing a new process to communally decide among the BIDS maintainers, steering, and contributors whether or not a BEP is allowed to introduce a new data type subdirectory, filename suffix, or file format with varying levels of involved voters and majority approval

Happy to make the proposal more accessible. I think the summary is accurate. It would probably help if people tried the rendered version of it (https://bids-website--668.org.readthedocs.build/en/668/extensions/process.html#decision-making-framework-for-introducing-new-data-types-suffixes-and-formats). I'm happy to work on that flowchart and make it readable on a small screen.

Regarding specific suggestions: definitely, maintainers and SC know much better than me what is realistic while safe in terms of requirements / thresholds / delays.

I expect BEP leads (experts on their BEP) would not consider introducing a new data type, suffix, or file format unless they felt it was a helpful delineation. This could be made clearer with guidelines too (like Yahya's proposal).

While I agree with this, my experience on the ground is that experts initiate the BEP without a deep understanding or experience with BIDS (which is totally understandable). That makes this decision particularly delicate, because by the time they may be able to actually make it the BEP is so advanced there's no painless way back. Adding modalities/suffixes/formats is relatively easy compared to making a thorough assessment of what BIDS already supports.

oesteban avatar Aug 08 '25 13:08 oesteban

@oesteban Thanks for the quick responses! I agree with everything there and only have responses to your last thoughts, but I fear it may be diverting the focused conversation here (which I don't want to do).

While I agree with this, my experience on the ground is that experts initiate the BEP without a deep understanding or experience with BIDS (which is totally understandable). That makes this decision particularly delicate, because by the time they may be able to actually make it the BEP is so advanced there's no painless way back. Adding modalities/suffixes/formats is relatively easy compared to making a thorough assessment of what BIDS already supports.

That's a very good point. BIDS supports A LOT though, so I would rather ask (on a separate issue or discussion maybe?):

How could we make the process of knowing what data types/suffixes/formats are already supported by BIDS easier for BEP leads?

I can imagine a few ways, but either way it's a matter of providing reference materials and training (required reading for new BEP leads?). I can imagine all reference materials being contained in the BIDS Website under the BEP development guidelines. I can imagine creating short and informative slide decks describing what data types, suffixes, and file formats are already available, as well as lengthier tables for reference. These slides and tables would have to be updated with each new BIDS release. It should be possible to generate the tables from the schema.
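As a rough idea of what generating such tables could look like, here is a minimal sketch. The `schema_excerpt` dictionary is a hand-written stand-in with a handful of entries, not the real BIDS schema, and the key names (`value`, `display_name`) are assumptions about how schema objects are described.

```python
# Hand-written stand-in for part of the schema; a real table would load the
# full schema for a given BIDS release instead.
schema_excerpt = {
    "datatypes": {
        "eeg": {"value": "eeg", "display_name": "Electroencephalography"},
        "motion": {"value": "motion", "display_name": "Motion"},
        "beh": {"value": "beh", "display_name": "Behavioral Data"},
    },
    "suffixes": {
        "physio": {"value": "physio", "display_name": "Physiological recording"},
        "events": {"value": "events", "display_name": "Events"},
    },
}


def markdown_table(objects: dict, header: str) -> str:
    """Render one group of schema objects as a Markdown reference table."""
    rows = [f"| {header} | Name |", "| --- | --- |"]
    for key in sorted(objects):
        entry = objects[key]
        rows.append(f"| {entry['value']} | {entry['display_name']} |")
    return "\n".join(rows)


print(markdown_table(schema_excerpt["datatypes"], "Data type"))
print()
print(markdown_table(schema_excerpt["suffixes"], "Suffix"))
```

Because the table is derived from data rather than written by hand, regenerating it for each release would only require pointing the script at the new schema.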

This is just a short idea (as I'm guessing it requires a longer discussion), but I wanted to respond.

ericearl avatar Aug 08 '25 14:08 ericearl

That's a very good point. BIDS supports A LOT though, so I would rather ask (on a separate issue or discussion maybe?):

How could we make the process of knowing what data types/suffixes/formats are already supported by BIDS easier for BEP leads?

Absolutely. This is critical.

My take, though, is that, acknowledging that's a big fish to fry, we should set up some mechanism for BIDS to alert BEP leads of this risky situation, so they become aware earlier of the potential rabbit holes.

So, I would gladly join an initiative to make the process better for prospective BEP leads, if we first make sure we protect against the consequences of failing to do so (i.e., providing BEP leads with the required foundations).

oesteban avatar Aug 08 '25 14:08 oesteban

@oesteban Sounds like maybe another BIDS Working Group is in order for "making the process better for BEP leads". I'll touch base at a later date to help you set that up since I chair a different working group and know the process.

ericearl avatar Aug 08 '25 14:08 ericearl

Thanks @ericearl and everyone for the thoughtful discussion and recap.

  1. Should we establish a formal policy for what constitutes grounds for a new data type/modality?

At yesterday's schema meeting, @rwblair, @effigies and I ended up discussing whether we need a formal policy at all. I see the wisdom in the current process of deferring to BEP leads and maintainers case by case. I hope this conversation helps streamline future BEPs and avoid similar roadblocks. This deferring might align with @oesteban's consensus-based approach since the most involved community members are typically the BEP team and maintainers/steering.

Re: Maintain datatype coherence. This point aimed to address how to manage redundancy in data storage: motion/EMG can be EEG channels, but I regularly get questions about when to store together vs. separately.

I appreciate generalizing this to datatype directories. Separate directories have advantages, IMO, especially for metadata (as @rwblair noted). For EEG-Motion capture experiments, two directories allow two events.tsv files - one per modality. This would be difficult in one directory since we'd need both events.tsv files for the same task/session.

Why not aggregate the events? To preserve 'raw' data. Events are often generated by recording instruments, and aggregation requires reconciling different starts/sampling frequencies, moving away from the 'raw' state we strive to maintain.
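The events.tsv point can be illustrated with a small path-building sketch. It assumes, for illustration only, that each events file name carries just the sub and task entities; real datasets may use further entities, and the subject and task labels are made up.

```python
from pathlib import PurePosixPath


def events_path(subject: str, task: str, datatype: str) -> PurePosixPath:
    """Build the path of an events file, assuming only sub and task entities."""
    name = f"sub-{subject}_task-{task}_events.tsv"
    return PurePosixPath(f"sub-{subject}") / datatype / name


# Separate datatype directories: each instrument keeps its own raw events.
eeg_events = events_path("01", "walk", "eeg")
motion_events = events_path("01", "walk", "motion")
print(eeg_events)
print(motion_events)

# One shared directory: both instruments map to the same file name.
shared = {events_path("01", "walk", "eeg"), events_path("01", "walk", "eeg")}
print(len(shared))  # only one distinct path remains for two sets of events
```

With two directories the paths stay distinct. In a single directory the two events files would need an additional entity or would have to be merged, which is the aggregation step argued against above.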

neuromechanist avatar Aug 08 '25 16:08 neuromechanist