bids-examples
bids-examples copied to clipboard
Example of the HED schema library for SCORE implementation
This PR includes examples of the HED schema library for the Standardized Computer-based Organized Reporting of EEG (SCORE) implementation in several EEG and invasive EEG datasets in the two ways to annotate HED in BIDS.
BIDS dataset was validated with [email protected] and HED annotations were separately validated with the example jupyter notebook and the hedtools that will be incorporated in the bids-validator soon.
Before this example can be merged, this other pull request in the bids-specificartion that adds the use of an HED schema library bids-specification pull request needs to be merged first. Then the dataset_description.json can be updated to include the SCORE library schema.
crosslinking, this PR needs a spec PR and a following validator PR to be merged first:
- spec: https://github.com/bids-standard/bids-specification/pull/1106
- validator: https://github.com/bids-standard/bids-validator/pull/1496/
The PRs mentioned in https://github.com/bids-standard/bids-examples/pull/324#issuecomment-1192212450 have meanwhile been merged, so I think you can address some of the reviewer comments above @tpatpa
I made minor edits to match the latest HED-SCORE library schema. Validation is still failing because the validator is validating against a previous version of the HED-SCORE library schema. We are in the final stages before releasing the schema's latest version. Once released this will be validated and could be merged as well.
We just released HED-SCORE V1.0.0. If you follow the link make sure to click on "Show another version" to browse. The remaining question for this BIDS-HED-SCORE example is on how to list a set of channels affected by an event. It seems like the discussion in the electrophysiology derivatives was going for a Pythonic list of strings, eg ['C3','F3'] or ['C3-F3','P3-C3'] in a separate channels column. Any preference for these examples? @sappelhoff @robertoostenveld @VisLab @guiomar @CPernet
congrats Dora, Tal, et al!
It seems like the discussion in the electrophysiology derivatives was going for a Pythonic list of strings
yes, this should be a link to the specific discussion thread: gdoc discussion
My preference is indeed the "pythonic list of strings" (clarified more in the above linked thread) but apart from Robert and me, nobody seems to have offered an opinion yet.
+1 list of strings
It seems like the discussion in the electrophysiology derivatives was going for a Pythonic list of strings, eg ['C3','F3'] or ['C3-F3','P3-C3'] in a separate channels column.
Could you clarify what would be meant by: ['C3-F3','P3-C3']? If it means a group of channels as ordered by the channels.tsv
, I don't think this should be allowed. Which channels are in the data and how they are ordered can change during computation, so this is particularly risky for derivatives. The same comment applies to channel numbers.
Also, given that quotes are usually not used in events files, would [C3, F3] be preferred?
One further option, which perhaps the HED working group should discuss separately, is how HED might be used to define named groups of channels which could then be referred to by name. This does not preclude the need for deciding on a format for a channels column in the events files and their derivatives.
Let me copy the discussion thread from the Gdoc here, perhaps that clarifies some things:
@sappelhoff :
I think a "pythonic" way to specify channels would be nice.
- A list of strings where each string is an affected channel
- An empty list means no channel is affected
- "n/a" means the interpretation of "channel" is not applicable or the data is not available
We could additionally think about a shorthand to indicate "all channels are affected". Perhaps the string "all" would work for that ... if by chance a channel is named "all" this would still not be ambiguous, because specifying individual channels needs to be in a list like ["all"]
@robertoostenveld :
I don't care too mush, but think that we should not be restrictive in formatting. And to note: channels in the annotations file do not have to correspond to channels in the data. E.g., how would you document an interictal spike that is observed in a bipolar refreferenced channel ("C3-F3"), whereas the data is stored and shared with a common reference ("C3-REF", or probably just "C3"). Annotations could also refer to electrodes rather than channels, such as "video inspection revealed that electrode C3 fell off at 10 minutes into the recording".
@sappelhoff:
Thanks for these valid examples. Regarding the first one, one could argue that if an interictal spike at "C3-F3" is observed, and the dataset curator wants to annotate this event, then they should include that channel in the data.
Regarding the "electrode" example: We could have an additional column called "electrode" that follows the same formatting as "channel" (the column we are discussing here), but strictly referring to electrodes.
Curators could then for each event specify any number of channels and/or electrodes that are related to the event.
@sappelhoff :
I think a "pythonic" way to specify channels would be nice.
- A list of strings where each string is an affected channel
- An empty list means no channel is affected
- "n/a" means the interpretation of "channel" is not applicable or the data is not available
We could additionally think about a shorthand to indicate "all channels are affected". Perhaps the string "all" would work for that ... if by chance a channel is named "all" this would still not be ambiguous, because specifying individual channels needs to be in a list like ["all"]
I am in favor of this format.
I realize that this discussion pertains to derivatives, however, it might also be relevant to events in the main BIDS specification.
Right now the BIDS specification of events.json
indicates two options for describing column values (discounting HED): Levels
and Units
.
If Levels
are given, the column is assumed to contain categorical values.
If Units
are given, the column is assumed to contain numeric values expressed in the specified units.
Would this new format of column value require a new List
type -- list of numbers or list of strings?
Right now the BIDS specification of
events.json
indicates two options for describing column values (discounting HED):Levels
andUnits
.
Maybe we can use IntendedFor
in the corresponding JSON file to link to the relevant channels file similar to how fieldmap data is linked to a specific scan?
One further option, which perhaps the HED working group should discuss separately, is how HED might be used to define named groups of channels which could then be referred to by name. This does not preclude the need for deciding on a format for a channels column in the events files and their derivatives.
I agree that developing an "HED way" to deal with this problem would be very nice in addition to a "straight forward - non HED" BIDS solution.
Right now the BIDS specification of events.json indicates two options for describing column values (discounting HED): Levels and Units.
you are raising a very important point here, thanks -- I have not considered that my proposal would introduce a new "data type" for rows in TSV files apart from those defined by Levels
and Units
. It wouldn't make sense to consider the lists of strings (or the []
, "all"
, and "n/a"
values that'd be possible) as "Levels", because there'd simply be too many levels without providing a lot of insight.
We should:
- think about alternative solutions
- think whether introducing a
List
type makes sense in a more general sense (beyond our present problem), because I would be hesitant to introduce a concept that is only sensible / meaningful for one particular application - beyond the potential
List
data type, the present proposal would also permit"all"
as a shorthand for a full list of channels and the standard"n/a"
. While permitting"n/a"
is easy and needed anyhow, ... what do you think of the"all"
shorthand? Needed? Superfluous? ...
In any case, we need some more people to chime in :-)
Maybe we can use IntendedFor in the corresponding JSON file to link to the relevant channels file similar to how fieldmap data is linked to a specific scan?
Not all JSON files have an IntendedFor
metadata field, but I like the general direction here ...
One other possibility would be just to use a string for the channel list rather than an actual list. Then BIDS doesn't have to be concerned about validating necessarily, although it could.
"[Cz7, O1, O2]"
Downstream tools could then do their own thing to interpret a channels column, parse, and make sure that these are real channels.
Right now the BIDS specification of events.json indicates two options for describing column values (discounting HED): Levels and Units.
you are raising a very important point here, thanks -- I have not considered that my proposal would introduce a new "data type" for rows in TSV files apart from those defined by
Levels
andUnits
. It wouldn't make sense to consider the lists of strings (or the[]
,"all"
, and"n/a"
values that'd be possible) as "Levels", because there'd simply be too many levels without providing a lot of insight.We should:
- think about alternative solutions
- think whether introducing a
List
type makes sense in a more general sense (beyond our present problem), because I would be hesitant to introduce a concept that is only sensible / meaningful for one particular application- beyond the potential
List
data type, the present proposal would also permit"all"
as a shorthand for a full list of channels and the standard"n/a"
. While permitting"n/a"
is easy and needed anyhow, ... what do you think of the"all"
shorthand? Needed? Superfluous? ...In any case, we need some more people to chime in :-)
I think a List
data type would help. This may also be helpful beyond the present problem for other ephys cases where multiple channels or electrodes could be listed and I can imaging other cases where multiple items may to be indexed beyond channels or electrodes.
Maybe we can use IntendedFor in the corresponding JSON file to link to the relevant channels file similar to how fieldmap data is linked to a specific scan?
Not all JSON files have an
IntendedFor
metadata field, but I like the general direction here ...
This may be similar or comparable to a case in the BIDS-connectivity-BEP, for channel-to-channel connectivity metrics where we also want to interpret a column with respect to an _channels.tsv or _electrodes.tsv file (discussing there whether this should be a sourceAtlas, but it is not an actual atlas...).
For the format of this field a "pythonic list" has the downside of encouraging developers to do an eval which is not a safe practice. A comma separated list (or a single item) would work effectively as a representation. Brackets aren't strictly necessary as these files are a TSV.
CCing @christinerogers for EEGnet
For the format of this field a "pythonic list" has the downside of encouraging developers to do an eval which is not a safe practice. A comma separated list (or a single item) would work effectively as a representation. Brackets aren't strictly necessary as these files are a TSV.
I agree with you -- I guess I proposed it this way to distinguish n/a
from a potential channel name n/a
and to have a convenient shorthand all
that is different from a channel potentially called all
. But we can probably come up with better ways.
Super, thanks @sappelhoff -- I'll raise this at the BIDS maintainers meeting, since @Andesha 's request for code-agnostic (consistent, comma-separated instead of pythonic) lists helps those making BIDS-supporting data platforms (like EEGNet which is webfacing) across modalities.
Has this been discussed elsewhere in the BIDS specs, i.e. is there an implementation cost concern? cc @CPernet
Does this make sense to you @arnodelorme @VisLab ? thoughts, @dorahermes @tpatpa ?
Discussing this with a few maintainers we settled on allowing json arrays whose only values are strings as an acceptable solution. The json specification allows json documents to start with a square bracket so any languages implementation of the json parser should be sufficient.
We should also be able to represent this in the schema well enough to tip the validators off that elements of the column may need to be parsed.
@effigies did I misrepresent anything here?
Hi @rwblair , My takeaway from this maintainers meeting was that it wasn't decided, and will be raised again next time after some community consultation (@sappelhoff). There's definitely concern about the embedded JSON suggestion on our side -- e.g. within a TSV it's not only hard to read but also can't be validated easily by those preparing data files, so significantly more error-prone.
@rwblair nothing misrepresented, but here's a more extended version:
We discussed two possible approaches:
- Comma-separated values:
a,b,c
- JSON arrays:
["a","b","c"]
Advantages to CSV:
- Simple. Just use
split(val, ",")
, which will come in every language and is easy to write your own if forced. - You can eyeball it.
Advantages to JSON arrays:
- Typed. The schema could declare a value as having type
array[str]
(actually something uglier, but you get it) and the validator can check it. - Proximal. If a tool is working with BIDS data, it should have access to a JSON parser. No eval needed.
- Escapable. JSON supports unicode escapes, in case someone needs a tab or newline character inside their strings, which TSV with a CSV insert would interpret as an end-of-column or end-of-row. (I hope people don't do this, but in the extreme...)
Hi @rwblair , My takeaway from this maintainers meeting was that it wasn't decided, and will be raised again next time after some community consultation (@sappelhoff). There's definitely concern about the embedded JSON suggestion on our side -- e.g. within a TSV it's not only hard to read but also can't be validated easily by those preparing data files, so significantly more error-prone.
My apologies for the ambiguity and finality in tone of my post. This was from a chat after the maintainers meeting you presented at. I completely agree with you that it needs more discussion here and in the next maintainers meeting.
thanks @rwblair - let's continue this detailed point in a more general/accessible place than this channel. cc @sappelhoff happy to follow your lead here.
Let's continue the discussion in this dedicated issue, where I summarized the present situation:
- https://github.com/bids-standard/bids-specification/issues/1446
this present PR is
- buried in the examples repo
- intended to achieve something else
Per discussion during the bids-derivatives meeting, bipolar channels in this case remain in the raw data (there is no previous BIDS dataset, so therefore these data are not derived). The current channel names (e.g. FP1-F7) in _channels.tsv is therefore correct.
unrelated note - the dataset_description can contain SourceDatasets which point to the TUH corpus (which is non-BIDS).
Looks good....
Looks good to me, ok after it is validated!
@tpatpa I rebased your branch on bids-standard/bids-examples@master
. You will have to run the following command locally before you continue to work on this: git fetch --all
and then git reset --hard origin/master
NOTE if you have uncommitted local changes on your branch, do not run the above ☝️ ... just comment here in that case :-)
And in the future, it would be good if you could make a new git branch before submitting a pull request. Submitting pull requests directly from your master
branch makes it difficult for you to stay up to date with the upstream repository (bids-standard/bids-examples@master
).
We're having a chicken and egg problem here... The HED version must be a released version, but SCORE 1.1.1 has not been released yet due to unresolved issues. @dorahermes can you review 1.1.1 and we can discuss the issues offline and move towards release.