stimuli BEP
- edit: this is largely being addressed by BEP044, led by @neuromechanist in https://github.com/bids-standard/bids-specification/pull/2022, with some ideas slated for other efforts such as bids-2.0
As @Remi-Gau hinted in #695, we still lack clarity on how to store and annotate original stimuli.
We do have
- `stimuli/` folder which, like `sourcedata/`, is in no way "prescribed" a specific structure.
- `stim_file` column in `_events.tsv` to point to an (unregulated) location under `stimuli/`, and then populating that `stim_file` description/HED tags within `_events.json` (bless the inheritance principle).
- "human wording" to point to the origin of a stimulus within `_events.json` as possibly coming from some DB
- `_stim.tsv.gz` files for "signals related to the stimulus" (but not necessarily stimulus)
With respect to the first 3 items, and in conjunction with
- stimuli collections/datasets should be self-sufficient/described
- incoming requests to store stimuli datasets on DANDI archive

I wondered whether there is an ongoing effort to standardize "stimuli datasets" so
- they could be readily reusable across studies by simply placing them under `stimuli/<name>`, avoiding the need to describe stimuli in `_events.json` since the information could be picked up from their standardized layout
- derivatives of them could be created and possibly shared alongside (e.g. all the feature extractions done by pliers (thanks to @tyarkoni, @qmac, et al.))
With that in mind I am even thinking such datasets could follow the BIDS mantra and just get "participant/subject" and `sub-` renamed to "stimulus"/`stim-`, and preserve `README.md`, `dataset_description.json`, `stimuli.tsv`, etc.
Worth a BEP/effort, or maybe it is already a "solved problem"? ;) WDYT?
Related:
- fresh #750 for storing `_stim.{mp3,mp4,...}` alongside neural data. Sidecar fields in that file could come in handy when instrumenting the specification of stimuli as presented from the shared-across-subjects `sourcedata/` (e.g. `StartTime`, possibly `TimeDrift` between scanner and stimulus delivery for a lengthy (an hour) presentation, etc.)
At one point I talked to @Gilles86 about how he was storing stimuli, but don't clearly recall how deep we went. He might have some thoughts here.
Just to comment on one thing, I'm not sure `stim-<label>` buys much over `<label>`. It would be worth thinking about what are the orthogonalish dimensions that it would make sense to have entities for. A stimulus class name and an index to distinguish instances of that class are going to be most common. Then you may have some within-class parameterization, but that's going to be really specific to the type of stimulus. For example, if you're interested in speaker-invariant speech representations, you might split your stimuli by speaker, but I don't see an entity that could cover all such parameterizations.
Something like <label>[_desc-<label>][_<index>].<ext> might cover most use cases without unnecessarily adding boilerplate.
> Just to comment on one thing, I'm not sure `stim-<label>` buys much over `<label>`.

- Having a clean prefix avoids collisions of `<label>`s with other possible directories (e.g. `code/`, `sourcedata/`, etc.).
- By a similar argument, `sub-<label>/` and `sub-<label>_` IMHO also buy us nothing really, but that is what we have, and ~~likely because~~ they provide immediate "metadata" about the domain of `<label>` we are talking about.
- Having a `stim-<label>` directory and a `stim-<label>_` filename prefix allows generalizing the BIDS dataset layout to cover "stimuli BIDS datasets", where the `stim` entity serves as an analog of the `sub` entity we have now.

> Something like `<label>[_desc-<label>][_<index>].<ext>` might cover most use cases without unnecessarily adding boilerplate.
I also hate boilerplate, and indeed many use cases might not need directories at all. BUT I can see stimuli collections where each stimulus could have a good number of files (audio, audio/video, images, etc.) associated with that stimulus category; directories would thus be beneficial for organization and also for navigation and reuse (a clear "module" for a stimulus at the directory level). So, again, similarly to neuroimaging datasets, which have per-subject directories even when there is just a single T1w image per subject, it might be sensible to have per-label directories. (Moreover, there could be multiple samples of the same label, so semantically similar to `_run-`, but that entity is really not a good fit for that, and indeed `_desc-` could be better.)
PS: although maybe even `run` could make sense for stimuli recordings of the same scene/action taken in sequence and otherwise having no immediate qualitative difference!
Hmm. Okay, fair enough. I guess the question is how much is this supposed to be BIDS-like or is it supposed to be BIDS? That matters for what entity names are chosen, since if it is BIDS, then we can't change the meaning of an entity too far. If it's just BIDS-like, then we can choose entities that are appropriate for stimuli with little regard for BIDS' existing definitions or ones that are likely to be claimed by future BEPs.
I would probably prefer BIDS-like, since a subject or a recording session is integral to a lot of definitions.
So here's a notion:
    stimuli/
        dataset_description.json
        stim-<label>/
            stim-<label>[_desc-<label>][_item-<index>].<ext>
            stim-<label>[_desc-<label>][_item-<index>].json
- `stim-<label>` would be the task-relevant class
- `desc-<label>` would be within-class parameterization
- `item-<index>` would be like `run-` with no qualitative difference
- `<ext>` can probably indicate data type without resorting to an additional `_<suffix>`, which is another reason to be BIDS-like, instead of BIDS, where suffix is required.
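
To make the proposed naming concrete, here is a minimal Python sketch that parses `stim-<label>[_desc-<label>][_item-<index>].<ext>` filenames. It is only an illustration of the notion above: the character classes (alphanumeric labels, integer index) and the function name `parse_stim_name` are assumptions, not anything agreed on in this thread.

```python
import re

# Hypothetical pattern for the proposed naming scheme.
# Assumptions: labels are alphanumeric, the item index is a non-negative integer.
STIM_NAME = re.compile(
    r"^stim-(?P<stim>[A-Za-z0-9]+)"
    r"(?:_desc-(?P<desc>[A-Za-z0-9]+))?"
    r"(?:_item-(?P<item>[0-9]+))?"
    r"\.(?P<ext>[A-Za-z0-9.]+)$"
)


def parse_stim_name(filename):
    """Return the entities encoded in a stimulus filename, or None if it does not match."""
    match = STIM_NAME.match(filename)
    if match is None:
        return None
    # Drop optional entities that are absent from this filename.
    return {key: value for key, value in match.groupdict().items() if value is not None}
```

Optional entities are simply left out of the returned dictionary, so a bare `stim-word1.json` yields only the `stim` and `ext` keys.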
An alternative (or addition) to desc- could be a stims.tsv that allowed you to explicitly say that here are relevant factors:
stimulus speaker_id speaker_gender tone
stim-word1_desc-sp1normal_item-1.wav 1 M normal
stim-word1_desc-sp1normal_item-2.wav 1 M normal
stim-word1_desc-sp1strained_item-1.wav 1 M strained
stim-word1_desc-sp1strained_item-2.wav 1 M strained
Then we presumably need a stims.json to define columns.
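
As a rough sketch of how such a pair of files could be consumed, the Python below reads the example table above and reports any column that the sidecar does not describe. The `stims.json` content (its `Description` and `Levels` entries) is invented for illustration, and `load_stims` is a hypothetical helper name.

```python
import csv
import io
import json

# The example table from above, tab-separated.
STIMS_TSV = (
    "stimulus\tspeaker_id\tspeaker_gender\ttone\n"
    "stim-word1_desc-sp1normal_item-1.wav\t1\tM\tnormal\n"
    "stim-word1_desc-sp1normal_item-2.wav\t1\tM\tnormal\n"
    "stim-word1_desc-sp1strained_item-1.wav\t1\tM\tstrained\n"
    "stim-word1_desc-sp1strained_item-2.wav\t1\tM\tstrained\n"
)

# Hypothetical sidecar: the descriptions are assumptions made for this sketch.
STIMS_JSON = json.dumps({
    "speaker_id": {"Description": "Identifier of the speaker"},
    "speaker_gender": {"Description": "Gender of the speaker", "Levels": {"M": "male", "F": "female"}},
    "tone": {"Description": "Tone of voice used by the speaker"},
})


def load_stims(tsv_text, json_text):
    """Return (rows, undescribed), where undescribed lists columns missing from the sidecar."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    described = json.loads(json_text)
    undescribed = [
        name for name in (reader.fieldnames or [])
        if name != "stimulus" and name not in described
    ]
    return rows, undescribed
```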
Thank you @effigies !!! I feel like we are on the same page and leaping forward ;)
> I guess the question is how much is this supposed to be BIDS-like or is it supposed to be BIDS? That matters for what entity names are chosen, since if it is BIDS, then we can't change the meaning of an entity too far.

what entities meaning you see needing much of adjustment? Even for `run` I feel we would not need much of adjustment although some might already be a bit overdue: filed https://github.com/bids-standard/bids-specification/pull/760 . So not sure if we really need to introduce `_item` in favor of `_run` just yet

> An alternative (or addition) to `desc-` could be a `stims.tsv`
+1 on that. Additional thoughts:
If we are to retain scans as a term, and aim for "BIDS" (not just "BIDS-like") then such a file would be analogous to _scans.tsv we already have, thus be stim-word1/stim-word1_scans.tsv...
with aforementioned #760 in mind, I wonder if with this "stimuli BEP" we could indeed be the first to generalize that into samples (from scans) or some other good generic term?
But then I see the point of having top level stims.tsv analogous to participants.tsv to describe common high level attributes for each stimuli <label> such as
stimulus_id language word_class ...
word1 english noun
> what entities meaning you see needing much of adjustment? Even for `run` I feel we would not need much of adjustment although some might already be a bit overdue: filed #760 . So not sure if we really need to introduce `_item` in favor of `_run` just yet
It feels like shoehorning an experimental notion into a corpus description. I would rather step back and think about what would make a good corpus standard with minimal reference to BIDS.
Maybe if you're thinking of the generation of the stimuli as a procedure that is repeated multiple times, run works. But perhaps I'm sampling from a larger corpus where the notion doesn't apply (e.g., going back through BBC archives for different pronunciations of words).
> An alternative (or addition) to `desc-` could be a `stims.tsv`
>
> +1 on that. Additional thoughts: If we are to retain `scans` as a term, and aim for "BIDS" (not just "BIDS-like") then such a file would be analogous to `_scans.tsv` we already have, thus be `stim-word1/stim-word1_scans.tsv`... with aforementioned #760 in mind, I wonder if with this "stimuli BEP" we could indeed be the first to generalize that into `samples` (from `scans`) or some other good generic term?
>
> But then I see the point of having top level `stims.tsv` analogous to `participants.tsv` to describe common high level attributes for each stimuli `<label>` such as
>
> stimulus_id language word_class ...
> word1 english noun
Yeah stims.tsv and samples.tsv makes sense to me.
Quick thought to point out what @sappelhoff mentioned regarding subject specific stimuli here: https://github.com/bids-standard/bids-specification/pull/750#issuecomment-796591440
I don't think that it will be such a rare case, and we should probably give that some thought.
If it is just a matter of a raw stimulus being adapted to each participant, this could be treated as derivatives but having a way to describe the subject the stimulus is for would be a good thing.
Use the `desc` label to do that?

    stimuli/
        dataset_description.json
        stim-<label>/
            stim-<label>[_desc-<label>][_item-<index>].<ext>

Reuse the `sub` entity?

    stimuli/
        dataset_description.json
        stim-<label>/
            stim-<label>[_sub-<label>][_desc-<label>][_item-<index>].<ext>
Also does it make sense to have "prefix" or is it really shoehorning too much BIDS into this?
RE: sub entities, is the stimulus truly related to the subject, or is it that each subject gets a different stimulus? For a stimulus dataset that needs to be able to be understood in isolation, I'm wary of infecting with a separate notion. For example, maybe I created the stimuli for the subjects in a particular study, but then I want to perform a second study with the same stimuli, and the stim-movie_sub-01.mp4 no longer is viewed by sub-01 in my new study. Or maybe it's viewed by sub-01 and sub-38.
I would suggest that this would be a good use case for item. If sub-01 watches stim-movie_item-01.mp4, sub-02 watches stim-movie_item-02.mp4, and so on, then there's a straightforward mapping, but it is not confusing if it doesn't apply when the same stimulus set is used in a different study.
> Also does it make sense to have "prefix" or is it really shoehorning too much BIDS into this?

I don't really understand this question. Could you clarify?

> RE: sub entities, is the stimulus truly related to the subject, or is it that each subject gets a different stimulus?

Yes but...

> For a stimulus dataset that needs to be able to be understood in isolation, I'm wary of infecting with a separate notion.

I will get specific to better explain.
In the case I have in mind, the stimuli are literally made for each participant: participants are presented with sounds played from different locations, and the sounds are recorded with microphones placed next to their ears so that they can be replayed to them in the scanner as if they were listening to sound coming from that very specific location. Each person has their own "head related transfer function" that filters the sound in a given way, so each participant has their own set of sounds.
This is very much related to a given dataset so in most cases it won't work in isolation from the data.
But even if you "ship" the stimuli with the BIDS dataset I am wondering if it would make sense to worry about this to the level of having an entity that "pairs" a stimulus to a subject. Sort of thinking this is in the 20% of our pareto principle.
My two cents: this feels to me like way more trouble than it's worth. Just give each stimulus a unique stim and/or item label, and then you can map between stimuli and subjects using a .tsv file that maps between them, or by adding a sidecar to each stimulus that indicates which subject they're for. Otherwise it gets very messy because everywhere else in BIDS that sub occurs, it's mandatory. I would honestly even consider sticking with just stim in the filename and doing everything else with a stims.tsv file. The space of potential stimuli and their applications seems to me too wide to plausibly encode in a meaningful way in filenames.
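
A minimal Python sketch of the ".tsv file that maps between them" idea, to show how little machinery it needs. The column names `stimulus` and `participant_id`, the example rows, and the helper `stimuli_by_subject` are all assumptions for illustration; nothing here is specified anywhere.

```python
import csv
import io
from collections import defaultdict

# Hypothetical mapping table; the same stimulus may be listed for several subjects.
MAPPING_TSV = (
    "stimulus\tparticipant_id\n"
    "stim-movie_item-01.mp4\tsub-01\n"
    "stim-movie_item-02.mp4\tsub-02\n"
    "stim-movie_item-01.mp4\tsub-38\n"
)


def stimuli_by_subject(tsv_text):
    """Build a lookup from participant_id to the list of stimuli intended for them."""
    lookup = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        lookup[row["participant_id"]].append(row["stimulus"])
    return dict(lookup)
```

Because the subject never appears in the stimulus filename, reusing the same stimulus set in another study only means shipping a different mapping table.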
I also think this (i.e., stimulus naming/encoding) is a big and important enough problem it could easily be spun off into its own non-BIDS spec, and just be wrapped later by BIDS.
> I also think this (i.e., stimulus naming/encoding) is a big and important enough problem it could easily be spun off into its own non-BIDS spec, and just be wrapped later by BIDS.
ReCorDS (Research Corpus Data Structure)?
> I don't really understand this question. Could you clarify?

As many files in BIDS are of the form `[entity1-<label>][_entityX-<label>]*_<suffix>.<ext>`, I was just thinking if having a suffix in there made any sense.

> My two cents: this feels to me like way more trouble than it's worth. Just give each stimulus a unique `stim` and/or `item` label, and then you can map between stimuli and subjects using a .tsv file that maps between them, or by adding a sidecar to each stimulus that indicates which subject they're for.
Makes sense: when I wrote my last reply, I started thinking of an optional "intended for" or something equivalent instead of an entity
> I also think this (i.e., stimulus naming/encoding) is a big and important enough problem it could easily be spun off into its own non-BIDS spec, and just be wrapped later by BIDS.
Agreed. That is definitely something where I would like to hear the opinion of the psych-DS folks for example.
Thanks everyone! I think this discussion, along with #750, resonates also with the recent discussion of BEP032 where sub- top level might often be very suboptimal (e.g. consider a "tissue" or a "cell" to be a main domain of differentiation between recordings).
So to my ear "big and important enough problem it could easily be spun off into its own non-BIDS spec" might be a generalization of BIDS 1.x (maybe even for BIDS 2.0?), since it seems it could largely be made "backward compatible" (BIDS 1.x datasets would still be "valid"), where
- the hierarchy would not be "hardcoded" to be `sub-[/ses-]` but a specification (e.g. defined in `dataset_description.json`):
  - would be based on entities we define in https://github.com/bids-standard/bids-specification/blob/master/src/schema/entities.yaml#L2
  - maybe we should restrict to allow only some entities to be promoted to "hierarchy", or have a vetted list of possible "hierarchies". But the mechanism would be generic:
    - in general the specification would be `["<entity#1>", "[<entity#2>]", "..."]`, where `[]` would signal optional (if present) inclusion.
    - e.g. `["subject", "[session]"]` for default/BIDS 1.0
    - and `["stimulus"]` (yet to be added as entity) for stimuli datasets, but could as well be `["stimulus", "subject"]` (or swapped order) if such dataset has many per subject stimuli
    - some users of BEP032 will be happy to use `["tissue", "cell"]`
- "lessons learned" consistency introduced:
  - top level includes `<entity:plural>.{tsv,json}` (we will need to add "plural" per each entity in `entities.yaml`, e.g. "stimuli" for "stimulus" entity)
    - we have `sub-` but `participants.tsv`: we can generalize into `subjects.tsv`. Having `participants.tsv` while operating on `subject` entity is just a pretense of no gain IMHO.
    - so we get `stimuli.{tsv,json}`
  - `_scans.{tsv,json}` is generalized into `_samples.{tsv,json}` or dissolved entirely:
    - insofar I see it as a "summary" of metadata which generally should be present in each particular scan/sample sidecar .json file.
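
To illustrate the hierarchy mechanism, here is a small Python sketch that turns such a specification plus a set of labels into a directory path. Everything in it is an assumption for the sake of the example: `build_directory` is a made-up helper, and of the short keys only `sub` and `ses` exist in BIDS today (`stim`, `tissue`, `cell` are hypothetical).

```python
# Short key per entity name. Only "sub" and "ses" exist in BIDS today;
# "stim", "tissue" and "cell" are hypothetical.
ENTITY_KEYS = {
    "subject": "sub",
    "session": "ses",
    "stimulus": "stim",
    "tissue": "tissue",
    "cell": "cell",
}


def build_directory(hierarchy, labels):
    """Build a directory path from a hierarchy spec such as ["subject", "[session]"].

    Entities wrapped in [] are optional and skipped when no label is given.
    """
    parts = []
    for spec in hierarchy:
        optional = spec.startswith("[") and spec.endswith("]")
        name = spec.strip("[]")
        if name not in labels:
            if optional:
                continue
            raise ValueError(f"missing label for required entity {name!r}")
        parts.append(f"{ENTITY_KEYS[name]}-{labels[name]}")
    return "/".join(parts)
```

With `["subject", "[session]"]` this reproduces the current `sub-<label>[/ses-<label>]` layout, which is the sense in which the generalization could stay backward compatible.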
I think with such generalization, it would allow for establishing BEPs like this, as easily as adding a few (if any) missing entities, and "vetting" a "new" hierarchy layout. With ongoing effort by @tsalo in formalizing the schema, any BIDS tool using that schema, would be able to immediately support such a "novel" layout. The interesting and important questions would be on what metadata to include.
Sorry if I derailed a bit ;-)
oh (sorry for the dump) - I just realized, that it generalizes very nicely for what many (myself included) were missing: per entity level specific metadata, and in general it is
[ent1-<label>_...]_<ent?:plural>.{tsv,json}
where
- `[ent1-<label>_...]` are entities from prior levels, such as now `sub-<label>[_ses-<label>]`
- `<ent?:plural>` is the one for the level. Such as "participants.tsv" (no prior levels).
  - and to some degree it is currently `_scans` but that is where seamless generalization breaks on many points:
    - Since that is the final level at which we have datatypes which are not an entity per se ATM and does not follow `datatype-<label>/` but just `<label>/` naming
    - it does not provide some common details across data types but rather lists all individual "samples" across all data types
BUT, it generalizes nicely into
- `sub-<label>/sub-<label>_sessions.{tsv,json}`: on so many occasions I wondered and people asked: "where do I place per-participant information for different sessions?". The current solution is to serialize it within `participants.tsv`, and BIDS seems to be silent on how to deal with it, so people come up with an ad-hoc cross-product of the two, with a `session` or `session_id` column holding either just the id or `ses-` values
(git)smaug:/mnt/btrfs/datasets/datalad/crawl/openneuro[master]
$> grep -A2 session ds*/participants.tsv
ds001541/participants.tsv:participant_id session run1 run2 run3 run4 Viral_infusion_date MRI_acquisition_date weight group day_post_infusion gender viral_vector
ds001541/participants.tsv-562 2 33 100 66 n/a 2014-01-09 2014-03-10 30.8 exp 60 male ChR2-eYFP
ds001541/participants.tsv-562 1 100 g100 g100 100 2014-01-09 2014-03-11 30.6 exp 61 male ChR2-eYFP
--
ds001653/participants.tsv:participant_id session gender weight acquisition_date breathing_rate condition
ds001653/participants.tsv-sub-jgrAesAWc11R1L ses-1 f 20.6 2017-08-11 150 awake
ds001653/participants.tsv-sub-jgrAesAWc12R ses-1 f 22.4 2017-08-11 240 awake
--
ds001890/participants.tsv:participant_id session sex genotype Weight SpO2 HR Temperature DOB Experiment_Date Age
ds001890/participants.tsv-c1NT 1 M 3xTG 32.3 98 272 35.8 2016-11-22 2017-03-23 3
ds001890/participants.tsv-c1NT 2 M 3xTG 36.2 94 311 35.8 2016-11-22 2017-05-31 6
--
ds002134/participants.tsv:participant_id session genotype virus age sex Weight Temperature DOB Surgery_date Experiment_Date run-1 run-2 run-3 run-4
ds002134/participants.tsv-jgroptoAD100 1 C57BL/6 mCherry 3 M 30 36.3 2018-12-11 2019-04-01 2019-04-20 n/a 10 20 5
ds002134/participants.tsv-jgroptoAD101 1 C57BL/6 mCherry 3 M 29.6 36.6 2018-12-11 2019-04-01 2019-04-20 n/a 5 10 20
--
ds002154/participants.tsv:participant_id session gender condition weight Experiment_Date
ds002154/participants.tsv-1 1 m veh 29.3 2015-10-14
ds002154/participants.tsv-1 2 m psi05 29.3 2015-10-14
--
ds002307/participants.tsv:participant_id DOB rs-fMRI 1 rs-fMRI 2 rs-fMRI 3 rs-fMRI 4 rs-fMRI 5 rs-fMRI 6 rs-fMRI 7 excluded_rs-fMRI_sessions dMRI
ds002307/participants.tsv-Ey112 20160126 20160415 20160417 20160418 20160419 20160421 20160424 20160425 x 20160426
ds002307/participants.tsv-Ey113 20160126 20160415 20160417 20160418 20160419 20160421 20160424 20160425 x 20160426
--
ds002547/participants.tsv:participant_id sex age validation_session
ds002547/participants.tsv-sub-01 F 24.0 1.0
ds002547/participants.tsv-sub-02 M 21.0 1.0
--
ds002995/participants.tsv:participant_id weight age gender num_sessions
ds002995/participants.tsv-sub-007 68 24 F 1
ds002995/participants.tsv-sub-008 70 22 F 2
--
ds003416/participants.tsv:participant_id session_id sex age handedness
ds003416/participants.tsv-cIs1 s1Ax1 male 25 left
ds003416/participants.tsv-cIs1 s1Ax2 male 25 left
--
ds003464/participants.tsv:participant_id session genotype virus Experiment_Date sex weight delta_preference
ds003464/participants.tsv-jgroptoINS501 2 C57BL/6 ChR2-mCherry 2018-09-11 M 32 n/a
ds003464/participants.tsv-jgroptoINS503 2 C57BL/6 ChR2-mCherry 2018-07-25 M n/a 0.11
--
ds003470/participants.tsv:participant_id session_id age sex size weight
ds003470/participants.tsv-sub-01 ses-1 26 F 1.63 55
ds003470/participants.tsv-sub-02 ses-1 18 M 1.82 67
So overall generalization could be
- `[ent1-<label>_...]_<ent?:plural>.{tsv,json}` for a level which has some other entity as sub-level
- `[ent1-<label>_...]_samples.{tsv,json}` - if that is the last level in the "hierarchy" and then it is followed by "datatypes"... but isn't `stim` a datatype? (need to think more ;))
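
The two rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ENTITIES` table (in particular the plural names and the `stim` key) and the helper `level_table` are hypothetical, and the hierarchy is given as plain entity names without the optional-marker syntax.

```python
# (short key, plural) per entity name. The plurals and the "stim" key are
# hypothetical additions, not part of the current BIDS schema.
ENTITIES = {
    "subject": ("sub", "subjects"),
    "session": ("ses", "sessions"),
    "stimulus": ("stim", "stimuli"),
}


def level_table(hierarchy, prior_labels, ext="tsv"):
    """Path of the table describing the level below the entities in prior_labels.

    hierarchy: ordered entity names, e.g. ["subject", "session"].
    prior_labels: labels for the first len(prior_labels) levels of the hierarchy.
    """
    depth = len(prior_labels)
    prefix = [
        f"{ENTITIES[name][0]}-{label}"
        for name, label in zip(hierarchy, prior_labels)
    ]
    if depth < len(hierarchy):
        # Another entity follows: use its plural as the table name.
        table = ENTITIES[hierarchy[depth]][1]
    else:
        # Last level of the hierarchy: fall back to the generic "samples".
        table = "samples"
    filename = "_".join(prefix + [table]) + "." + ext
    return "/".join(prefix + [filename])
```

For the default hierarchy, `level_table(["subject", "session"], ["01"])` gives `sub-01/sub-01_sessions.tsv`, the per-participant sessions file discussed above.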
> Agreed. That is definitely something where I would like to hear the opinion of the psych-DS folks for example.
I should have looked into @Psych-DS earlier. Initiated some dialog on Psych-DS spec google doc. Indeed might align nicely if we could allow for different layout (not ["subject", "[session]"])
Psych-DS 'maintainer' here (we have a tech spec, no released validator software yet) Psych-DS is very firmly in the "BIDS-like" rather than "BIDS" category, and one of the main differences at least in v1 is we are not enforcing ordering of the key-value pairs in directory or filename structure.
A possible use case would be the ability to take a BIDS dataset and "compile out" the behavioral task data, e.g. for an existing pipeline designed for out-of-scanner analysis of task data, or conversely, "compiling in" task data that's collected in a non-BIDS-like form but that is associated with BIDS data. Psych-DS is scoped primarily for behavioral data rather than stimuli, but I think there's no particular reason there couldn't be other clear parallels.
One point to note is that Psych-DS uses/will use JSON-LD metadata, i.e Schema.org/Dataset. A stimulus set version of Psych-DS would probably want to use some other kind of combo of CreativeWork, ImageObject etc
FWIW, added a stub for possible BIDS 2.0 development: https://github.com/bids-standard/bids-2-devel/issues/54