[ENH][BEP028] Specification update for BEP028 BIDS-Prov
This is a work-in-progress PR proposing a specification update for BEP028 BIDS-Prov.
Hi @yarikoptic,
As part of this PR (integration of BEP028 into the BIDS specification), I'm adding the BIDS-Prov context into the repo. For now it's under `src/modality-agnostic-files/provenance-context.json`, but do you see a better place to add it? Thanks!
Hi @effigies,
Here are some quick questions about the schema modifications in this PR. I tested the BIDS examples listed in the description of this PR against the BIDS validator using the modified schema, but I still encounter errors.
- Provenance files (`.json`) inside the `prov` directory are considered as sidecars, leading to `SIDECAR_WITHOUT_DATAFILE` errors for these files
- Provenance files (`.json`) inside the `prov` directory are `NOT_INCLUDED` according to the validator, although I think this is due to the fact that the `src/schema/rules/files/common/modality_agnostic.yaml` file I've added has no impact for now.
Also, files inside derivative datasets lead to `NOT_INCLUDED` / `ALL_FILENAME_RULES_HAVE_ISSUES` / `FILENAME_MISMATCH` / `ENTITY_WITH_NO_LABEL` errors because their names do not conform to BIDS naming. This seems to contradict this part of the spec
Could you please share your thoughts on these? Let me know if I should post them anywhere else.
Thanks,
@satra @yarikoptic: a point on which we'd love your input, w/ @bclenet. It seems to us a bit problematic to use the word "Entity" in the Provenance spec for BIDS, since "entity" in BIDS is already defined (and is different from the meaning we give it in BIDS-Prov). We could easily use a different word (that would still be mapped to prov:Entity in the context file), but which word...
Just found out about "Transput" which seems to be a blanket term for both Inputs and Outputs, would that work for you both? Or any other suggestions?
Thanks!
Here are the outstanding issues I can see, given the current state of the pull requests:
- `SIDECAR_WITHOUT_DATAFILE` errors. I have pointed @bclenet to where this would need to be addressed: https://github.com/bids-standard/bids-validator/blob/c331a16/src/validators/internal/unusedFile.ts. The basic problem is that JSON files are not generally distinguishable from sidecars, which we do want to check apply to some file. We check against the two exceptions to this rule, but that list-inclusion check could be made a function to handle prov (see the sketch after this list).
- Unvalidated prov contents. Use `schema/rules/json/prov.yaml` to define the contents of the files. I hope the existing files will be a good model, but let me know if you need some scaffolding there.
- The arbitrary subdirectories are going to be a pain. The simplest thing would be to drop the idea and have a flat directory, which works right now. The alternative is probably going to involve new concepts in both the file rules and the directory rules (cc @rwblair): https://github.com/bids-standard/bids-specification/blob/aa2dda5b2f279606de450e8d47d4a36f02981e03/src/schema/rules/files/common/modality_agnostic.yaml#L5-L17 https://github.com/bids-standard/bids-specification/blob/aa2dda5b2f279606de450e8d47d4a36f02981e03/src/schema/rules/directories.yaml#L48-L109
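For illustration only (the real check lives in TypeScript in `unusedFile.ts`, and the exact exception list there may differ from this placeholder), the refactor could look roughly like this, sketched in Python:

```python
# Placeholder for the two JSON files the validator currently exempts from
# the sidecar rule; the actual list in unusedFile.ts may differ.
STANDALONE_JSON = {"dataset_description.json", "genetic_info.json"}

def requires_datafile(relative_path: str) -> bool:
    """Return True if this JSON file must act as the sidecar of some data file."""
    name = relative_path.rsplit("/", 1)[-1]
    if name in STANDALONE_JSON:
        return False
    # Proposed addition: provenance records under prov/ stand alone.
    if relative_path.startswith("prov/"):
        return False
    return relative_path.endswith(".json")
```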
I had a chance to talk with @rwblair about the arbitrary subdirectories yesterday. At least from the examples given (`/prov/preprocspm/prov-preprocspm{1,2}_*.json`), the idea is to allow grouping, which BIDS generally does with entities. Why not:

`/prov[/prov-<label>]/prov-<label>_desc-<label>_<suffix>.json`

Although a bit different from how BIDS has done things (data type before first entity directory), the machinery we already have in the schema is sufficient to encode this, and the changes to the validator should not be difficult.
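For instance (group labels and `desc-` values purely illustrative; the `_act`/`_ent` suffixes are taken from later in this thread), this scheme might look like:

```
prov/
└── prov-preprocspm/
    ├── prov-preprocspm_desc-realign_act.json
    └── prov-preprocspm_desc-realign_ent.json
```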
@effigies, thanks for the input on the arbitrary subdirectories! We'll discuss that with Yarik and Camille tomorrow.
> `/prov[/prov-<label>]/prov-<label>_desc-<label>_<suffix>.json`
>
> Although a bit different from how BIDS has done things (data type before first entity directory),
FWIW, I like it since it is generic. It might also be applicable to e.g. BEP044:Stimuli, where ATM stimuli files are not grouped but have a flat `stim-<label>_...otherentities` listing.
But overall it comes down to the question of when it is worth keeping things flat vs. creating those folders, and that is a generic aspect: e.g., if there is a BIDS dataset with only T1w images for 10 subjects, folders do not really make it easier to navigate the data. Somewhat of a use case for
- https://github.com/bids-standard/bids-specification/pull/1809 .
Similarly here: if it is just a single-"stage" derivative dataset (e.g. a BIDS app applied in one go across all subjects), there is no need for subfolders there, right?
I have no objections to keeping it flat. I was under the impression that nesting was important.
@effigies I see now that you proposed to make `prov-<label>/` folders optional -- would that be consistent with the rest of BIDS, and wouldn't it introduce difficulty? I do not remember any other entity for which we have optional folders.
Per our discussion, I would also recommend establishing `prov/provenance.{tsv,json}` with descriptions for those labels. Here I assume that "provenance" is the plural form. This begs the question of whether it should then be a `provenance/` folder, similar to `stimuli/` for `stim-` entities in that BEP, etc.
I lean toward always requiring them for the sake of consistency.
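For illustration, a hypothetical `prov/provenance.tsv` along these lines (the `provenance_label` column name is mentioned later in this thread; the labels are made up):

```tsv
provenance_label	description
preprocspm	Preprocessing pipeline using SPM
groupanalysis	Group-level statistical analysis
```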
> would that be consistent with the rest of BIDS, and wouldn't it introduce difficulty? I do not remember any other entity for which we have optional folders.
We don't, but it is a convention, not a technical problem.
> Per our discussion, I would also recommend establishing `prov/provenance.{tsv,json}` with descriptions for those labels. Here I assume that "provenance" is the plural form. This begs the question of whether it should then be a `provenance/` folder, similar to `stimuli/` for `stim-` entities in that BEP, etc.
>
> I lean toward always requiring them for the sake of consistency.
I defer to the BEP leads to make the proposal they want.
Hi @rwblair,
As discussed earlier, here are the issues I'm currently facing:
- schema | validation of provenance files: I created the `schema/rules/json/prov.yaml` file; are there other things to do in this file for validating the contents of the objects in the Activities (resp. Environments, ProvEntities, Software) arrays?
- schema | validation of provenance-related metadata fields in sidecars: should I create a `schema/rules/sidecars/prov.yaml` to describe that some new provenance-related metadata fields are optional in all sidecars?
- validator | `.json` provenance files are considered as sidecars, hence raising the error `SIDECAR_WITHOUT_DATAFILE` -> I'm afraid I'll not be able to come up with a clean modification of the validator to mark these files as standalone JSON.
- schema | how/where to allow for a `prov/[prov-<label>]` directory in the schema?
- macros | how to add an optional `prov/[prov-<label>]` to the filename templates of the Provenance Files section (using the `MACROS___make_filename_template` macro)?
- CI | pdf rendering fails due to an `assert_no_multiline_links` check which I don't understand
Thanks for your help :)
@bclenet While thinking about the first two points, on how to validate certain parts of the files, I started to realize the nature of the changes that will be required of the schema to validate them.
Here is my understanding of the main rules that this BEP wants to enforce that require information beyond what has historically been used for validation, and the issues they raise:
- `provenance.tsv` `"provenance_label"`: there exists at least one file in the dataset that uses each value in the column as its `prov-` entity label.
  - This is similar to the rule we have for `participants.tsv`, which states that any value in the `participant_id` column must have a corresponding file in the dataset or be referenced in the phenotype directory. For this to work, we had to populate a special field in the dataset context with information about the names of the subject directories. Populating such fields is not representable in the schema presently and must be implemented by the validators.
- Every id used across all prov files is unique with respect to the dataset.
  - When a given file is validated, we build a context for it that has all the information it needs to be validated by the schema. Typically this involves loading a sidecar or a handful of specific associated files. This type of pan-dataset assertion is not possible in the current schema.
- `*_ent.json` `"locatedAt"`: its value exists in the dataset.
  - Doable in the current schema.
The following are all similar:

- `*_ent.json` `"generatedBy"`: its values only reference existing Activity ids.
- `*_act.json` `"AssociatedWith"`: its values only reference existing Software ids.
- `*_act.json` `"Used"`: its values only reference existing Environment ids or ProvEntity ids.
- any BIDS file's `.json` `"GeneratedBy"`: its values only reference existing Activity ids.
- any BIDS file's `.json` `"SidecarGeneratedBy"`: its values only reference existing Activity ids.

These all suffer from the issue above: for any given prov file we are validating, we must load arbitrarily many others and check the value at a specific key inside each of their arrays. We could try to come up with new semantics for schema entries that would allow this. Another option would be to come up with a new way in the schema to aggregate values from multiple files into a single place, and then a way of running checks on the aggregated data. But even if all the, for example, Software objects were gathered in a single place, we'd have another problem...
- The expression language is incapable of iterating through an array of objects and running a check on a specific key for each element. We could extend the expression language with a function like `flatten(array: List[List | dict], key: Optional[str])` (see the sketch below). If the input is an array of arrays, we use the normal flatten semantics of putting all elements of all arrays into a single array and returning it. If the input is an array of objects, we return an array of each object's value at the specified `key`.
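A minimal Python sketch of those `flatten` semantics, together with how it might be used to run such a reference check once (say) all Activity objects were aggregated in one place. The record structure and key names (`Id`, `GeneratedBy`) are assumptions for illustration, not the spec's final vocabulary:

```python
from typing import Optional

def flatten(array: list, key: Optional[str] = None) -> list:
    """Sketch of the proposed expression-language function.

    - list of lists: concatenate all inner elements into a single list
    - list of objects: return each object's value at `key`
    """
    if array and all(isinstance(el, list) for el in array):
        return [item for inner in array for item in inner]
    if key is not None:
        return [obj.get(key) for obj in array if isinstance(obj, dict)]
    return list(array)

# Hypothetical aggregated records; "Id" and "GeneratedBy" follow the rules
# listed above but are assumptions about the final spec.
activities = [{"Id": "act-realign"}, {"Id": "act-smooth"}]
entity = {"Id": "ent-bold", "GeneratedBy": ["act-realign", "act-missing"]}

known = set(flatten(activities, key="Id"))
dangling = [ref for ref in entity["GeneratedBy"] if ref not in known]
print(dangling)  # ['act-missing'] -> references no existing Activity
```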
One thing I like about this proposal is that each JSON file is simple enough to be immediately understood by a human. I was playing around with alternative ways of organizing data from the examples that might be more amenable to the current expression language, and they were all much more difficult to read at a glance. The UIDs in the ids make me think this was not meant to be produced or consumed by humans, but I'm a sucker for looking at any JSON file that comes across my path.
Please let me know if I have misunderstood/misinterpreted any of the rules from the BEP.
@effigies Any comments on my characterization of the schema's shortcomings with respect to the above rules?
@bclenet This only sort of addressed your first two issues; for the remaining four:
- sidecar without datafile - We do need to add a field to the schema to indicate this; I'll bring it up at the next schema hack and take ownership of adding its interpretation to the JavaScript validator.
- prov subdirectories - I've got a local branch that's capable of doing this; I'll try to push it upstream when it's ready.
- Macros and CI - I still need to look into these.
Hi @rwblair, Thanks for the comment and review.
Most of the rules you propose are related to the validation of the provenance metadata itself. I believe these could be brought to the validator / schema later? I agree that this is easier to validate once the metadata is aggregated; I understood you found the Python code to do that. I also plan to write Python code for defect detection in a provenance graph (e.g. isolated nodes, undefined objects, etc.), which could later help with validation; a rough sketch follows.
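A minimal sketch of what such defect checks could look like, assuming (purely for illustration) records carrying `Id`, `GeneratedBy`, and `Used` keys:

```python
def find_defects(activities: list[dict], entities: list[dict]) -> list[str]:
    """Flag undefined references and isolated nodes in a provenance graph."""
    defects = []
    activity_ids = {a["Id"] for a in activities}

    # Undefined objects: an entity's GeneratedBy must point at a defined Activity.
    for ent in entities:
        for ref in ent.get("GeneratedBy", []):
            if ref not in activity_ids:
                defects.append(f"{ent['Id']}: GeneratedBy references undefined {ref}")

    # Isolated nodes: activities that generate nothing and use nothing.
    generated_from = {ref for ent in entities for ref in ent.get("GeneratedBy", [])}
    for act in activities:
        if act["Id"] not in generated_from and not act.get("Used"):
            defects.append(f"{act['Id']}: isolated node")
    return defects
```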
About ids, we need to be more precise in the specification. The idea is that we want to use BIDS URIs (e.g. `bids::sub-01/anat/sub-01_T1w.nii.gz`) as much as possible, also because they are more human-readable. But at the same time, we want to ensure uniqueness of the id outside the dataset, which is not the case for BIDS URIs as currently defined.
Does that make sense to you?
> @bclenet While thinking about the first two points, on how to validate certain parts of the files, I started to realize the nature of the changes that will be required of the schema to validate them.
@rwblair thanks a lot for this detailed overview. I second @bclenet's comment in that we were thinking of much lighter validation using the BIDS schema (mainly making sure that the newly created files have proper naming and that the keys inside the metadata files correspond to what is in the spec). I was hoping we could focus on this first, before we open the BEP for community review.
@rwblair @effigies: Could we plan a quick meeting to identify the remaining steps we need to follow in order to update the schema for BIDS-Prov (BEP028)? That way we'll have a clearer idea of when we can open for community review. Thank you!
@cmaumet @bclenet Let's schedule a call: https://www.when2meet.com/?33661829-Yw724
Please forward to anybody you want on there.
> @cmaumet @bclenet Let's schedule a call: https://www.when2meet.com/?33661829-Yw724
@cmaumet Did you want to be on the call?
Yes! I think our messages just crossed each other. I just sent a calendar invite for Dec 4 (4pm my time). Did you get it? (And thanks for following up!)