Provide clear(er) and visible to users distinction between columns which could be "overridden" in JSON sidecars and which could not
This issue was inspired by
- https://github.com/bids-standard/bids-specification/pull/2046/files#discussion_r1945589499
which is to address
- #2045
to clarify the current state of the schema/specification. Thus I do not think this issue is pertinent to that PR, which is not intended to change anything in the schema.
- #1838
introduces notions of "fully-defined" and "conventional" columns, where "fully-defined" are defined through "BIDS schema" directly (type, unit, ...) and "conventional" defined through "JSON Sidecar schema" (in definition).
I think it does indeed make sense to "restrict" some column "properties" to indeed be
- "fully-defined" (e.g. to disallow assignment of "Units" and "Levels" for `_id` fields), but still allow assigning "TermURL" or even "Description"; or
- "conventional", which allows overloading any (?) of the properties per our definitions in common principles: tabular files: columns
- "user defined" are pretty much those "conventional".
The rule in #1838 on handling different types of columns says
- If a column has a schema definition (type 1), validate using that schema. Warn on attempted override.
which makes it very user-visible and thus should be depicted in the specification.
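A minimal sketch of how a tool might apply the quoted rule. The `SCHEMA_DEFINED` mapping and the `select_definition` helper are hypothetical names used for illustration only; they are not part of the BIDS schema or validator.

```python
import warnings

# Hypothetical stand-in for the columns that the BIDS schema defines directly.
SCHEMA_DEFINED = {
    "participant_id": {"type": "string", "pattern": "^sub-[a-zA-Z0-9]+$"},
}


def select_definition(column, sidecar):
    """Pick the definition used to validate `column`.

    Schema-defined columns always use the schema; a sidecar entry for such a
    column is reported as an attempted override and otherwise ignored.
    """
    if column in SCHEMA_DEFINED:
        if column in sidecar:
            warnings.warn(
                f"sidecar attempts to override schema-defined column {column!r}"
            )
        return SCHEMA_DEFINED[column]
    # Any other column is described by the dataset's sidecar, if at all.
    return sidecar.get(column)
```

Because the warning is what users would actually see, the specification would need to state which columns fall into the first branch.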
But ATM
- in https://bids-specification.readthedocs.io/en/stable/common-principles.html#tabular-files I do not see any notion of different "kinds" of columns, and they are all stated to allow e.g. `Units` specification, although for some (e.g. all the `{entity}_id`s) it makes no sense.
- In https://bids-specification.readthedocs.io/en/stable/modality-agnostic-files.html#participants-file I see no statement that `participant_id` is "fully defined" and the others in the list are "conventional".
- The word "conventional" is nowhere in the spec.
Related to this, at the level of schema:
- seeing "JSON-schema" confused me into thinking we are somehow talking about json-schema, but it is about "definitions allowed in JSON sidecar files per our spec for .tsv files". I will call it "JSON Sidecar Schema"
- "explicit better than implicit" so I think reliance on seeing "JSON" record in YAML as indicator is quite suboptimal. I think it would be better to be explicit in the schema and announce what "definitions" (Unit? TermURL? Levels? Description? or all of those) are not "overridable" for a given column
  - partially because of such "hidden semantic" there are entries with just Description overloaded with some shorter one thus causing duplication and ambiguity (which description to use where?)... e.g.

    ```yaml
    species:
      name: species
      display_name: Species
      description: |
        The `species` column SHOULD be a binomial species name from the
        [NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi)
        (for example, `homo sapiens`, `mus musculus`, `rattus norvegicus`).
        For backwards compatibility, if `species` is absent, the participant is assumed to be
        `homo sapiens`.
      definition: {
        "Description": "binomial species name from the NCBI Taxonomy (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi)"
      }
    ```
- Related: metadata fields (in .json files) are very alike to tsv columns and already have many characteristics of the 'definition' but only using "schema way" and more flexible at times (`anyOf`s) and "Levels" defined via Enums... ref bids-2-devel/issues/85 on metadata "life-cycle"
- It remains unclear why we need both (the original "YAML schema" vs this "JSON Sidecar Schema")!? since it kinda happened historically, maybe they should be harmonized somehow? e.g. go back to "YAML schema" but with explicit listing of what is disallowed and known mapping of what in "JSON sidecar schema" overloads "YAML schema" fields
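To make the "explicit better than implicit" point above concrete, here is one possible shape for such an annotation. The `not_overridable` key and the `forbidden_overrides` helper are hypothetical; nothing like them exists in the BIDS schema today, and the per-column lists are only examples.

```python
# Hypothetical schema excerpt: each column explicitly lists the sidecar keys
# that a dataset is NOT allowed to override.
COLUMNS = {
    "participant_id": {"not_overridable": ["Units", "Levels"]},
    "species": {"not_overridable": []},
}


def forbidden_overrides(column, sidecar_entry, columns=COLUMNS):
    """Return the sidecar keys that override something the schema has locked."""
    locked = columns.get(column, {}).get("not_overridable", [])
    return sorted(key for key in sidecar_entry if key in locked)


# "TermURL" is allowed here, "Units" is not.
offending = forbidden_overrides(
    "participant_id", {"Units": "mm", "TermURL": "https://example.org/term"}
)
```

With a listing like this, the rendered specification could show the locked keys next to each column instead of relying on the presence of a JSON record.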
Before trying to propose a solution, I have a few questions on @effigies' statements in https://github.com/bids-standard/bids-specification/pull/1838
- > ... our column description object is not as powerful as JSON schema

  on what aspects are missing from our "BIDS YAML schema" which are present in "JSON Sidecar Schema"? IMHO we should strive to equate them one way (make YAML schema more expressive) or another (just adopt "JSON Sidecar" uniformly).
- what does "fully defined" really mean, i.e. is any field allowed to be overwritten?
seeing "JSON-schema" confused me into thinking we are somehow talking about json-schema, but it is about "definitions allowed in JSON sidecar files per our spec for .tsv files". I will call it "JSON Sidecar Schema"
We were talking about json-schema, not "JSON Sidecar Schema". The definitions of fields in objects.metadata and objects.columns use json-schema concepts, and allow us to validate values using json-schema validators.
What you are calling "JSON Sidecar Schema", I would call a "column definition" or "column description". It is in some sense a schema, but its value is almost purely interpretive. It does not permit much validation:
- LongName
- Description
- Levels
- Units
- Delimiter
- TermURL
- HED
Levels can be turned into an enum for validation, and we can use the presence of Units to infer that the value should be numeric, but that's about it.
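As a rough illustration of how little validation a column description permits, the sketch below derives a JSON-schema fragment from one. The function name is made up for this example, and giving `Levels` precedence over `Units` is an assumption of the sketch, not something the specification states.

```python
def sidecar_to_json_schema(description):
    """Derive the validation constraints implied by a sidecar column description."""
    schema = {}
    if "Levels" in description:
        # The keys of Levels are the permitted values; the descriptions are dropped.
        schema["enum"] = list(description["Levels"])
    elif "Units" in description:
        # The presence of Units suggests a numeric column.
        schema["type"] = "number"
    # LongName, Description, Delimiter, TermURL and HED add no constraints here.
    return schema
```

For example, `{"Levels": {"a": "first", "b": "second"}}` yields `{"enum": ["a", "b"]}`, while a description carrying only `Description` yields an empty schema.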
"explicit better than implicit" so I think reliance on seeing "JSON" record in YAML as indicator is quite suboptimal. I think it would be better to be explicit in the schema and announce what "definitions" (Unit? TermURL? Levels? Description? or all of those) are not "overridable" for a given column
* partially because of such "hidden semantic" there are entries with just Description overloaded with some shorter one thus causing duplication and ambiguity (which description to use where?)... e.g.

  ```
  species:
    name: species
    display_name: Species
    description: |
      The `species` column SHOULD be a binomial species name from the
      [NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi)
      (for example, `homo sapiens`, `mus musculus`, `rattus norvegicus`).
      For backwards compatibility, if `species` is absent, the participant is assumed to be
      `homo sapiens`.
    definition: {
      "Description": "binomial species name from the NCBI Taxonomy (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi)"
    }
  ```
The description field here is what is displayed in the specification:
The "Description" field was taken from the example in Microscopy - Recommended Participants Data:
The idea here is that a tool attempting to interpret a column will first look in the sidecar, and can fall back to objects.columns.${column}.definition.
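A minimal sketch of that lookup order, assuming the sidecar and the schema's `objects.columns` have already been loaded into plain dictionaries. The function and argument names are illustrative.

```python
def column_description(column, sidecar, schema_columns):
    """Return the description a tool should use to interpret `column`.

    The dataset's sidecar wins; otherwise fall back to the default
    `definition` stored with the column in the schema, if there is one.
    """
    if column in sidecar:
        return sidecar[column]
    return schema_columns.get(column, {}).get("definition")


# Schema excerpt mirroring the `species` entry quoted above.
schema_columns = {
    "species": {
        "definition": {
            "Description": "binomial species name from the NCBI Taxonomy "
            "(https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi)"
        }
    }
}

default = column_description("species", {}, schema_columns)
custom = column_description(
    "species", {"species": {"Description": "custom"}}, schema_columns
)
```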
- Related: metadata fields (in .json files) are very alike to tsv columns and already have many characteristics of the 'definition' but only using "schema way" and more flexible at times (`anyOf`s) and "Levels" defined via Enums... ref bids-2-devel/issues/85 on metadata "life-cycle"
BIDS has no meta-language for datasets to define their own terms in sidecar files, as it does for TSV columns, so there is nothing to harmonize. json-schema is a natural choice for validating JSON sidecars.
- It remains unclear why we need both (the original "YAML schema" vs this "JSON Sidecar Schema")!? since it kinda happened historically, maybe they should be harmonized somehow? e.g. go back to "YAML schema" but with explicit listing of what is disallowed and known mapping of what in "JSON sidecar schema" overloads "YAML schema" fields
I'm probably repeating myself, but this happened because BIDS allows columns to be overridden, and people wanted to be able to validate their columns based on those definitions. We therefore needed to come up with a method of validating custom columns, and deal with the fact that there are some special "pre-defined custom" columns.
We do not want to penalize people for providing more correct specifications of their columns by imposing a different validation technique, so encoding what we could in the sidecar "column description" format achieves the following:
- It provides a baseline definition that people can copy-paste and modify as needed.
- It provides a fallback definition that tools can use when no custom definition is found in the dataset.
- It forces us to eat our own dogfood in both the schema and validator. If we tell people to use a technique that we do not use in the schema ourselves, and that the validator does not exercise by default, then we're admitting its inadequacy and inviting bugs.
Before trying to propose solution, I have a few questions on @effigies statements in #1838
* > ... our column description object is not as powerful as JSON schema

  on what aspects are missing from our "BIDS YAML schema" which are present in "JSON Sidecar Schema"? IMHO we should strive to equate them one way (make YAML schema more expressive) or another (just adopt "JSON Sidecar" uniformly).
I believe this expresses a confusion that I have addressed above.
I limited the above post to clarifications and reasoning. Looking forward:
If we want to re-converge, then I think we need a way of making the custom column descriptions that are found in datasets isomorphic to the default column descriptions found in the schema. We cannot simply use the same structure, for backwards compatibility's sake, so we need a mapping.
The column descriptions in BIDS Schema contain JSON-schema terms (e.g., type: number) that allow us to validate values along with interpretive terms (e.g., unit: mm) that are used in rendering and may be used by downstream tools.
The column descriptions in sidecar files contain a broader set of interpretive terms, and a much narrower set of terms that can be used to validate values.
Full isomorphism is not necessary, but we should be able to start at either the "YAML definition" or the "Sidecar description" node, and, following any path, get the same result for "JSON schema" and "Interpretive fields":
```mermaid
graph TD
    yaml[YAML definition] --prune--> schema[JSON schema]
    yaml --convert--> sidecar[Sidecar description] --convert--> yaml & schema & interp[Interpretive fields]
    yaml --prune--> interp
```
Once we can do that, unification is possible. If we cannot, there will continue to be two types of columns.
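The two "prune" edges are the easy part and can be sketched as plain key filtering. The key sets below are an assumption made for this example (they follow the JSON-schema vs. interpretive distinction described above), and the `length` column is purely illustrative.

```python
# Assumed split of YAML definition keys into validation and interpretation.
VALIDATABLE = {"type", "format", "pattern", "enum", "minimum", "maximum"}
INTERPRETIVE = {"name", "display_name", "description", "unit"}


def prune(definition, keep):
    """Keep only the keys of a YAML column definition listed in `keep`."""
    return {key: value for key, value in definition.items() if key in keep}


# Illustrative definition using the `type: number` / `unit: mm` terms.
length = {"name": "length", "type": "number", "unit": "mm"}

json_schema = prune(length, VALIDATABLE)
interpretive = prune(length, INTERPRETIVE)
```

The "convert" edges are where the mapping gets harder, since the sidecar form has no direct slot for terms such as `pattern`.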
In brief (@effigies and/or @CodyCBakerPhD could provide more detail or correct me): on 2025-08-01, somewhere not so deep in NH (and/or VT), we agreed that a viable path forward is to
- extend sidecar JSON definition with missing elements (like minimal/maximal)
- move to use sidecar JSON definition for everything (potentially properly migrating the schema, maybe even adding such migration into a new "schema_migrate" helper to be able to load/migrate older schema versions)
One touched aspect was also type vs format and IIRC we boiled down to
- use formats as they also determine data storage type (str or float or int or bool), and semantic treatment (e.g. parsing of datetime str)
- do not bother allowing extending them with regexes ATM.
- do not bother allowing multiple formats ATM (later could be extended with supporting lists)
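A small sketch of what "format determines storage type and semantic treatment" could mean for a tool reading TSV cells. The format names here are a hypothetical subset; the spellings `true`/`false` for booleans and ISO 8601 strings for datetimes are assumptions of this example.

```python
from datetime import datetime


def _parse_bool(raw):
    # Assumed spellings; anything else raises KeyError.
    return {"true": True, "false": False}[raw]


# One parser per format: the format fixes both the storage type and how the
# string is interpreted. A single format per column, no regex extensions.
PARSERS = {
    "string": str,
    "number": float,
    "integer": int,
    "boolean": _parse_bool,
    "datetime": datetime.fromisoformat,
}


def parse_cell(raw, fmt="string"):
    """Convert a raw TSV cell into the Python type implied by its format."""
    return PARSERS[fmt](raw)
```

Supporting a list of formats later would mostly mean trying several parsers in order.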
I'm including notes from discussion with @yarikoptic and @CodyCBakerPhD on Friday, below. A couple final questions:
1. Do we want to adopt a single form for schema and sidecar definitions? If so, do we want:
   - (i) JSON-Schema-compatible definition, converted on-the-fly to sidecar by tools that need it?
   - (ii) Sidecar-style definition, converted on-the-fly to JSON schema for validation for tools that need it?
   - (iii) A superset definition that can be converted to either sidecar or JSON schema?
The problem with (1.i) is that level descriptions are not possible with enums, but as long as we cross-reference with `objects.enums`, it should be fine (unless there are two objects in `objects.enums` with the same values). (1.ii) would be pretty simple, although how we handle things like `participant_id` with `pattern: "^sub-[a-zA-Z0-9]+$"` is unclear. (1.iii) would allow us to iterate on the schema without being tied to either JSON schema or the sidecar definition fields.
For example,
https://github.com/bids-standard/bids-specification/blob/6f790a7d612fd42eb38608319e5f7cc0022a9e17/src/schema/objects/columns.yaml#L76-L91
Right now this is (1.i) and we could create Levels with empty strings as descriptions, but what if instead we wrote:
```yaml
component:
  name: component
  display_name: Component
  description: |
    Description of the spatial axis or label of quaternion component associated with the channel.
    For example, `x`,`y`,`z` for position channels,
    or `quat_x`, `quat_y`, `quat_z`, `quat_w` for quaternion orientation channels.
  type: string
  levels: [x, y, z, quat_x, quat_y, quat_z, quat_w]  # keys in objects.enums
```
This would translate to:
```json
{
  "component": {
    "LongName": "Component",
    "Description": "Description of the spatial axis or label of quaternion component associated with the channel. For example, `x`,`y`,`z` for position channels, or `quat_x`, `quat_y`, `quat_z`, `quat_w` for quaternion orientation channels.",
    "Format": "string",
    "Levels": {
      "x": "The x dimension of the coordinate system.",
      "y": "The y dimension of the coordinate system.",
      "z": "The z dimension of the coordinate system.",
      "quat_x": "The quaternion x dimension of the coordinate system.",
      "quat_y": "The quaternion y dimension of the coordinate system.",
      "quat_z": "The quaternion z dimension of the coordinate system.",
      "quat_w": "The quaternion w dimension of the coordinate system."
    }
  }
}
```
And JSON schema:
```json
{
  "type": "string",
  "enum": [
    "x",
    "y",
    "z",
    "quat_x",
    "quat_y",
    "quat_z",
    "quat_w"
  ]
}
```
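The two translations above could be produced from the same superset definition roughly as follows. This is only a sketch: the `ENUMS` dictionary is a simplified, hypothetical stand-in for `objects.enums`, and the helper names are invented for the example.

```python
# Simplified stand-in for objects.enums, keyed by enum name.
ENUMS = {
    key: {"value": key, "description": f"The {label} dimension of the coordinate system."}
    for key, label in [
        ("x", "x"), ("y", "y"), ("z", "z"),
        ("quat_x", "quaternion x"), ("quat_y", "quaternion y"),
        ("quat_z", "quaternion z"), ("quat_w", "quaternion w"),
    ]
}


def to_sidecar(definition, enums=ENUMS):
    """Render a superset definition as a sidecar column description."""
    entry = {
        "LongName": definition["display_name"],
        # Collapse the YAML block scalar into a single line.
        "Description": " ".join(definition["description"].split()),
        "Format": definition["type"],
    }
    if "levels" in definition:
        entry["Levels"] = {
            enums[key]["value"]: enums[key]["description"]
            for key in definition["levels"]
        }
    return {definition["name"]: entry}


def to_json_schema(definition, enums=ENUMS):
    """Render the same definition as a JSON-schema fragment."""
    schema = {"type": definition["type"]}
    if "levels" in definition:
        schema["enum"] = [enums[key]["value"] for key in definition["levels"]]
    return schema


component = {
    "name": "component",
    "display_name": "Component",
    "description": "Description of the spatial axis\nor label of quaternion component.\n",
    "type": "string",
    "levels": ["x", "y", "z", "quat_x", "quat_y", "quat_z", "quat_w"],
}
```

Keying `levels` by enum name rather than by value is what sidesteps the problem of two enum objects sharing the same value.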
Notes from Friday's discussion
| Attribute | BIDS Schema | Sidecar JSON definition | Kind |
|---|---|---|---|
| Name | `name` | Key | Interpretive |
| Long name | `display_name` | `LongName` | Interpretive |
| Description | `description` | `Description` | Interpretive |
| Type | `type` ("string", "number", "integer", "boolean") | Implicit; number (if units) or string (else) | Validatable |
| Format | `format`/`pattern` | None | Validatable |
| Units | `unit` | `Units` | Interpretive |
| Enum | `enum` | `Levels` keys | Validatable |
| Level descriptions | None | `Levels` values | Interpretive |
| Minimum | `minimum` | None | Validatable |
| Maximum | `maximum` | None | Validatable |
Conclusions
- 89+ hard-coded in validator as valid for age (for some range of BIDS)
  - Not in schema
- Minimum/Maximum added to sidecar definitions
- "Format" added to sidecar using fields in `objects.formats`
  - Types are a subset of formats
  - Fall back to implicit string/number
- "Levels" remains permissible when "Format"/"Units" defined
- Deprecate 89+ in spec, setting maximum to 89.
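Taken together, these conclusions suggest a validation routine along the following lines. This is a sketch under assumptions: the `Format`, `Minimum` and `Maximum` key spellings follow the notes above but are not yet part of the specification, only the `integer`, `number` and `string` formats are handled, and missing values such as `n/a` are out of scope.

```python
def validate_value(raw, description):
    """Check one TSV cell against an (extended) sidecar column description."""
    # Fall back to implicit typing: number if Units is given, else string.
    implicit = "number" if "Units" in description else "string"
    fmt = description.get("Format", implicit)

    # Levels stays usable alongside Format/Units.
    if "Levels" in description and raw not in description["Levels"]:
        return False
    if fmt not in ("integer", "number"):
        return True  # other formats are not checked in this sketch

    try:
        value = int(raw) if fmt == "integer" else float(raw)
    except ValueError:
        return False
    if "Minimum" in description and value < description["Minimum"]:
        return False
    if "Maximum" in description and value > description["Maximum"]:
        return False
    return True


# With the maximum set to 89, the legacy "89+" string no longer validates.
age = {"Units": "year", "Maximum": 89}
results = [validate_value(v, age) for v in ("42", "89", "90", "89+")]
```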