bids-specification
SCHEMA: Expression syntax
In order to express many rules that are found in BIDS, it's necessary to be able to evaluate arithmetic and logical expressions on arrays and objects (or their equivalents in other languages).
There are two primary types of expression:
- Selectors: boolean expressions that indicate whether a rule applies to a specific context
- Assertions: boolean expressions that the file (or entire dataset) must satisfy in order not to emit a warning or error
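To make the split between the two concrete, here is a minimal Python sketch of how a rule might be applied. The `Rule` class, the `apply_rule` helper, the use of plain Python callables in place of parsed expressions, and the assumption that all selectors must be true for a rule to apply are hypothetical and only for illustration; none of it is taken from the schema or an existing validator.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Expr = Callable[[Context], bool]  # hypothetical stand-in for a parsed expression


@dataclass
class Rule:
    """Hypothetical rule: selectors decide applicability, checks must then hold."""
    selectors: List[Expr]
    checks: List[Expr]
    message: str = "check failed"


def apply_rule(rule: Rule, context: Context) -> List[str]:
    """Return the issues raised by a rule; empty if it does not apply or passes."""
    # Assumption: the rule applies only if every selector evaluates to true.
    if not all(selector(context) for selector in rule.selectors):
        return []
    # Each assertion that does not hold emits one issue.
    return [rule.message for check in rule.checks if not check(context)]


rule = Rule(
    selectors=[lambda ctx: "VolumeTiming" in ctx["sidecar"]],
    checks=[lambda ctx: len(ctx["sidecar"]["VolumeTiming"]) > 0],
    message="VolumeTiming must not be empty",
)

print(apply_rule(rule, {"sidecar": {}}))                       # rule not selected
print(apply_rule(rule, {"sidecar": {"VolumeTiming": []}}))     # selected, check fails
print(apply_rule(rule, {"sidecar": {"VolumeTiming": [0.0]}}))  # selected, check passes
```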
We have identified a number of operations that are necessary:
- Parentheses
- (Sub)string comparison
- Arithmetic and numeric comparison
- Contains/is-contained-by
- Negation
- Array indexing
- Object property lookup (`obj.property` or `obj["property"]`)
- Compute result on entire array
  - e.g., `max(arr) > 0` or `any(arr > 0)`
- Find length of array
  - Possibly also string length and shape of multi-dimensional arrays
  - If needed we could make these properties, so looking up `obj.size` or `obj.shape[0]`
- Find type of object
  - Could be made an `obj.type` property
We will need a limited language that either is commonly implemented or could be fairly easily implemented in any programming language.
One possibility would be the Common Expression Language. That currently has implementations in Go and Python. We could consider a restricted subset of this language to make reimplementation simpler.
Happy to hear other suggestions.
@tsalo made some notes on the sorts of data we need to operate over and operations they support.
The "keywords" are a namespace of values available to an expression.
- Keywords:
  - dataset
    - Contains dataset_description fields in the object.
    - Summary lists of object types in the dataset?
      - E.g., a list of the modalities under `dataset.modalities`.
  - sidecar
    - The associated JSON file's contents, as an object.
  - data?
    - The contents of the data file. Can have many different forms, I guess.
  - associated
    - An object of associated files, primarily used in the selectors. The keys of "associated" are generally referenced directly as keywords in the checks.
- File types:
  - tsv: The contents of a TSV data file.
    - Attributes:
      - `.shape`: A two-element tuple with the number of rows and columns.
    - Sidecar Attributes:
      - `sidecar.Columns`: Gets populated automatically, to make it easier to validate tsv and tsv.gz files in the same manner.
  - image: A NIfTI file or something similar.
    - Attributes:
      - `.shape`
      - `.ndim`
  - json: A JSON file (sidecar will be a json). This file type is loaded as an "object".
    - See "field types: object" for info about attributes and operators.
- Field types:
  - object
    - Attributes:
      - `.[field]`: The keys in the object.
    - Operators:
      - (not) contains
  - array
    - Attributes:
      - `.size`: The number of elements in the array.
  - integer/number
    - Operators:
      - `==`, `>`, `<`, `>=`, `<=`, `!=`
  - string
    - Attributes:
      - `.pattern`: The format or pattern field from the object's schema entry. This may be None.
    - Operators:
      - `!=`, `==`: Direct string comparison
      - (not) contains: A substring is in the string
      - (not) in: string is in an array of strings OR string is a substring in the comparison string
  - boolean
    - Operators:
      - `==`: Must be "true" or "false"
      - `!=`: Hopefully we don't need this
- Functions:
  - `type()`: Return the type of the object as a string.
For discussion: In the short term, it might be easiest to use `eval`-able JS expressions and socially enforce that they should look roughly like a subset of CEL or stay restricted to the operators in the previous post.
It will likely be quite a small subset that are not basically the same as Python, and we can make a schema-version-incrementing shift to an independently-specified language after we have a working validator.
In https://docs.google.com/spreadsheets/d/1aaizx8iV96u9xHZahJWcV8-bCcn0Jxt7bYq06JsgOAc/edit#gid=1797621391 I have sorted all of the current (as of 73d9968b55ce9b87371522d7efa8cda57978e705) selectors and checks that have been implemented. Here is a summary:
- Comparisons
  - Mostly `==` and `!=`, but we should also permit `>`, `<`, `>=`, `<=`
- Boolean expressions
  - Parentheses, `&&`, `||`, `!`
- Object key lookup
  - e.g., `"VolumeTiming" in sidecar`
- List element lookup
  - e.g., `aslcontext.volume_type.includes("m0scan")`
- Substring comparison
  - e.g., `entities.task.includes("rest")`
- Type comparison
  - e.g., `type(sidecar.FlipAngle) == 'array'`
Comparisons and boolean expressions seem straightforward. Using special characters instead of (`not`, `and`, and `or`) is probably less confusing. `x in y` works nicely in Python and Javascript; could this cause problems with other languages?
For functions, if we would rather do functions than rely on methods like `.includes()`, here are some thoughts:
Function | Definition | Example |
---|---|---|
`intersects(<list>, <list>)` | `true` if any shared elements, `false` if none | `intersects(dataset.modalities, ["PET", "MRI"])` |
`match(<str>, <regex>)` | Regular expression match | `match(extension, '.gz$')` |
`type(<var>)` | One of `'int'`, `'float'`, `'array'`, `'object'`, `'bool'` | `type(datatypes) == 'list'` |
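For reference, a rough Python sketch of what these three functions could look like. This is only one possible reading of the table: `type_of` is a hypothetical name chosen to avoid shadowing Python's builtin `type`, the `'string'`, `'null'` and `'unknown'` return values are assumptions that go beyond the five names listed, and using a search (rather than a full match) for `match` is guessed from the `'.gz$'` example.

```python
import re
from typing import Any, Sequence


def intersects(a: Sequence[Any], b: Sequence[Any]) -> bool:
    """True if the two lists share at least one element, false otherwise."""
    return any(item in b for item in a)


def match(value: str, pattern: str) -> bool:
    """True if the regular expression matches anywhere in the string (assumption)."""
    return re.search(pattern, value) is not None


def type_of(value: Any) -> str:
    """Name the expression-language type of a Python value."""
    if value is None:
        return "null"          # assumption: not in the table above
    if isinstance(value, bool):
        return "bool"          # test bool before int: bool subclasses int in Python
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, (list, tuple)):
        return "array"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        return "string"        # assumption: not in the table above
    return "unknown"           # assumption: fallback for anything else


print(intersects(["MRI", "EEG"], ["PET", "MRI"]))
print(match(".nii.gz", ".gz$"))
print(type_of([1, 2, 3]))
```

Note that this sketch reports Python lists as `'array'`, so the `type(datatypes) == 'list'` example would need to compare against `'array'` under this reading.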
Considering #1111, we should add:
Function | Definition | Example |
---|---|---|
`min(<list>)` | Minimum value in `<list>` | `min(associations.events.duration)` |
`max(<list>)` | Maximum value in `<list>` | `max(associations.events.onset)` |
One thing to note is that we're often being pretty sloppy about uses of quotes. We're going to need to clearly distinguish identifiers (`entities`) and strings (`"subject"`).
cc @rwblair @nellh
Note that `type()` is only used in the schema to match `object` and `array` at the moment, so we can punt on the question of `int`/`float`/`number` for now.
Missed in the above comment:
- Array length lookup
  - e.g., `nifti_header.dim[4] == sidecar.LabelingDuration.length`

Function | Definition | Example |
---|---|---|
`length(<array>)` | Number of elements in array | `length(sidecar.LabelingDuration) == nifti_header.dim[4]` |
Clear rules for missing fields and propagation through expressions would be good, so we can tell whether checks like `"fieldname" in sidecar` are necessary before `length(sidecar.fieldname) > 0`. I would propose the following results for comparison operators:
Operation | Result | Comment |
---|---|---|
`null == false` | `false` | |
`null == true` | `false` | |
`null != false` | `true` | |
`null != true` | `true` | |
`null == null` | `true` | |
`null == 1` | `false` | Also `<`, `>`, `<=` and `>=` |
`"VolumeTiming" in null` | `false` | |
`bool(null)` | `false` | This is not explicit, but a selector/check with value `null` evaluates `false` |
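The comparison rows could be sketched in Python roughly as follows, with `None` standing in for `null`. The helper names (`eq`, `ne`, `lt`, `key_in`, `as_bool`) are hypothetical, and `as_bool` assumes that a selector/check only ever produces true, false or null.

```python
from typing import Any, Optional


def eq(a: Any, b: Any) -> bool:
    """`==`: null is equal only to null."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def ne(a: Any, b: Any) -> bool:
    """`!=`: the negation of `==`, so `null != x` is true for any non-null x."""
    return not eq(a, b)


def lt(a: Any, b: Any) -> bool:
    """`<`: any ordering comparison involving null is false (same idea for >, <=, >=)."""
    if a is None or b is None:
        return False
    return a < b


def key_in(key: str, obj: Optional[dict]) -> bool:
    """`key in obj`: false when obj is null."""
    return obj is not None and key in obj


def as_bool(value: Any) -> bool:
    """Final selector/check result: null counts as false."""
    return value is True


print(eq(None, False), ne(None, True), eq(None, None))
print(lt(None, 1), key_in("VolumeTiming", None), as_bool(None))
```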
And the following for all other operations:
Operation | Result | Comment |
---|---|---|
`sidecar.MissingValue` | `null` | |
`null.anything` | `null` | |
`null[0]` | `null` | |
`null && true` | `null` | |
`null \|\| true` | `null` | |
`!null` | `null` | |
`intersects(list, null)` | `null` | Maybe this should be `false`? Likely to be the "top" operation so little practical impact. |
`match(null, pattern)` | `null` | Same as `intersects` |
`min(null)` | `null` | |
`max(null)` | `null` | |
`length(null)` | `null` | |
`type(null)` | `"null"` | |
This might not be 100% consistent, and I'm happy to hear alternatives.
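One way to get this propagation behaviour without special-casing every function is a small wrapper. The Python sketch below is only an illustration: `lookup`, `null_safe` and the wrapped names are hypothetical, `None` stands in for `null`, and `type(null)` would have to stay outside the wrapper since it returns `"null"` rather than propagating.

```python
from typing import Any, Callable


def lookup(obj: Any, key: Any) -> Any:
    """`obj.key` / `obj[key]`: missing fields and lookups on null give null."""
    if obj is None:
        return None
    try:
        return obj[key]
    except (KeyError, IndexError, TypeError):
        return None


def null_safe(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a function so that any null argument makes the whole result null."""
    def wrapper(*args: Any) -> Any:
        if any(arg is None for arg in args):
            return None
        return func(*args)
    return wrapper


length = null_safe(len)
minimum = null_safe(min)
maximum = null_safe(max)
logical_and = null_safe(lambda a, b: a and b)
logical_or = null_safe(lambda a, b: a or b)
logical_not = null_safe(lambda a: not a)

sidecar = {"RepetitionTime": 2.0}
missing = lookup(sidecar, "MissingValue")  # null, because the field is absent
print(missing, lookup(missing, "anything"), length(missing))
print(logical_and(missing, True), logical_or(missing, True), logical_not(missing))
```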
I like it, fail towards a not true state since selectors need to evaluate to true to trigger for checks. With this could `in` be understood as syntactic sugar for `type(myObj.myKey) != "null"`? or do we expect null literals as valid values anywhere in contexts?
Not sure where to put this, but annoying difference between python and js:
>>> bool([])
False
> Boolean([])
true
> I like it, fail towards a not true state since selectors need to evaluate to true to trigger for checks. With this could `in` be understood as syntactic sugar for `type(myObj.myKey) != "null"`?

Seems correct. And you could simplify further with `myObj.myKey != null`.

> or do we expect null literals as valid values anywhere in contexts?
I don't think we do. I think the "n/a" string might be accepted in some cases, but I don't remember ever seeing null.
> Not sure where to put this, but annoying difference between python and js:
>
>     >>> bool([])
>     False
>     > Boolean([])
>     true
Could disallow array/object/numeric return values for checks/selectors. Must be true/false/null.
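A minimal Python sketch of that restriction, sidestepping the truthiness difference entirely. `check_result` is a hypothetical helper name, and raising an error for other types is an assumption; a validator could equally report a schema problem instead.

```python
from typing import Any


def check_result(value: Any) -> bool:
    """Accept only true/false/null as the result of a selector or check."""
    if value is None:
        return False  # null evaluates as false
    if isinstance(value, bool):
        return value
    # Arrays, objects, numbers and strings are rejected instead of being
    # coerced, so Python/JS truthiness rules never come into play.
    raise TypeError(
        f"expression must yield true, false or null, got {type(value).__name__}"
    )


print(check_result(True), check_result(False), check_result(None))
```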
Well here's a weird'un:
https://github.com/bids-standard/bids-specification/blob/0ba6eac18629702d20f5a83643c45c8b428edd99/src/schema/rules/sidecars/entity_rules.yaml#L60-L67
Basically it's a restriction on the valid values of `units`. We could either create a new field `Units__phase` with enumerated values, or move this into `checks` and give it its own message.
I think going into checks is good. For nirs, sidecar rules that start to get complicated I've been putting in checks.
Well now I'm waffling because this could be written as a sidecar mutual exclusion style set of rules with the enumerated field. TIMTOWTDI - ask me tomorrow and I'll have found a 3rd way to hem and haw on.
With the `ACCELChannelCount` and similar, I guess we need a `count()` function:
Function | Definition | Example |
---|---|---|
`count(<list>, <val>)` | Number of instances of `<val>` in `<list>` | `count(columns.type, "ACCEL") == associations.nirs.ACCELChannelCount` |
On the basis that it's more an operation on the list, I put it first, but if there's a builtin in Javascript that goes the other way, I'm open to swapping the order.
And for completeness:
Operation | Result |
---|---|
`count(null, val)` | `null` |
`count(list, null)` | `null` |
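A short Python sketch of `count()` with the null handling from the table, using `None` for `null`. The sample column types and channel count are made up for illustration.

```python
from typing import Any, Optional, Sequence


def count(values: Optional[Sequence[Any]], target: Any) -> Optional[int]:
    """Number of instances of `target` in `values`; null if either argument is null."""
    if values is None or target is None:
        return None
    return sum(1 for value in values if value == target)


# Made-up data standing in for columns.type and associations.nirs.ACCELChannelCount
column_types = ["ACCEL", "NIRSCWAMPLITUDE", "ACCEL", "GYRO"]
accel_channel_count = 2
print(count(column_types, "ACCEL") == accel_channel_count)
```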
Doesn't look like it's implemented in the current validator. My first inclination in JS is to use filter and length:
columns.type.filter((x) => x === "ACCEL").length === associations.nirs.ACCELChannelCount
But to implement a filter in our little language we'd either need lambda syntax or an implied variable to use in the conditional to filter on. The jq select operation comes to mind as an example of the second option. Can't link directly to select, but it's under this section: https://stedolan.github.io/jq/manual/?#Basicfilters
@rwblair I've found an issue that needs sorting:
VolumeTiming' is not monotonically increasing. 'VolumeTiming' is the time at which each volume was acquired during the acquisition, referring to the start of each readout in the ASL timeseries. Use this field instead of the 'RepetitionTime' field in the case that the ASL timeseries have a non-uniform time distance between acquired volumes. The list must have the same length as the 'sub-
Okay to add `sorted()` or `sort()` to the expression language?
`Sorted [ ] -> boolean` seems fine to me. If we add a new function for every rule I suppose it'd still be a win over the old way of implementing rules in the validator.
I was thinking `sorted(array) -> array`, and the check is `sorted(sidecar.VolumeTiming) == sidecar.VolumeTiming`. That seems potentially useful for other contexts, while `sorted(array) -> bool` seems like it would have fewer use cases.
Ah I forgot that `==` was defined for arrays. Works for me.
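For the record, a minimal Python sketch of the `sorted(array) -> array` variant and the resulting check. `sorted_array` is a hypothetical name (to avoid shadowing Python's builtin), returning null for a null input is an assumption by analogy with the other functions, and the timing values are made up.

```python
from typing import Any, List, Optional, Sequence


def sorted_array(values: Optional[Sequence[Any]]) -> Optional[List[Any]]:
    """Return a new ascending-sorted array; null in, null out (assumption)."""
    if values is None:
        return None
    return sorted(values)


def is_sorted(values: Optional[Sequence[Any]]) -> bool:
    """The check from the thread: sorted(x) == x, using array equality."""
    return values is not None and sorted_array(values) == list(values)


print(is_sorted([0.0, 2.5, 5.0]))  # increasing
print(is_sorted([0.0, 5.0, 2.5]))  # out of order
print(is_sorted([0.0, 2.5, 2.5]))  # repeated value
```

One thing this makes visible: `sorted(x) == x` also passes when values repeat, so it checks for non-decreasing rather than strictly increasing order. Whether that is acceptable for `VolumeTiming` is a separate question.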