bids-specification SCHEMA: Expression syntax

In order to express many rules that are found in BIDS, it's necessary to be able to evaluate arithmetic and logical expressions on arrays and objects (or their equivalents in other languages).

There are two primary types of expression:

- Selectors: boolean expressions that indicate whether a rule applies to a specific context
- Assertions: boolean expressions that the file (or entire dataset) must satisfy in order not to emit a warning or error

We have identified a number of operations that are necessary:

- Parentheses
- (Sub)string comparison
- Arithmetic and numeric comparison
- Contains/is-contained-by
- Negation
- Array indexing
- Object property lookup (obj.property or obj["property"])
- Compute result on entire array
  - e.g., max(arr) > 0 or any(arr > 0)
- Find length of array
  - Possibly also string length and shape of multi-dimensional arrays
  - If needed we could make these properties, so looking up obj.size or obj.shape[0]
- Find type of object
  - Could be made an obj.type property

We will need a limited language that either is commonly implemented or could be fairly easily implemented in any programming language.

One possibility would be the Common Expression Language. That currently has implementations in Go and Python. We could consider a restricted subset of this language to make reimplementation simpler.

Happy to hear other suggestions.

Apr 16 '22 16:04 effigies

@tsalo made some notes on the sorts of data we need to operate over and operations they support.

The "keywords" are a namespace of values available to an expression.

Keywords:
- dataset
  - Contains dataset_description fields in the object.
  - Summary lists of object types in the dataset?
    - E.g., a list of the modalities under dataset.modalities.
- sidecar
  - The associated JSON file's contents, as an object.
- data?
  - The contents of the data file. Can have many different forms, I guess.
- associated
  - An object of associated files, primarily used in the selectors. The keys of "associated" are generally referenced directly as keywords in the checks.
File types:
- tsv: The contents of a TSV data file.
  - Attributes:
    - .shape: A two-element tuple with the number of rows and columns.
  - Sidecar Attributes:
    - sidecar.Columns: Gets populated automatically, to make it easier to validate tsv and tsv.gz files in the same manner.
- image: A NIfTI file or something similar.
  - Attributes:
    - .shape
    - .ndim
- json: A JSON file (sidecar will be a json). This file type is loaded as an "object".
  - See "field types: object" for info about attributes and operators.
Field types:
- object
  - Attributes:
    - .[field]: The keys in the object.
  - Operators:
    - (not) contains
- array
  - Attributes:
    - .size: The number of elements in the array.
- integer/number
  - Operators:
    - ==, >, <, >=, <=, !=
- string
  - Attributes:
    - .pattern: The format or pattern field from the object's schema entry. This may be None.
  - Operators:
    - !=, ==: Direct string comparison
    - (not) contains: A substring is in the string
    - (not) in: string is in an array of strings OR string is a substring in the comparison string
- boolean
  - Operators:
    - ==: Must be "true" or "false"
    - !=: Hopefully we don't need this
Functions:
- type(): Return the type of the object as a string.

Apr 20 '22 17:04 effigies

For discussion: In the short term, it might be easiest to use eval-able JS expressions and socially enforce that they should look roughly like a subset of CEL or stay restricted to the operators in the previous post.

The set of expressions that are not basically the same as in Python will likely be quite small, and we can make a schema-version-incrementing shift to an independently specified language after we have a working validator.

Apr 20 '22 21:04 effigies

In https://docs.google.com/spreadsheets/d/1aaizx8iV96u9xHZahJWcV8-bCcn0Jxt7bYq06JsgOAc/edit#gid=1797621391 I have sorted all of the current (as of 73d9968b55ce9b87371522d7efa8cda57978e705) selectors and checks that have been implemented. Here is a summary:

- Comparisons
  - Mostly == and !=, but we should also permit >, <, >=, <=
- Boolean expressions
  - Parentheses, &&, ||, !
- Object key lookup
  - e.g., "VolumeTiming" in sidecar
- List element lookup
  - e.g., aslcontext.volume_type.includes("m0scan")
- Substring comparison
  - e.g., entities.task.includes("rest")
- Type comparison
  - e.g., type(sidecar.FlipAngle) == 'array'

Comparisons and boolean expressions seem straightforward. Using the symbolic operators (!, &&, ||) instead of the keywords not, and, and or is probably less confusing. x in y works nicely in Python and JavaScript; could this cause problems with other languages?

For functions, if we would rather do functions than rely on methods like .includes(), here are some thoughts:

| Function | Definition | Example |
| --- | --- | --- |
| `intersects(<list>, <list>)` | `true` if any shared elements, `false` if none | `intersects(dataset.modalities, ["PET", "MRI"])` |
| `match(<str>, <regex>)` | Regular expression match | `match(extension, '\.gz$')` |
| `type(<var>)` | One of `'int'`, `'float'`, `'array'`, `'object'`, `'bool'` | `type(datatypes) == 'array'` |
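For illustration only, the three functions could be sketched in Python roughly as follows. This is not the validator's implementation: the name `type_of` (chosen to avoid shadowing Python's builtin `type`), the `'string'` type name, and the choice of search-anywhere semantics for `match` are all assumptions.

```python
import re


def intersects(first, second):
    """Return True if the two lists share at least one element."""
    return any(item in second for item in first)


def match(string, pattern):
    """Return True if the regular expression matches anywhere in the string.

    Assumption: search semantics, so patterns anchor themselves with ^ and $.
    """
    return re.search(pattern, string) is not None


def type_of(value):
    """Return the expression-language type name of a value as a string."""
    if isinstance(value, bool):  # must come before int: bool subclasses int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"  # assumption: not listed in the table above
    if isinstance(value, (list, tuple)):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"no expression type for {value!r}")


# Hypothetical context values, for demonstration only.
modalities = ["MRI", "EEG"]
print(intersects(modalities, ["PET", "MRI"]))
print(match(".nii.gz", r"\.gz$"))
print(type_of([90, 45]) == "array")
```

Plain functions like these keep the expression language independent of host-language methods such as `.includes()`.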

Considering #1111, we should add:

| Function | Definition | Example |
| --- | --- | --- |
| `min(<list>)` | Minimum value in `<list>` | `min(associations.events.duration)` |
| `max(<list>)` | Maximum value in `<list>` | `max(associations.events.onset)` |

One thing to note is that we're often being pretty sloppy about uses of quotes. We're going to need to clearly distinguish identifiers (entities) and strings ("subject").

cc @rwblair @nellh

Aug 19 '22 19:08 effigies

Note that type() is only used in the schema to match object and array at the moment, so we can punt on the question of int/float/number for now.

Aug 25 '22 21:08 effigies

Missed in the above comment:

- Array length lookup
  - e.g., nifti_header.dim[4] == sidecar.LabelingDuration.length

| Function | Definition | Example |
| --- | --- | --- |
| `length(<array>)` | Number of elements in array | `length(sidecar.LabelingDuration) == nifti_header.dim[4]` |

Aug 26 '22 01:08 effigies

Clear rules for missing fields and propagation through expressions would be good, so we can tell whether checks like "fieldname" in sidecar are necessary before length(sidecar.fieldname) > 0. I would propose the following results for comparison operators:

| Operation | Result | Comment |
| --- | --- | --- |
| `null == false` | `false` | |
| `null == true` | `false` | |
| `null != false` | `true` | |
| `null != true` | `true` | |
| `null == null` | `true` | |
| `null == 1` | `false` | Also `<`, `>`, `<=` and `>=` |
| `"VolumeTiming" in null` | `false` | |
| `bool(null)` | `false` | This is not explicit, but a selector/check with value `null` evaluates `false` |

And the following for all other operations:

| Operation | Result | Comment |
| --- | --- | --- |
| `sidecar.MissingValue` | `null` | |
| `null.anything` | `null` | |
| `null[0]` | `null` | |
| `null && true` | `null` | |
| `null \|\| true` | `null` | |
| `!null` | `null` | |
| `intersects(list, null)` | `null` | Maybe this should be `false`? Likely to be the "top" operation so little practical impact. |
| `match(null, pattern)` | `null` | Same as intersects |
| `min(null)` | `null` | |
| `max(null)` | `null` | |
| `length(null)` | `null` | |
| `type(null)` | `"null"` | |
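To make the two tables concrete, here is a minimal Python sketch of the proposed semantics, with `None` standing in for `null`. All helper names (`compare`, `contains`, `lookup`, `logical_and`, `logical_or`, `logical_not`, `length`, `applies`) are hypothetical. The tables only show `null` combined with `true` for `&&` and `||`; treating any null operand as propagating is an assumption.

```python
import operator

_ORDERING = {"<": operator.lt, ">": operator.gt, "<=": operator.le, ">=": operator.ge}


def compare(op, left, right):
    """Comparison operators always produce a boolean, even with null operands."""
    if left is None or right is None:
        both_null = left is None and right is None
        if op == "==":
            return both_null
        if op == "!=":
            return not both_null
        return False  # <, >, <= and >= against null
    if op == "==":
        return left == right
    if op == "!=":
        return left != right
    return _ORDERING[op](left, right)


def contains(key, container):
    """`key in container`; a null container contains nothing."""
    return container is not None and key in container


def lookup(obj, key):
    """`obj.key` or `obj[key]`; missing fields and null objects give null."""
    if obj is None:
        return None
    if isinstance(obj, dict):
        return obj.get(key)
    return obj[key]


def logical_and(left, right):
    """`left && right`; assumption: any null operand propagates."""
    if left is None or right is None:
        return None
    return left and right


def logical_or(left, right):
    """`left || right`; assumption: any null operand propagates."""
    if left is None or right is None:
        return None
    return left or right


def logical_not(value):
    """`!value`; null stays null."""
    return None if value is None else not value


def length(value):
    """`length(value)`; null stays null."""
    return None if value is None else len(value)


def applies(result):
    """A selector or check whose final value is null is treated as false."""
    return result is True


# Hypothetical sidecar without the field: the check does not fire and does not crash.
sidecar = {"RepetitionTime": 2.0}
print(applies(compare(">", length(lookup(sidecar, "VolumeTiming")), 0)))
```

Under these rules, a guard such as `"VolumeTiming" in sidecar` before `length(sidecar.VolumeTiming) > 0` would be redundant, since the unguarded expression already evaluates to `false`.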

This might not be 100% consistent, and I'm happy to hear alternatives.

Aug 26 '22 12:08 effigies

I like it: fail towards a not-true state, since selectors need to evaluate to true to trigger checks. With this, could `in` be understood as syntactic sugar for type(myObj.myKey) != "null"? Or do we expect null literals as valid values anywhere in contexts?

Aug 26 '22 15:08 rwblair

Not sure where to put this, but annoying difference between python and js:

>>> bool([])
False

> Boolean([])
true

Aug 26 '22 15:08 rwblair

> I like it: fail towards a not-true state, since selectors need to evaluate to true to trigger checks. With this, could `in` be understood as syntactic sugar for type(myObj.myKey) != "null"?

Seems correct. And you could simplify further with myObj.myKey != null.

> Or do we expect null literals as valid values anywhere in contexts?

I don't think we do. I think the "n/a" string might be accepted in some cases, but I don't remember ever seeing null.

Aug 26 '22 15:08 effigies

> Not sure where to put this, but annoying difference between python and js:
>
>     >>> bool([])
>     False
>
>     > Boolean([])
>     true

Could disallow array/object/numeric return values for checks/selectors. Must be true/false/null.
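A minimal sketch of that restriction in Python, assuming a hypothetical `final_result` helper applied to the top-level value of every selector or check. Rejecting arrays, objects, and numbers outright sidesteps the truthiness differences between languages.

```python
def final_result(value):
    """Accept only True, False, or None (null) as a selector/check result.

    Null evaluates as false; any other type is an error rather than being
    coerced with host-language truthiness rules.
    """
    if value is None:
        return False
    if isinstance(value, bool):
        return value
    raise TypeError(
        f"selector/check must evaluate to true, false, or null, got {value!r}"
    )


print(final_result(True))
print(final_result(None))

try:
    final_result([])  # Python would treat [] as false, JavaScript as true
except TypeError as error:
    print(f"rejected: {error}")
```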

Aug 26 '22 15:08 effigies

Well here's a weird'un:

https://github.com/bids-standard/bids-specification/blob/0ba6eac18629702d20f5a83643c45c8b428edd99/src/schema/rules/sidecars/entity_rules.yaml#L60-L67

Basically it's a restriction on the valid values of units. We could either create a new field Units__phase with enumerated values, or move this into checks and give it its own message.

Aug 26 '22 17:08 effigies

I think going into checks is good. For nirs, sidecar rules that start to get complicated I've been putting in checks.

Aug 26 '22 17:08 rwblair

Well now I'm waffling because this could be written as a sidecar mutual exclusion style set of rules with the enumerated field. TIMTOWTDI

Ask me tomorrow and I'll have found a third way to hem and haw about it.

Aug 26 '22 17:08 rwblair

With the ACCELChannelCount and similar, I guess we need a count() function:

| Function | Definition | Example |
| --- | --- | --- |
| `count(<list>, <val>)` | Number of instances of `<val>` in `<list>` | `count(columns.type, "ACCEL") == associations.nirs.ACCELChannelCount` |

On the basis that it's more an operation on the list, I put it first, but if there's a builtin in Javascript that goes the other way, I'm open to swapping the order.

And for completeness:

| Operation | Result |
| --- | --- |
| `count(null, val)` | `null` |
| `count(list, null)` | `null` |
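As a rough Python sketch (not the validator's implementation), `count()` with the null behaviour above might look like this. The `columns` and `nirs_sidecar` values are made-up stand-ins for `columns` and `associations.nirs`.

```python
def count(items, value):
    """Number of elements of `items` equal to `value`; null in, null out."""
    if items is None or value is None:
        return None
    return sum(1 for item in items if item == value)


# Hypothetical stand-ins for `columns` and `associations.nirs`.
columns = {"type": ["NIRSCWAMPLITUDE", "ACCEL", "ACCEL", "ACCEL"]}
nirs_sidecar = {"ACCELChannelCount": 3}

print(count(columns["type"], "ACCEL") == nirs_sidecar["ACCELChannelCount"])
print(count(None, "ACCEL"))
```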

Sep 01 '22 13:09 effigies

Doesn't look like it's implemented in the current validator. My first inclination in JS is to use filter and length: columns.type.filter((x) => x === "ACCEL").length === associations.nirs.ACCELChannelCount

But to implement a filter in our little language we'd either need lambda syntax or an implied variable to use in the conditional to filter on. The jq select operation comes to mind as an example of the second option. Can't link directly to select, but it's under this section: https://stedolan.github.io/jq/manual/?#Basicfilters

Sep 01 '22 14:09 rwblair

@rwblair I've found an issue that needs sorting:

'VolumeTiming' is not monotonically increasing. 'VolumeTiming' is the time at which each volume was acquired during the acquisition, referring to the start of each readout in the ASL timeseries. Use this field instead of the 'RepetitionTime' field in the case that the ASL timeseries have a non-uniform time distance between acquired volumes. The list must have the same length as the 'sub-[_ses-][_acq-][_rec-][_run-]_aslcontext.tsv', and the numbers must be non-negative and monotonically increasing. If 'VolumeTiming' is defined, this requires acquisition time (TA) to be defined via 'AcquisitionDuration'.

Okay to add sorted() or sort() to the expression language?

Feb 05 '23 18:02 effigies

sorted(array) -> boolean seems fine to me. If we add a new function for every rule, I suppose it'd still be a win over the old way of implementing rules in the validator.

Feb 06 '23 16:02 rwblair

I was thinking sorted(array) -> array, and the check is sorted(sidecar.VolumeTiming) == sidecar.VolumeTiming. That seems potentially useful for other contexts, while sorted(array) -> bool seems like it would have fewer use cases.
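A quick Python sketch of that check, using a hypothetical helper name and made-up sidecar values. Note that comparing an array against its sorted copy accepts repeated values, so it tests for a non-decreasing sequence; a strictly increasing requirement would need an additional condition.

```python
def is_sorted(values):
    """Equivalent of the check `sorted(values) == values`."""
    return sorted(values) == list(values)


# Hypothetical sidecar contents, for illustration only.
good_sidecar = {"VolumeTiming": [0.0, 1.5, 3.0, 6.0]}
bad_sidecar = {"VolumeTiming": [0.0, 3.0, 1.5, 6.0]}

print(is_sorted(good_sidecar["VolumeTiming"]))
print(is_sorted(bad_sidecar["VolumeTiming"]))
```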

Feb 06 '23 16:02 effigies

Ah I forgot that == was defined for arrays. Works for me.

Feb 06 '23 17:02 rwblair
