SCHEMA: Expression syntax

Open effigies opened this issue 2 years ago • 15 comments

In order to express many rules that are found in BIDS, it's necessary to be able to evaluate arithmetic and logical expressions on arrays and objects (or their equivalents in other languages).

There are two primary types of expression:

  • Selectors: boolean expressions that indicate whether a rule applies to a specific context
  • Assertions: boolean expressions that the file (or entire dataset) must satisfy in order not to emit a warning or error

We have identified a number of operations that are necessary:

  • Parentheses
  • (Sub)string comparison
  • Arithmetic and numeric comparison
  • Contains/is-contained-by
  • Negation
  • Array indexing
  • Object property lookup (obj.property or obj["property"])
  • Compute result on entire array
    • e.g., max(arr) > 0 or any(arr > 0)
  • Find length of array
    • Possibly also string length and shape of multi-dimensional arrays
    • If needed we could make these properties, so looking up obj.size or obj.shape[0]
  • Find type of object
    • Could be made an obj.type property
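
To make the selector/assertion split concrete, here is a minimal Python sketch. It is only an illustration: the context keys ("suffix", "sidecar") and the rule itself are hypothetical, and plain Python callables stand in for whatever expression language is eventually chosen.

```python
# Hypothetical rule, written as plain Python predicates over a context dict.
# The keys "suffix" and "sidecar" are assumptions made for this example.

def selector(ctx):
    """Does the rule apply to this file?"""
    return ctx["suffix"] == "bold" and "VolumeTiming" in ctx["sidecar"]


def assertion(ctx):
    """Must hold for the file to pass; uses whole-array operations."""
    timing = ctx["sidecar"]["VolumeTiming"]
    # Equivalent in spirit to: length(arr) > 0 && max(arr) > 0
    return len(timing) > 0 and max(timing) > 0


def evaluate(ctx):
    """Return None if the rule does not apply, otherwise True or False."""
    if not selector(ctx):
        return None
    return assertion(ctx)


ctx = {"suffix": "bold", "sidecar": {"VolumeTiming": [0.0, 1.5, 3.0]}}
result = evaluate(ctx)
```

A rule whose selector is false is skipped rather than failed, which is why `evaluate` distinguishes `None` from `False`.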

We will need a limited language that either is commonly implemented or could be fairly easily implemented in any programming language.

One possibility would be the Common Expression Language. That currently has implementations in Go and Python. We could consider a restricted subset of this language to make reimplementation simpler.

Happy to hear other suggestions.

effigies avatar Apr 16 '22 16:04 effigies

@tsalo made some notes on the sorts of data we need to operate over and operations they support.

The "keywords" are a namespace of values available to an expression.


  • Keywords:
    • dataset
      • Contains dataset_description fields in the object.
      • Summary lists of object types in the dataset?
        • E.g., a list of the modalities under dataset.modalities.
    • sidecar
      • The associated JSON file's contents, as an object.
    • data?
      • The contents of the data file. Can have many different forms, I guess.
    • associated
      • An object of associated files, primarily used in the selectors. The keys of "associated" are generally referenced directly as keywords in the checks.
  • File types:
    • tsv: The contents of a TSV data file.
      • Attributes:
        • .shape: A two-element tuple with the number of rows and columns.
      • Sidecar Attributes:
        • sidecar.Columns: Gets populated automatically, to make it easier to validate tsv and tsv.gz files in the same manner.
    • image: A NIfTI file or something similar.
      • Attributes:
        • .shape
        • .ndim
    • json: A JSON file (sidecar will be a json). This file type is loaded as an "object".
      • See "field types: object" for info about attributes and operators.
  • Field types:
    • object
      • Attributes:
        • .[field]: The keys in the object.
      • Operators:
        • (not) contains
    • array
      • Attributes:
        • .size: The number of elements in the array.
    • integer/number
      • Operators:
        • ==, >, <, >=, <=, !=
    • string
      • Attributes:
        • .pattern: The format or pattern field from the object's schema entry. This may be None.
      • Operators:
        • !=, ==: Direct string comparison
        • (not) contains: A substring is in the string
        • (not) in: string is in an array of strings OR string is a substring in the comparison string
    • boolean
      • Operators:
        • ==: Must be "true" or "false"
        • !=: Hopefully we don't need this
  • Functions:
    • type(): Return the type of the object as a string.
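
As a rough sketch of how such a keyword namespace could be exposed to expressions, the Python class below wraps nested dictionaries so that `obj.field` lookup and `"field" in obj` both work. The class name `Namespace` and the sample values are assumptions for illustration, not part of any existing validator.

```python
class Namespace:
    """Read-only view over a dict that supports obj.key and 'key' in obj."""

    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, i.e. for keys.
        try:
            value = self._data[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested objects so chained lookups keep working.
        return Namespace(value) if isinstance(value, dict) else value

    def __contains__(self, key):
        return key in self._data


# Sample context; all values here are made up for the example.
context = Namespace({
    "dataset": {"modalities": ["MRI", "PET"]},
    "sidecar": {"RepetitionTime": 2.0, "Columns": ["onset", "duration"]},
})

tr = context.sidecar.RepetitionTime
has_tr = "RepetitionTime" in context.sidecar
```

In this sketch a missing key raises `AttributeError`; an expression language could instead map missing fields to a null value.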

effigies avatar Apr 20 '22 17:04 effigies

For discussion: In the short term, it might be easiest to use eval-able JS expressions and socially enforce that they should look roughly like a subset of CEL or stay restricted to the operators in the previous post.

The subset of expressions that do not look basically the same as in Python will likely be quite small, and we can make a schema-version-incrementing shift to an independently specified language after we have a working validator.

effigies avatar Apr 20 '22 21:04 effigies

In https://docs.google.com/spreadsheets/d/1aaizx8iV96u9xHZahJWcV8-bCcn0Jxt7bYq06JsgOAc/edit#gid=1797621391 I have sorted all of the current (as of 73d9968b55ce9b87371522d7efa8cda57978e705) selectors and checks that have been implemented. Here is a summary:

  • Comparisons
    • Mostly == and !=, but we should also permit >, <, >=, <=
  • Boolean expressions
    • Parentheses, &&, ||, !
  • Object key lookup
    • e.g., "VolumeTiming" in sidecar
  • List element lookup
    • e.g., aslcontext.volume_type.includes("m0scan")
  • Substring comparison
    • e.g., entities.task.includes("rest")
  • Type comparison
    • e.g., type(sidecar.FlipAngle) == 'array'

Comparisons and boolean expressions seem straightforward. Using symbolic operators (!, &&, ||) instead of keywords (not, and, or) is probably less confusing. x in y works nicely in Python and JavaScript; could this cause problems with other languages?

For functions, if we would rather do functions than rely on methods like .includes(), here are some thoughts:

Function                    Definition                                        Example
intersects(<list>, <list>)  true if any shared elements, false if none        intersects(dataset.modalities, ["PET", "MRI"])
match(<str>, <regex>)       Regular expression match                          match(extension, '.gz$')
type(<var>)                 One of 'int', 'float', 'array', 'object', 'bool'  type(datatypes) == 'array'
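
For reference, the three functions could be sketched in Python roughly as follows. This is an illustration under assumptions: `match` is taken to mean an unanchored search (as the '.gz$' example suggests), the name `type_of` is used to avoid shadowing Python's builtin, and the 'string' return value is an addition not listed in the table.

```python
import re


def intersects(a, b):
    """True if the two lists share at least one element."""
    return any(item in b for item in a)


def match(string, pattern):
    """True if the regular expression matches anywhere in the string."""
    return re.search(pattern, string) is not None


def type_of(value):
    """Map a Python value onto the type names from the table."""
    if isinstance(value, bool):  # bool must be tested before int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, (list, tuple)):
        return "array"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        return "string"  # assumption: not one of the listed names
    raise TypeError(f"unsupported value: {value!r}")


shares_modality = intersects(["MRI", "EEG"], ["PET", "MRI"])
is_gzipped = match(".nii.gz", r".gz$")
```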

Considering #1111, we should add:

Function     Definition               Example
min(<list>)  Minimum value in <list>  min(associations.events.duration)
max(<list>)  Maximum value in <list>  max(associations.events.onset)

One thing to note is that we're often being pretty sloppy about uses of quotes. We're going to need to clearly distinguish identifiers (entities) and strings ("subject").

cc @rwblair @nellh

effigies avatar Aug 19 '22 19:08 effigies

Note that type() is only used in the schema to match object and array at the moment, so we can punt on the question of int/float/number for now.

effigies avatar Aug 25 '22 21:08 effigies

Missed in the above comment:

  • Array length lookup
    • e.g., nifti_header.dim[4] == sidecar.LabelingDuration.length

Function         Definition                   Example
length(<array>)  Number of elements in array  length(sidecar.LabelingDuration) == nifti_header.dim[4]

effigies avatar Aug 26 '22 01:08 effigies

Clear rules for missing fields and propagation through expressions would be good, so we can tell whether checks like "fieldname" in sidecar are necessary before length(sidecar.fieldname) > 0. I would propose the following results for comparison operators:

Operation               Result  Comment
null == false           false
null == true            false
null != false           true
null != true            true
null == null            true
null == 1               false   Also <, >, <= and >=
"VolumeTiming" in null  false
bool(null)              false   This is not explicit, but a selector/check with value null evaluates false
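
A minimal Python sketch of these comparison rules, using `None` for null. The helper names (`compare`, `contains`, `truthy`) are hypothetical; the point is only that ordering comparisons and `in` need an explicit null guard, because Python would otherwise raise `TypeError`.

```python
import operator

_OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt,
    "<=": operator.le, ">=": operator.ge,
}


def compare(op, left, right):
    """Evaluate a comparison where None plays the role of null."""
    if op in ("==", "!="):
        # Equality is well defined for null: null == null is true.
        return _OPS[op](left, right)
    if left is None or right is None:
        # Ordering comparisons involving null are false.
        return False
    return _OPS[op](left, right)


def contains(key, container):
    """'key in container', with a null container giving false."""
    if container is None:
        return False
    return key in container


def truthy(result):
    """A selector/check whose value is null counts as false."""
    return result is True
```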

And the following for all other operations:

Operation               Result  Comment
sidecar.MissingValue    null
null.anything           null
null[0]                 null
null && true            null
null || true            null
!null                   null
intersects(list, null)  null    Maybe this should be false? Likely to be the "top" operation so little practical impact.
match(null, pattern)    null    Same as intersects
min(null)               null
max(null)               null
length(null)            null
type(null)              "null"

This might not be 100% consistent, and I'm happy to hear alternatives.
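
One way to get this propagation behavior is a small wrapper that returns null whenever any argument is null. The Python sketch below is only an illustration: the names (`propagate_null`, `get`, `and_`, ...) are hypothetical, treating an out-of-range index like a missing key is an assumption, and `type(null)` is deliberately left out because it is the one operation that does not propagate.

```python
from functools import wraps


def propagate_null(func):
    """Wrap func so that any None argument makes the result None."""
    @wraps(func)
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return func(*args)
    return wrapper


@propagate_null
def get(obj, key):
    """obj.key or obj[index]; missing keys and bad indices give None."""
    try:
        return obj[key]
    except (KeyError, IndexError):
        return None


and_ = propagate_null(lambda a, b: a and b)
or_ = propagate_null(lambda a, b: a or b)
not_ = propagate_null(lambda a: not a)
length = propagate_null(len)
minimum = propagate_null(min)
maximum = propagate_null(max)

sidecar = {"LabelingDuration": [1.8, 1.8, 1.8]}
n_present = length(get(sidecar, "LabelingDuration"))
n_missing = length(get(sidecar, "MissingValue"))
```

With these rules, `length(sidecar.fieldname) > 0` evaluates to false for a missing field without a preceding `"fieldname" in sidecar` guard, provided the comparison itself treats null as false.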

effigies avatar Aug 26 '22 12:08 effigies

I like it: fail towards a not-true state, since selectors need to evaluate to true for checks to trigger. With this, could "in" be understood as syntactic sugar for type(myObj.myKey) != "null"? Or do we expect null literals as valid values anywhere in contexts?

rwblair avatar Aug 26 '22 15:08 rwblair

Not sure where to put this, but annoying difference between python and js:

>>> bool([])
False
> Boolean([])
true

rwblair avatar Aug 26 '22 15:08 rwblair

I like it: fail towards a not-true state, since selectors need to evaluate to true for checks to trigger. With this, could "in" be understood as syntactic sugar for type(myObj.myKey) != "null"?

Seems correct. And you could simplify further with myObj.myKey != null.

or do we expect null literals as valid values anywhere in contexts?

I don't think we do. I think the "n/a" string might be accepted in some cases, but I don't remember ever seeing null.

effigies avatar Aug 26 '22 15:08 effigies

Not sure where to put this, but annoying difference between python and js:

>>> bool([])
False
> Boolean([])
true

Could disallow array/object/numeric return values for checks/selectors. Must be true/false/null.
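
A sketch of what that restriction might look like in a Python implementation: the final value of a selector or check is accepted only if it is true, false or null, so the language-specific truthiness of arrays never comes into play. The function name `coerce_result` is hypothetical.

```python
def coerce_result(value):
    """Reduce the final value of a selector/check to a strict boolean.

    Only True, False and None (null) are accepted; None counts as False.
    Arrays, objects, numbers and strings are rejected instead of being
    passed through language-specific truthiness rules.
    """
    if value is None:
        return False
    if isinstance(value, bool):
        return value
    raise TypeError(
        "selector/check must evaluate to true, false or null, "
        f"got {type(value).__name__}"
    )


passed = coerce_result(True)
skipped = coerce_result(None)
```

Rejecting `[]` outright avoids having to decide whether the Python or the JavaScript interpretation is the intended one.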

effigies avatar Aug 26 '22 15:08 effigies

Well here's a weird'un:

https://github.com/bids-standard/bids-specification/blob/0ba6eac18629702d20f5a83643c45c8b428edd99/src/schema/rules/sidecars/entity_rules.yaml#L60-L67

Basically it's a restriction on the valid values of units. We could either create a new field Units__phase with enumerated values, or move this into checks and give it its own message.

effigies avatar Aug 26 '22 17:08 effigies

I think going into checks is good. For NIRS, I've been putting sidecar rules that start to get complicated into checks.

rwblair avatar Aug 26 '22 17:08 rwblair

Well now I'm waffling because this could be written as a sidecar mutual exclusion style set of rules with the enumerated field. TIMTOWTDI

  • ask me tomorrow and I'll have found a 3rd way to hem and haw on.

rwblair avatar Aug 26 '22 17:08 rwblair

With the ACCELChannelCount and similar, I guess we need a count() function:

Function              Definition                             Example
count(<list>, <val>)  Number of instances of <val> in <list>  count(columns.type, "ACCEL") == associations.nirs.ACCELChannelCount

On the basis that it's more an operation on the list, I put it first, but if there's a builtin in Javascript that goes the other way, I'm open to swapping the order.

And for completeness:

Operation          Result
count(null, val)   null
count(list, null)  null
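
A minimal Python sketch of `count()` with the null behavior from the table. The sample channel types and the expected count are made up for illustration.

```python
def count(values, target):
    """Number of elements of values equal to target; null in, null out."""
    if values is None or target is None:
        return None
    return sum(1 for value in values if value == target)


# Illustrative data standing in for columns.type and ACCELChannelCount.
channel_types = ["ACCEL", "GYRO", "ACCEL", "MAGN"]
accel_channel_count = 2

check_passes = count(channel_types, "ACCEL") == accel_channel_count
```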

effigies avatar Sep 01 '22 13:09 effigies

Doesn't look like it's implemented in the current validator. My first inclination in JS is to use filter and length: columns.type.filter((x) => x === "ACCEL").length === associations.nirs.ACCELChannelCount

But to implement a filter in our little language we'd either need lambda syntax or an implied variable to use in the conditional to filter on. The jq select operation comes to mind as an example of the second option. Can't link directly to select, but it's under this section: https://stedolan.github.io/jq/manual/?#Basicfilters

rwblair avatar Sep 01 '22 14:09 rwblair

@rwblair I've found an issue that needs sorting:

'VolumeTiming' is not monotonically increasing. 'VolumeTiming' is the time at which each volume was acquired during the acquisition, referring to the start of each readout in the ASL timeseries. Use this field instead of the 'RepetitionTime' field in the case that the ASL timeseries have a non-uniform time distance between acquired volumes. The list must have the same length as the 'sub-

Okay to add sorted() or sort() to the expression language?

effigies avatar Feb 05 '23 18:02 effigies

Sorted [ ] -> boolean seems fine to me. If we add a new function for every rule I suppose it'd still be a win over the old way of implementing rules in the validator.

rwblair avatar Feb 06 '23 16:02 rwblair

I was thinking sorted(array) -> array, and the check is sorted(sidecar.VolumeTiming) == sidecar.VolumeTiming. That seems potentially useful for other contexts, while sorted(array) -> bool seems like it would have fewer use cases.
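
A quick Python sketch of the `sorted(array) -> array` variant and the resulting check. The name `sorted_array` and the null handling (following the propagation rules proposed earlier in the thread) are assumptions.

```python
def sorted_array(values):
    """Return a new ascending-sorted list; null in, null out."""
    if values is None:
        return None
    return sorted(values)


def is_non_decreasing(values):
    """The check as written: sorted(x) == x."""
    return sorted_array(values) == values


volume_timing = [0.0, 2.5, 5.0, 7.5]
ok = is_non_decreasing(volume_timing)
```

Note that `sorted(x) == x` accepts repeated values such as `[0.0, 2.5, 2.5]`, so it tests for a non-decreasing sequence. If strictly increasing values are required, an additional condition would be needed.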

effigies avatar Feb 06 '23 16:02 effigies

Ah I forgot that == was defined for arrays. Works for me.

rwblair avatar Feb 06 '23 17:02 rwblair