kaitai_struct icon indicating copy to clipboard operation
kaitai_struct copied to clipboard

Validation of parsed data with assertions

Open GreyCat opened this issue 6 years ago • 25 comments

Extracting easier part of #81, with a concrete proposal on implementation.

Context: we want to have some sort of way to state that certain attributes have some sort of "valid" values. Stating that has two implications:

  1. All values which are "not valid" must trigger an error of parsing
  2. In some cases, we can calculate values automatically during generation of stream

The simplest existing implementation is contents: it is essentially a byte array, which validates that its contents are exactly as specified in ksy file. On parsing, we'll throw an error if byte array contents won't match the expected value. On generation, we can just write the expected value, and not bother the end-user with setting the attribute manually.

Syntax proposal

Add valid key to the attribute spec. Inside it, there could be:

  • A string, integer or boolean => it will be treated as expression that's supposed to be the only valid value of this attribute. For example:
- id: foo
  type: u4le
  valid: 0x42 # expected in stream: 42 00 00 00
- id: bar
  type: strz
  valid: '"abcd"' # expected in stream: 61 62 63 64 00
- id: baz
  type: u2le
  valid: '2 * foo' # expected in stream: 84 00
  • A map, which can feature keys:
    • eq: expected states that attribute value must be equal to given value (i.e. the same as above)
    • min: expected states minimum valid value for this attribute (compared with a >= expected)
    • max: expected states maximum valid value for this attribute (compared with a <= expected)
    • any-of: list-of-expected states that valid value must be equal to one in the given list of expressions
    • expr: expression states that given expression need to be evaluated (substituting _ with an actual attribute value) to be true to treat attribute value as valid.

All expected and expression values are KS expression strings, which can be constants or can depend on other attributes. list-of-expected must be a YAML array of KS expression strings. expected inferred type must be comparable with type of the attribute. expression inferred type must be boolean.

GreyCat avatar May 15 '18 18:05 GreyCat

You have suggested to check such relations using syntax in YAML. Having this syntax in YAML has both drawbacks and profits. The drawback is that we have to support it alongside with usual expressions. The profit is that it is yaml, any tools working with yaml would be able to parse it.

So shouod we drop our current expr syntax and start using entirely yaml-based?

KOLANICH avatar May 16 '18 07:05 KOLANICH

So shouod we drop our current expr syntax

What exactly is "our current expr syntax", and what do you mean by "dropping" it? We had no syntax proposal for validation until now, and I suspect what you imply is covered in valid/expr, for example, checking that value is even:

- id: foo
  type: u4
  valid:
    expr: _ % 2 == 0

GreyCat avatar May 16 '18 09:05 GreyCat

A map, which can feature keys

If the field is supposed to contain an enum value, can I check that by specifying a key of enum, or can any-of take the enum type-name instead of a list?

webbnh avatar May 16 '18 16:05 webbnh

@webbnh Probably for enums we can just do the following:

  1. Ensure that in all languages "invalid" enum values does not trigger a error by default
  2. Add extra key here, something like that:
- id: foo
  type: u4
  enum: bar
  valid:
    enum-key: true

GreyCat avatar May 16 '18 17:05 GreyCat

What exactly is "our current expr syntax", and what do you mean by "dropping" it?

What you have proposed can be expressed as a logical expression, assumming that we have some facilities. eq: expected <=>

throw:
  if: _ != expected

min: expected <=> if: _ < expected max: expected <=> if: _ > expected any-of, enum-key <=> #434

But we can go from another side. We can eliminate textual expressions and force the developer into using YAML AST.

for example if: a != 0 and b > _io.size would be something like

size:
  and:
    - '!=':
      - a
      - 0
    - '>':
      - b
      - '_io.size'

I know that it is monstrouous space-wasting and unusable, but I'm not sure ;)

KOLANICH avatar May 16 '18 18:05 KOLANICH

My point here is actually very simple: if we'll just do it as one expression — which we actually still can do, i.e. stuff like that is legal:

valid:
  expr: _ == 42 # the same as `eq: 42`
valid:
  expr: _ >= 42 # the same as `min: 42`

It's easy to do, but we lose declarativity this way. If it's only an expression, one can't just look at this validation spec and render it as a range input (for example, if a min/max boundaries are known), or as a combo box with a set of choices (if a set of choices is known). Basically, if we only have a validation expression, the only thing we can do (without reversing it and applying some symbolic solver) is checking every particular value for validity, and that's it. Declarative configuration offers more, for example, you can clearly get a list of valid values, or boundaries, or stuff like that. Proposed mechanism is also extensible. For example, if we'll want to add size limits of strings in future, we can do something like

valid:
  max-size: 42

and it could be rendered as string with limited input box, not asking a user to enter arbitrary string and then rejecting it for some mysterious reason.

GreyCat avatar May 17 '18 14:05 GreyCat

I haven't thought about using that info for metadata for GUI purposes. It's cool idea. In this case

reversing it and applying some symbolic solver

IMHO is the best way to do it. If we have problems with performance we can try to cache solver results.

It's easy to do, but we lose declarativity this way.

I guess we don't lose declarativity, since we still declare "this value must satisfy this logical expression" and all the needed info is there, but we lose some verbosity. What we win, is that we don't make an assumption about programmer using the proposed verbose syntax instead of exprs. So even if a programmer used exprs, if we have a system recovering these info from exprs, we would expect it working correctly, even retroactively, when a new extractor is added and it becomes working even on old specs without any modification

KOLANICH avatar May 17 '18 23:05 KOLANICH

Please take into account bit sized values like:

- id: fixed_10
  type: b2
  contents: [2]

- id: fixed_10
  contents: [true, false]

kalidasya avatar May 23 '19 10:05 kalidasya

I went forward and do some proof-of-concept implementation in current compiler. Namely:

  • valid is now recognized and parsed in its short form => only equals is supported now, but it's really easy to extend it to support ranges, lists, arbitrary expressions, etc.
  • There is a system of relevant ValidationSpec objects created during KSY loading
  • For Java, Ruby, Python, JS: codegen actually generates validation code
  • Added 3 tests valid_*.ksy:
  • Added a more or less formal system of exceptions specific to KS
    • Implemented it in Ruby, Python, Java; JS is on the way
    • Added relevant knowledge of this system of exceptions into ksc

GreyCat avatar Jul 06 '19 05:07 GreyCat

The current KSC (0.9-SNAPSHOT) generates incorrect code when valid is used on an optional field (with if). The validation check is executed outside the if clause instead of inside it.

Test code:

meta:
  id: valid_if_test
seq:
  - id: optional
    if: false
    type: u1
    valid: 42

produces (in JavaScript, but it happens in all languages):

ValidIfTest.prototype._read = function() {
  if (false) {
    this.optional = this._io.readU1();
  }
  if (!(this.optional == 42)) {
    throw new KaitaiStream.ValidationNotEqualError(42, this.optional, this._io, "/seq/0");
  }
}

generalmimon avatar Oct 07 '19 22:10 generalmimon

@generalmimon

The current KSC (0.9-SNAPSHOT) generates incorrect code when valid is used on an optional field (with if). The validation check is executed outside the if clause instead of inside it.

Probably moving it inside if would make sense, good catch! Can I ask you to contribute relevant test to tests repo?

GreyCat avatar Oct 19 '19 12:10 GreyCat

Sure.

generalmimon avatar Oct 19 '19 13:10 generalmimon

Recent compiler updates fixed ValidFailInst and ValidNotParsedIf tests by @generalmimon. Thanks for enhancing our test base!

GreyCat avatar Nov 25 '19 02:11 GreyCat

Not closely related, but you may find it interesting how CUE unifies types and constrains: https://cuelang.org/docs/concepts/logic/

suhr avatar Dec 13 '19 15:12 suhr

Implemented min, max and any-of clauses. Works for C++, JS, Python, Ruby. Pending runtime fixes for Java.

GreyCat avatar Dec 28 '19 12:12 GreyCat

May be name key expect? I think, is sounds better:

id: answer
expect: 0x42
expect:
  min: 42
  max: 42
  any-of: [answer]
  expr: _ == 42

Mingun avatar Jan 11 '20 16:01 Mingun

AFAIK, this feature is basically implemented (all valid sub-keys described in https://github.com/kaitai-io/kaitai_struct/issues/435#issue-323332353 work), but no proper type checks if the type of the attribute can be compared with the type of the expressions in valid sub-keys (like eq, min, any-of, ...) are done yet. For example, this spec should raise the same "can't compare ... and ..." error as the == operator currently does:

meta:
  id: attr_valid_eq_bad
seq:
  - id: foo
    type: u1
    valid:
      eq: '"bar"'

So that means implementing the checks in compiler + adding relevant ErrorMessagesSpecs testing if the error message is correct. I think it makes sense to make the message look like that one in the existing test for == type mismatch: tests/formats_err/expr_compare_enum2.ksy:1

However, I think that this little thing can wait until we release the 0.9 version, so I'm moving this issue from the 0.9 milestone to 0.10.

generalmimon avatar Jul 31 '20 20:07 generalmimon

Given that the keyword is mostly implemented, is it time to add it to the .ksy schema? This would be beneficial for users (like me) that use a YAML language server for validation while making structures

sykhro avatar Mar 25 '21 19:03 sykhro

I found out that it will only pay attention to the first valid key of the field but ignore the others.

seq:
  - id: first_byte
    type: u1
    valid:
      eq: 0x00
    # ignores this:
    valid: 0x01
    # ...
    valid:
      expr: _ >= clip_bottom
      # ignores these:
      expr: _ <= clip_top
      expr: _ == 0x7 or _ == 0x17 or _ == 0x23 or _ == 0x68 or _ == 0xB5

A reasonable behaviour would be a logical AND of all valid expressions.

It strictly separates min/max, eq, any-of and expr so that combinations of these are not accepted. The error message might be confusing, stating that the key is unknown. Writing it like this is not possible:

seq:
  - id: clip_bottom
    type: u1
  - id: clip_top
    type: u1
  - id: byte
    type: u1
    valid:
      min: clip_bottom
      max: clip_top
      any-of: [0x7, 0x17, 0x23, 0x68, 0xB5]

If this would work it would be a fair solution:

    # ...
    valid:
      min: clip_bottom
      max: clip_top
    valid:
      any-of: [0x7, 0x17, 0x23, 0x68, 0xB5]

The only combination, which works at the moment, is one-time min + max. Multiple min or max are ignored. A useful behaviour would be an n-ary min and max. But if these would be ANDed instead, multiple min would validate an infimum and multiple max a supremum.

This is not a pressing issue. There is no functional limitation. One could write the set of validation formulas as a single formula:

    # ...
    valid:
      expr: |
        _ >= clip_bottom and
        _ <= clip_top and
        (_ == 0x7 or _ == 0x17 or _ == 0x23 or _ == 0x68 or _ == 0xB5)

krisutofu avatar Jan 02 '22 01:01 krisutofu

I found out that it will only pay attention to the first valid key of the field but ignore the others.

.ksy is the file extension, where the letter y stands for YAML. What is placed into ksy files is a subset of YAML. YAML is the language to serialize and deserialize possibly nested data like in JSON.

in JavaScript it'd look like


let field = {};
....
field.valid = {eq: 0x00};
field.valid = 0x01;

As you see, the latter assignments override the former. When you store data it is clear that this must never happen. Different YAML parsers react differently on that. Some raise an error. Other ones just take one of the values and silently ignore the rest.

So your mental model is wrong, repeated keys don't mean what you assume them to mean.

KOLANICH avatar Jan 02 '22 05:01 KOLANICH

... in JavaScript it'd look like

let field = {};
....
field.valid = {eq: 0x00};
field.valid = 0x01;

As you see, the latter assignments overide the former. When you store data it is clear that this must never happen. Different YAML parsers react differently on that. Some raise an error. Other ones just take one of values and silently ignore the rest.

So your mental model is wrong, repeated keys don't mean what you assume them to mean.

Sorry, thank you for clarifying.

krisutofu avatar Jan 02 '22 17:01 krisutofu

Considering that we can assign contents: ["some string", 0x30, 30] shouldn't we also be able to assign

seq:
  - id: sig
    size: : 6
    valid:
      any-of: [ [ "AA", 0, 0, 1, 0 ] , [ "BB", 0, 0, 1, 0 ] ]

currently the compiler throws an error as it expects "string"? I assume that means KSY expressions so I tried [ '["AA", 0, 0, 1, 0]' , '["BB", 0, 0, 1, 0]' ] to no avail... can't combine output types: CalcStrType vs Int1Type(true)

Heck even the painful approach doesn't work:

seq:
  - id: sig1
    type: str
    size: : 2
    valid:
      any-of: [ "AA", "BB" ]
  - id: sig2
    contents: [ 0, 0, 1, 0 ]

KSC seems to evaluate AA and BB as KSY expressions which makes it search for ids matching those values but an id can only have lower case letters hence an error is thrown...

Omar-Abdul-Azeez avatar Jan 14 '23 12:01 Omar-Abdul-Azeez

I accidentally discovered that this was a thing. When will the relevant keywords be added to the docs and the YAML schema (the kaitai-struct-vscode plugin uses it).

demberto avatar Apr 12 '23 22:04 demberto

I accidentally discovered that this was a thing. When will the relevant keywords be added to the docs and the YAML schema (the kaitai-struct-vscode plugin uses it).

Which version of the VSCode extension do you have? I have v0.8.1 and it does not have the valid-expressions included. They are underlined red in my version. It seems to be outdated sinde it doesn't know the bit-endian meta property.

A year ago, I tried to contribute documentation about the valid property but it was entirely ignored. Here it is: valid expression documentation (Jan, 2022)

krisutofu avatar Apr 13 '23 11:04 krisutofu

Which version of the VSCode extension do you have? I have v0.8.1

Same.

demberto avatar Apr 13 '23 22:04 demberto