kaitai_struct icon indicating copy to clipboard operation
kaitai_struct copied to clipboard

Consistently handling user types and built-in sized types

Open Mingun opened this issue 3 years ago • 3 comments

Summary

Trying to formalize the way how size-related properties should model a Kaitai Struct type, I found some inconsistences. In the spoilers below you can see a diagrams that shown how each part of a binary stream is used by an attribute depending on which how that attribute is defined:

`size-eos: true`

The size of an attribute is defined by a size-eos: true key:

  • #          /==\ - type
    # Stream: [____                            ]
    #         0                                eos
    #          \_______available for type_____/
    #           \_______consumed bytes_______/
    #            \_____________stream size__/
    size-eos: true
    
  • #          /==\ - type
    # Stream: [____                  x         ]
    #         0                  terminator    eos
    #          \_available for type_/         /
    #           \_______consumed bytes_______/
    #            \_____________stream size__/
    size-eos: true
    terminator: x
    
  • #          /==\ - type           /========\ - padding
    # Stream: [____                  ~~~~~~~~~~]
    #         0                                eos
    #          \_available for type_/         /
    #           \_______consumed bytes_______/
    #            \_____________stream size__/
    size-eos: true
    pad-right: ~
    # terminator: ~ - implicit
    
  • #          /==\ - type            /=======\ - padding
    # Stream: [____                  x~~~~~~~~~]
    #         0                                eos
    #          \_available for type_/         /
    #           \_______consumed bytes_______/
    #            \_____________stream size__/
    size-eos: true
    terminator: x
    pad-right: ~
    
`size`

The size of an attribute is defined by a size key:

  • #          /==\ - type
    # Stream: [____                       ]    |
    #         0                                eos
    #          \____available for type___/    /
    #           \_____consumed bytes____/    /
    #            \_____________stream size__/
    size: s
    
  • #          /==\ - type
    # Stream: [____                  x    ]    |
    #         0                  terminator    eos
    #          \_available for type_/    /    /
    #           \_____consumed bytes____/    /
    #            \_____________stream size__/
    size: s
    terminator: x
    
  • #          /==\ - type           /===\ - padding
    # Stream: [____                  ~~~~~]    |
    #         0                                eos
    #          \_available for type_/    /    /
    #           \_____consumed bytes____/    /
    #            \_____________stream size__/
    size: s
    pad-right: ~
    # terminator: ~ - implicit
    
  • #          /==\ - type            /==\ - padding
    # Stream: [____                  x~~~~]    |
    #         0                  terminator    eos
    #          \_available for type_/    /    /
    #           \_____consumed bytes____/    /
    #            \_____________stream size__/
    size: s
    terminator: x
    pad-right: ~
    
No `size` or `size-eos: true`

The size of an attribute is defined implicitly by absence of any size-related keys:

  • #          /==\ - type
    # Stream: [____                  ]         |
    #         0                  terminator    eos
    #          \_available for type_/         /
    #           \__consumed bytes__/         /
    #            \_____________stream size__/
    terminator: x
    
  • pad-right implicitly defines terminator but this is lead to an error:
    # (main): /seq/0: error: 'size', 'size-eos' or 'terminator' must be specified
    #          /==\ - type           /===\ - padding
    # Stream: [____                  ~~~~~]    |
    #         0                                eos
    #          \_available for type_/    /    /
    #           \_____consumed bytes____/    /
    #            \_____________stream size__/
    pad-right: ~
    # terminator: ~ - implicit
    
  • pad-right is useless, but this is not forbidden:
    #          /==\ - type
    # Stream: [____                  ]         |
    #         0                  terminator    eos
    #          \_available for type_/         /
    #           \__consumed bytes__/         /
    #            \_____________stream size__/
    terminator: x
    pad-right: ~
    

Problem 1 -- size and size-eos handled inconsistently depending on the type kind

I found, that size and size-eos keys won't work with built-in types with known size (u1, etc.):

- id: field
  type: u4 # and other sized built-ins
  # Compile-time error
  # io.kaitai.struct.format.YAMLParseException: /seq/0/size: unknown key found, expected: consume, doc, doc-ref, eos-error, id, if, include, repeat, terminator, type, valid
  # size: 10
  # Compile-time error
  # size-eos: true

At the same time you can specify terminator key, but it seams it has no meaning for that case:

- id: field
  # result: 0x504b0304
  type: u4be # and other sized built-ins
  # on input [50 4b 03 04]
  terminator: 3

Inconsistency 2 -- pad-right can not be used alone

When pad-right: x used together with the size or size-eos: true, it is implicitly defines a terminator: x if the value of the terminator is not defined explicitly or implicitly by strz.

size-eos: true
pad-right: y
# terminator: y - implicit
size: x
pad-right: y
# terminator: y - implicit

Also, it is possible to define only a terminator without size or size-eos: true:

terminator: x
terminator: x
pad-right: y

At the same time defining only the pad-right is an error:

# (main): /seq/0: error: 'size', 'size-eos' or 'terminator' must be specified
pad-right: ~
# terminator: ~ - implicit

Solution

I think, it will more consistent, if size-related keys will be independent from types. Each kaitai stream must have 4 properties:

  • start offset
  • consumed size
  • availible size for sub-streams. That size is different from "consumed size" when terminator and/or pad-right is used. Technically more convenient to store as an end offset
  • local current offset -- used as base for pos keys

At start root stream have:

  • start offset = 0
  • local current offset = start offset
  • consumed size = undefined
  • end offset = start offset + data size

Parsing will run as follows (current_offset -- global offset in parsed data):

  1. Create new sub-stream of specified size
    • start_offset = current_offset
    • end_offset = parent_stream.end_offset
    • local_current_offset = parent_stream.local_current_offset
    • consumed_size = undefined
    • if size specified, consumed_size = <size>, end_offset = start_offset + <size>, local_current_offset = 0
    • if size-eos: true specified, local_current_offset = 0
  2. If terminator specified, find first terminator in sub-stream, and limit stream availible size
    • local_current_offset = 0
    • if terminator NOT found
      • if eos-error: true raise error, stop
      • if consumed_size is not defined, consumed_size = end_offset - start_offset
    • if terminator found
      • end_offset = <position of terminator>
      • if include: true specified, end_offset = end_offset + 1
      • if consumed_size is not defined
        • consumed_size = <position of terminator> - start_offset
        • if consume: true specified, consumed_size = consumed_size + 1
  3. If terminator NOT found and pad-right specified, find first pad-right in sub-stream, and limit stream availible size
    • if padding byte(s) found, end_offset = <position of start padding>
  4. Read specified type, using [start_offset, end_offset) slice as source of data
    • if type is built-in type, call corresponding built-in function
      • fixed sized types read only necessary part of buffer, other is unused
      • variable sized types (str, strz and bytes) use entire buffer as result
    • if type is user type, recursively repeat this algorithm for all fields of user type
    • if type is opaque type, call read function of opaque type
  5. If consumed_size is not defined, set consumed_size = <readed size>
  6. current_offset = start_offset + consumed_size

That algorithm also fixes #783

Mingun avatar Jul 29 '20 18:07 Mingun

Allowing size to built-in sized types also more naturally handle such situations:

meta:
  id: size_problems
  endian: be
  encoding: utf-8
seq:
  - id: field
    size: 5
    type:
      switch-on: 2
      cases:
        1: u4
        2: str

Depending on switch-on result size member is taken into account or not. I think that is not good. size should define size of field, not the size of type.

Mingun avatar Jul 30 '20 07:07 Mingun

In the example posted in your follow-up reply above, your two cases are a five-character string and a four-byte integer presumably with a padding byte: it would be best to handle them as separate, explicit "types", rather than trying to claim that the field is five-bytes thereby creating a new problem in which we try to figure out how to extract an integer from an over-sized field (i.e., how is KSC supposed to know which four of the five bytes to use for the integer (the answer might change with endianness...) and what is it supposed to do with the extra byte?).

My instinct here is that the size attribute is implicit when applied to a field with a built-in type. I don't see any reason why one shouldn't be allowed to specify it explicitly (other than product implementation, documentation, and support costs....), but, if one does specify it, it must match, and so your example of a size: 5 field containing a u4 should result in an error.

webbnh avatar Jul 30 '20 16:07 webbnh

This is just a minimal example showing the problem, assume that it will look like this:

meta:
  id: size_problems
  endian: be
  encoding: utf-8
seq:
  - id: field
    size: 5
    type:
      switch-on: 2
      cases:
        1: U4
        2: str
types:
  U4:
    seq:
      - id: field
        size: 4

Now it no longer raises these questions, because everything works as described in the documentation, right?

creating a new problem in which we try to figure out how to extract an integer from an over-sized field

This problem never existed, you don't complain about which bytes to use for type U4 in my last example? Why you should complain about u4? Just because it is built-in type?

(i.e., how is KSC supposed to know which four of the five bytes to use for the integer (the answer might change with endianness...) and what is it supposed to do with the extra byte?)

Endianness do not matter here. First 4 bytes will be used for decoding type, last byte is garbage

My instinct here is that the size attribute is implicit when applied to a field with a built-in type. I don't see any reason why one shouldn't be allowed to specify it explicitly (other than product implementation, documentation, and support costs....), but, if one does specify it, it must match, and so your example of a size: 5 field containing a u4 should result in an error.

If absence of size for naturally sized types will just take natural size of type, then switch-on constructions with types of different sizes will not work. For example, that breaks almost any T(L)V implementation.

rather than trying to claim that the field is five-bytes

Just on the contrary, it is most obvious to say that size refers to the size of the field, and not to the size of the type. For user types, this is the case.

Mingun avatar Jul 30 '20 16:07 Mingun