kaitai_struct
kaitai_struct copied to clipboard
Consistently handling user types and built-in sized types
Summary
Trying to formalize the way how size-related properties should model a Kaitai Struct type, I found some inconsistences. In the spoilers below you can see a diagrams that shown how each part of a binary stream is used by an attribute depending on which how that attribute is defined:
`size-eos: true`
The size of an attribute is defined by a size-eos: true
key:
-
# /==\ - type # Stream: [____ ] # 0 eos # \_______available for type_____/ # \_______consumed bytes_______/ # \_____________stream size__/ size-eos: true
-
# /==\ - type # Stream: [____ x ] # 0 terminator eos # \_available for type_/ / # \_______consumed bytes_______/ # \_____________stream size__/ size-eos: true terminator: x
-
# /==\ - type /========\ - padding # Stream: [____ ~~~~~~~~~~] # 0 eos # \_available for type_/ / # \_______consumed bytes_______/ # \_____________stream size__/ size-eos: true pad-right: ~ # terminator: ~ - implicit
-
# /==\ - type /=======\ - padding # Stream: [____ x~~~~~~~~~] # 0 eos # \_available for type_/ / # \_______consumed bytes_______/ # \_____________stream size__/ size-eos: true terminator: x pad-right: ~
`size`
The size of an attribute is defined by a size
key:
-
# /==\ - type # Stream: [____ ] | # 0 eos # \____available for type___/ / # \_____consumed bytes____/ / # \_____________stream size__/ size: s
-
# /==\ - type # Stream: [____ x ] | # 0 terminator eos # \_available for type_/ / / # \_____consumed bytes____/ / # \_____________stream size__/ size: s terminator: x
-
# /==\ - type /===\ - padding # Stream: [____ ~~~~~] | # 0 eos # \_available for type_/ / / # \_____consumed bytes____/ / # \_____________stream size__/ size: s pad-right: ~ # terminator: ~ - implicit
-
# /==\ - type /==\ - padding # Stream: [____ x~~~~] | # 0 terminator eos # \_available for type_/ / / # \_____consumed bytes____/ / # \_____________stream size__/ size: s terminator: x pad-right: ~
No `size` or `size-eos: true`
The size of an attribute is defined implicitly by absence of any size-related keys:
-
# /==\ - type # Stream: [____ ] | # 0 terminator eos # \_available for type_/ / # \__consumed bytes__/ / # \_____________stream size__/ terminator: x
-
pad-right
implicitly definesterminator
but this is lead to an error:# (main): /seq/0: error: 'size', 'size-eos' or 'terminator' must be specified # /==\ - type /===\ - padding # Stream: [____ ~~~~~] | # 0 eos # \_available for type_/ / / # \_____consumed bytes____/ / # \_____________stream size__/ pad-right: ~ # terminator: ~ - implicit
-
pad-right
is useless, but this is not forbidden:# /==\ - type # Stream: [____ ] | # 0 terminator eos # \_available for type_/ / # \__consumed bytes__/ / # \_____________stream size__/ terminator: x pad-right: ~
Problem 1 -- size
and size-eos
handled inconsistently depending on the type kind
I found, that size
and size-eos
keys won't work with built-in types with known size (u1
, etc.):
- id: field
type: u4 # and other sized built-ins
# Compile-time error
# io.kaitai.struct.format.YAMLParseException: /seq/0/size: unknown key found, expected: consume, doc, doc-ref, eos-error, id, if, include, repeat, terminator, type, valid
# size: 10
# Compile-time error
# size-eos: true
At the same time you can specify terminator
key, but it seams it has no meaning for that case:
- id: field
# result: 0x504b0304
type: u4be # and other sized built-ins
# on input [50 4b 03 04]
terminator: 3
Inconsistency 2 -- pad-right
can not be used alone
When pad-right: x
used together with the size
or size-eos: true
, it is implicitly defines a terminator: x
if the value of the terminator is not defined explicitly or implicitly by strz
.
size-eos: true
pad-right: y
# terminator: y - implicit
size: x
pad-right: y
# terminator: y - implicit
Also, it is possible to define only a terminator
without size
or size-eos: true
:
terminator: x
terminator: x
pad-right: y
At the same time defining only the pad-right
is an error:
# (main): /seq/0: error: 'size', 'size-eos' or 'terminator' must be specified
pad-right: ~
# terminator: ~ - implicit
Solution
I think, it will more consistent, if size-related keys will be independent from types. Each kaitai stream must have 4 properties:
- start offset
- consumed size
- availible size for sub-streams. That size is different from "consumed size" when terminator and/or pad-right is used. Technically more convenient to store as an end offset
- local current offset -- used as base for
pos
keys
At start root stream have:
- start offset = 0
- local current offset = start offset
- consumed size = undefined
- end offset = start offset + data size
Parsing will run as follows (current_offset
-- global offset in parsed data):
- Create new sub-stream of specified size
-
start_offset = current_offset
-
end_offset = parent_stream.end_offset
-
local_current_offset = parent_stream.local_current_offset
-
consumed_size = undefined
- if
size
specified,consumed_size = <size>, end_offset = start_offset + <size>, local_current_offset = 0
- if
size-eos: true
specified,local_current_offset = 0
-
- If
terminator
specified, find firstterminator
in sub-stream, and limit stream availible size-
local_current_offset = 0
- if terminator NOT found
- if
eos-error: true
raise error, stop - if
consumed_size
is not defined,consumed_size = end_offset - start_offset
- if
- if terminator found
-
end_offset = <position of terminator>
- if
include: true
specified,end_offset = end_offset + 1
- if
consumed_size
is not defined-
consumed_size = <position of terminator> - start_offset
- if
consume: true
specified,consumed_size = consumed_size + 1
-
-
-
- If terminator NOT found and
pad-right
specified, find firstpad-right
in sub-stream, and limit stream availible size- if padding byte(s) found,
end_offset = <position of start padding>
- if padding byte(s) found,
- Read specified type, using
[start_offset, end_offset)
slice as source of data- if type is built-in type, call corresponding built-in function
- fixed sized types read only necessary part of buffer, other is unused
- variable sized types (
str
,strz
andbytes
) use entire buffer as result
- if type is user type, recursively repeat this algorithm for all fields of user type
- if type is opaque type, call read function of opaque type
- if type is built-in type, call corresponding built-in function
- If
consumed_size
is not defined, setconsumed_size = <readed size>
-
current_offset = start_offset + consumed_size
That algorithm also fixes #783
Allowing size to built-in sized types also more naturally handle such situations:
meta:
id: size_problems
endian: be
encoding: utf-8
seq:
- id: field
size: 5
type:
switch-on: 2
cases:
1: u4
2: str
Depending on switch-on
result size
member is taken into account or not. I think that is not good. size
should define size of field, not the size of type.
In the example posted in your follow-up reply above, your two cases are a five-character string and a four-byte integer presumably with a padding byte: it would be best to handle them as separate, explicit "types", rather than trying to claim that the field is five-bytes thereby creating a new problem in which we try to figure out how to extract an integer from an over-sized field (i.e., how is KSC supposed to know which four of the five bytes to use for the integer (the answer might change with endianness...) and what is it supposed to do with the extra byte?).
My instinct here is that the size
attribute is implicit when applied to a field with a built-in type. I don't see any reason why one shouldn't be allowed to specify it explicitly (other than product implementation, documentation, and support costs....), but, if one does specify it, it must match, and so your example of a size: 5
field containing a u4
should result in an error.
This is just a minimal example showing the problem, assume that it will look like this:
meta:
id: size_problems
endian: be
encoding: utf-8
seq:
- id: field
size: 5
type:
switch-on: 2
cases:
1: U4
2: str
types:
U4:
seq:
- id: field
size: 4
Now it no longer raises these questions, because everything works as described in the documentation, right?
creating a new problem in which we try to figure out how to extract an integer from an over-sized field
This problem never existed, you don't complain about which bytes to use for type U4
in my last example? Why you should complain about u4
? Just because it is built-in type?
(i.e., how is KSC supposed to know which four of the five bytes to use for the integer (the answer might change with endianness...) and what is it supposed to do with the extra byte?)
Endianness do not matter here. First 4 bytes will be used for decoding type, last byte is garbage
My instinct here is that the size attribute is implicit when applied to a field with a built-in type. I don't see any reason why one shouldn't be allowed to specify it explicitly (other than product implementation, documentation, and support costs....), but, if one does specify it, it must match, and so your example of a size: 5 field containing a u4 should result in an error.
If absence of size
for naturally sized types will just take natural size of type, then switch-on
constructions with types of different sizes will not work. For example, that breaks almost any T(L)V implementation.
rather than trying to claim that the field is five-bytes
Just on the contrary, it is most obvious to say that size refers to the size of the field, and not to the size of the type. For user types, this is the case.