kaitai_struct
Allow repeat: until to optionally exclude the last non-matching element (look-ahead feature)
Currently `repeat: until` will always read and include the non-matching element. It would be preferable if this behaviour could be toggled (with `consume: false`?) so that it is possible to perform a look-ahead on future bytes in the stream. Because an attempt would need to be made to read the element following the last valid element matching the `repeat-until` condition, it is possible that errors may be encountered. In such a case the behaviour should be to catch and ignore look-ahead reading errors, and reset the stream position back to the last known element which was successfully read and passed the `repeat-until` condition.
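The desired loop can be sketched in plain Python (a hand-written illustration, not generated Kaitai code; the u2 element type and the terminator value are made up for the example):

```python
import io
import struct

TERMINATOR = 0xFFFF  # made-up marker value that ends the repeat block

def read_elements(stream: io.BytesIO):
    """Read big-endian u2 elements until a non-matching one; rewind so
    the non-matching element is neither consumed nor included."""
    elements = []
    while True:
        pos = stream.tell()      # remember where the look-ahead started
        raw = stream.read(2)
        if len(raw) < 2:         # read error (end of stream): ignore, rewind
            stream.seek(pos)
            break
        (value,) = struct.unpack(">H", raw)
        if value == TERMINATOR:  # repeat-until condition matched: rewind
            stream.seek(pos)
            break
        elements.append(value)
    return elements
```

After the loop the stream is positioned on the first non-matching byte, so the next `seq` attribute starts exactly where the look-ahead stopped.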
An example of where this feature could be helpful is in parsing the JPEG Interchange Format (see ITU-T T.81 at https://www.w3.org/Graphics/JPEG/itu-t81.pdf). In the non-hierarchical variant of this format, any number of table/misc segment elements can occur in any order prior to a mandatory frame header element (see figure B.2, page 34 of ITU-T T.81).
I have managed to implement a workaround, demonstrated below (non-essential parts removed):
```yaml
types:
  frame:
    seq:
      - id: table_or_misc_segments
        type: table_or_misc_segment
        repeat: until
        repeat-until: >-
          _.next_marker != marker_codes::define_quantization_tables and
          _.next_marker != marker_codes::huffman_table_specification and
          _.next_marker != marker_codes::arithmetic_coding_conditioning_specification and
          _.next_marker != marker_codes::define_restart_interval and
          _.next_marker != marker_codes::comment and
          _.next_marker != marker_codes::application_segment_0 and
          _.next_marker != marker_codes::application_segment_1 and
          _.next_marker != marker_codes::application_segment_2 and
          _.next_marker != marker_codes::application_segment_3 and
          _.next_marker != marker_codes::application_segment_4 and
          _.next_marker != marker_codes::application_segment_5 and
          _.next_marker != marker_codes::application_segment_6 and
          _.next_marker != marker_codes::application_segment_7 and
          _.next_marker != marker_codes::application_segment_8 and
          _.next_marker != marker_codes::application_segment_9 and
          _.next_marker != marker_codes::application_segment_10 and
          _.next_marker != marker_codes::application_segment_11 and
          _.next_marker != marker_codes::application_segment_12 and
          _.next_marker != marker_codes::application_segment_13 and
          _.next_marker != marker_codes::application_segment_14 and
          _.next_marker != marker_codes::application_segment_15
    types:
      table_or_misc_segment:
        seq:
          - id: marker
            type: u2
            enum: marker_codes
          - id: length
            type: u2
          - id: data
            size: length - 2
            type:
              switch-on: marker
              cases:
                #...
        instances:
          next_marker:
            pos: _io.pos
            type: u2
            enum: marker_codes
```
It seems that the `next_marker` instance is calculated after the `table_or_misc_segment` sequence has been fully parsed, thus `pos: _io.pos` returns the position within the file of the next byte beyond the `table_or_misc_segment` sequence.
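Mechanically, such an instance is just a positional peek: read at the current position, then restore it. A rough Python equivalent (illustrative only, not generated code):

```python
import io
import struct

def peek_u2(stream: io.BytesIO) -> int:
    """Read a big-endian u2 at the current position, then restore the
    position: the runtime effect of a `pos: _io.pos` instance that is
    evaluated after the `seq` attributes have been parsed."""
    pos = stream.tell()
    (value,) = struct.unpack(">H", stream.read(2))
    stream.seek(pos)
    return value
```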
Is this workaround suitable, or just a dodgy hack that will break in various translators?
There are several independent issues here.
1. Is it possible to omit the last read element from the resulting array? Yes, it's certainly possible and the change is pretty much trivial. However, I'm not sure it's all that useful per se: one can almost always skip the terminating element on the client's side, and, given that we have almost no array operations in the expression language, it is not of big concern for any processing that might happen in the expression language inside the ksy.
2. Can we backtrack the pointer after reading a last element which happened to be not what we were looking for? That should probably not be too hard to implement, i.e. we generate loops like this:
```java
boolean ok;
do {
    pos = this._io.pos();
    _it = new Element(this._io, this, _root);
    ok = _it.marker() != 0; // or some other condition
    if (ok) {
        this.elements.add(_it);
    } else {
        this._io.seek(pos);
    }
} while (ok);
```
3. Backtracking "after getting an error" is somewhat more difficult. First of all, there is no universal concept of "error" in KS. There is the `contents` thing for magic validation, there is the read-past-end-of-stream exception, and there are several proposals (like #81) to add a validation expression, but they are not set in stone and/or implemented yet. I believe that without getting that concept solid, we can't argue about implementing backtracking-after-error.
The standard approach to what you've demonstrated is something along the lines of:
```yaml
seq:
  - id: elements
    type: element
    repeat: until
    repeat-until: not _.is_valid_marker
types:
  element:
    seq:
      - id: marker
        type: u2
        enum: # ...
      - id: length
        type: u2
        if: is_valid_marker
      - id: body
        size: length - 2
        type:
          switch-on: marker
          cases:
            # ...
        if: is_valid_marker
    instances:
      is_valid_marker:
        value:
          # do some check for marker validity, like tons of comparisons or
          # something fancier
```
However, it consumes an extra 2 bytes of the stream as the "next marker". Your approach avoids that, but, yeah, it feels pretty hacky. I believe it should work the same way in all supported languages, but it would be very hard to use this ksy to generate writer code.
Probably it's easy to implement something like the code suggested in (2) using a syntax like:
```yaml
seq:
  - id: elements
    type: element
    repeat: while
    repeat-while: _.marker != 0
```
I guess it should be more or less understandable and expected by the majority of users. Would it be ok with you?
`repeat: while` would be perfect for this situation. I don't think option (3) would work, because the next marker would be read into `elements.last`; the next item in the sequence would therefore not start on the next marker.
Just to clarify with example (2), `_it = new Element(this._io, this, _root);` should not raise an exception (if there aren't enough bytes in the stream to read, if a magic number doesn't match, etc.). Below is a rough example of the desired output code:
```java
boolean ok;
do {
    pos = this._io.pos();
    ok = false;
    try {
        _it = new Element(this._io, this, _root);
        if (_it.marker() != 0) { // or some other condition
            ok = true;
        }
    } catch (...) {} // all exceptions which need to be caught
    if (ok) {
        this.elements.add(_it);
    } else {
        this._io.seek(pos);
    }
} while (ok);
```
A read of `Element` may be unsuccessful due to the next element being of a different type with a different magic number, a different length than `Element`, etc.
It's a bad idea to catch all exceptions, especially in low-level-sensitive languages such as C++. I've already mentioned in (3) that a concept of an "erroneously read element" is next to non-existent in KS now. For example, typecasting (`foo.as<bar>`) to an incompatible type will raise some sort of cast exception in most languages, but not in C++. Division by zero is an exception in Java, Ruby and Python, Infinity in JavaScript, and aborts with a signal in C++ (platform-dependent). Etc, etc.
So, if you want error catching and ignoring as well, we need to:
- Develop a more or less viable system of exceptions.
- Add some way to define that exception-catching behavior. Probably it should be the same for parsing of all types (or at least all user types), something akin to `eos-error: false`.
- Do lots of tests and define behavior in complex scenarios, such as ones with a buffered element (i.e. with `size`) + `process`, etc, etc.
For practical purposes, I guess, it's much easier (and cleaner) to just define the `element` type in a way that it will not break parsing with an exception, but rather will result in a mostly nulled structure.
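The "mostly nulled structure" idea can be sketched in plain Python (a hypothetical `Element` with a marker/length/body layout, not generated code):

```python
import io
import struct

class Element:
    """Parse a (marker, length, body) record; on any read problem,
    leave all fields as None instead of raising an exception."""
    def __init__(self, stream: io.BytesIO):
        self.marker = self.length = self.body = None
        start = stream.tell()
        try:
            (marker,) = struct.unpack(">H", stream.read(2))
            (length,) = struct.unpack(">H", stream.read(2))
            if length < 2:
                raise EOFError("length field too small")
            body = stream.read(length - 2)
            if len(body) != length - 2:
                raise EOFError("truncated body")
            # commit the fields only after everything parsed cleanly
            self.marker, self.length, self.body = marker, length, body
        except (struct.error, EOFError):
            stream.seek(start)  # rewind so the caller can retry or stop

    @property
    def is_valid(self):
        return self.marker is not None
```

A repeat loop can then test `is_valid` (analogous to the `is_valid_marker` instance above) instead of relying on an exception escaping the parse.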
I'm happy with the suggested approach of implementing exceptions and allowing exceptions to be caught/ignored. I realise this is no easy task though!
As an interim solution, I was thinking something like the following could be done:
```yaml
element_of_unknown_type:
  seq:
    - id: element
      type:
        switch-on: first_byte
        cases:
          0: element_of_type_a
          1: element_of_type_b
          2: element_of_type_c
  instances:
    first_byte:
      pos: 0
      type: u1
```
In order for this structure to be used successfully in cases where the size of `element_of_unknown_type` is unknown until it is read fully, a new mechanism would be required to create substreams of unknown size (allowing `pos` to be used). Is this a possibility? Alternatively, `pos` and `size` methods could be added to all elements in the tree, allowing `instances` to read data from the stream before `seq` attempts to do so (by use of `if`/`switch-on`/etc. with `seq`).
Hi, is anybody working on this enhancement?
Not at this moment. Probably we'll need to implement some kind of proper exception system first.
The absence of this feature is a blocker for me and anybody who deals with a variable number of blocks that:
- have a clear structure that starts with some marker (so I can at least try to distinguish it from something else)
- have different actual sizes (so I cannot predict anything without looking ahead)
- have no terminator (so I cannot rely on the `terminator` feature)

I would rather deal with guessing whether this instance of `_` is suitable or not than have no ability to do it at all.
Another case where this is needed, from a file I'm trying to parse: I have an array of `my_type` that is terminated by the value `0xFFFFFFFF`. I am not aware of any non-stupid method that would at least prevent me from reading the next 4 bytes after the array terminator. I find it a little ridiculous that a similar feature exists for strings (`consume: false`), that this exact inability of the language is explicitly mentioned in section 4.10.3 of the user guide, and that this issue has been open since 2017.
```yaml
types:
  my_type:
    seq:
      - id: item1
        type: u2
      - id: item2
        type: u2
      - id: item3
        type: u2
      - id: item4
        type: u2
```
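For illustration, here is how the intended read could look in plain Python (big-endian is assumed, and so is that the first four bytes of a record can never themselves equal the terminator; neither is stated in the original comment):

```python
import io
import struct

TERMINATOR = 0xFFFFFFFF

def read_my_types(stream: io.BytesIO):
    """Read my_type records (four u2 fields) until the u4 terminator,
    consuming the terminator but nothing after it."""
    items = []
    while True:
        pos = stream.tell()
        (head,) = struct.unpack(">I", stream.read(4))
        if head == TERMINATOR:
            break                 # terminator consumed; stop here
        stream.seek(pos)          # not the terminator: re-read as fields
        fields = struct.unpack(">4H", stream.read(8))
        items.append(dict(zip(("item1", "item2", "item3", "item4"), fields)))
    return items
```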
CC @qequ
I'm looking for this feature as well. I'm working on a specific format which uses a "null" entry as a separator. The following specification parses the format properly, but it includes the null entry in `entries`. I would like to have the entries and the "null entry" separated if possible.
```yaml
seq:
  - id: entries
    type: entry
    repeat: until
    repeat-until: _.file_name == ""
types:
  entry:
    seq:
      - id: file_name
        type: strz
        encoding: ASCII
      - id: mime_type
        size: 4
      - id: original_size
        type: u4
      - id: offset
        type: u4
      - id: timestamp
        type: u4
      - id: datasize
        type: u4
```
@davidhicks thanks for the tip, I was able to bypass my problem with looking forward using an `instance`.
I'm guessing this would solve my use-case too. I am currently unable to parse this:
```
f1 f4 41 20 73 65 6e 74 65 6e 63 65 3a f4 41 6e  ..A sentence:.An
6f 74 68 65 72 20 73 65 6e 74 65 6e 63 65 2e f1  other sentence..
f4 50 61 72 61 67 72 61 70 68 20 32 f0 01 02 03  .Paragraph 2....
```
- `f1` represents the start of a paragraph and in reality has some more fields before the text-segments start.
- `f4` represents the start of a text-segment and in reality has some more fields like font color and style.
- `f0` represents the end of a page.
The desired result would be something like:
```json
{
  "pages": [
    {
      "paragraphs": [
        {
          "textSegments": [
            { "text": "A sentence:" },
            { "text": "Another sentence." }
          ]
        },
        {
          "textSegments": [
            { "text": "Paragraph 2" }
          ]
        }
      ]
    }
  ]
}
```
The issue for Kaitai is that the strings have no terminator and no length, and the number of paragraphs and textSegments within them is not specified anywhere.
In short:
- I need to repeat paragraphs until an `f0` is encountered
- within each paragraph I need to repeat textSegments until either an `f4` or an `f0` is encountered
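The marker-driven nesting described above can be sketched in Python under simplifying assumptions (the extra per-marker fields are omitted, the text is plain ASCII, and text runs until the next marker byte):

```python
def parse_pages(data: bytes):
    """Split the byte stream into pages -> paragraphs -> textSegments,
    using 0xf1 (paragraph start), 0xf4 (text-segment start) and
    0xf0 (page end) as structural markers; no terminators or lengths."""
    MARKERS = {0xF0, 0xF1, 0xF4}
    pages, i = [], 0
    while i < len(data):
        assert data[i] == 0xF1, "page must start with a paragraph"
        paragraphs = []
        while i < len(data) and data[i] == 0xF1:      # paragraph start
            i += 1
            segments = []
            while i < len(data) and data[i] == 0xF4:  # text-segment start
                i += 1
                start = i
                while i < len(data) and data[i] not in MARKERS:
                    i += 1                            # consume text bytes
                segments.append({"text": data[start:i].decode("ascii")})
            paragraphs.append({"textSegments": segments})
        assert i < len(data) and data[i] == 0xF0, "page must end with f0"
        i += 1
        pages.append({"paragraphs": paragraphs})
    return pages
```

This is exactly the kind of one-byte look-ahead (checking the next marker without committing to a read) that the requested feature would let a ksy spec express.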
In case it might help someone, here is a workaround I used for an enhanced PE parser I'm working on, based on the Kaitai one.
NOTE: it will work only if the size of the elements in the repeat block is fixed (no dynamically sized elements). It's based on this workaround by @generalmimon.
```yaml
import_descriptor_reader:
  seq:
    - id: invoke_items_start
      size: 0
      if: items_start >= 0
    - id: buffer
      type: import_descriptor
      repeat: until
      # reads up to and including the terminating item; the instances
      # block below takes care of skipping it
      repeat-until: _.name.value == 0 and _.first_thunk.value == 0
      size: sizeof<import_descriptor>
    - id: invoke_items_end
      size: 0
      if: items_end >= 0
  instances:
    # captures the data starting position
    items_start:
      value: _io.pos
    # captures the data ending position
    items_end:
      value: _io.pos
    # counts how many bytes we read, and excludes the last item
    items_size:
      value: items_end - sizeof<import_descriptor> - items_start
    # converts the size into a number of elements
    items_count:
      value: items_size / sizeof<import_descriptor>
    # re-reads the items, this time without the last item
    items:
      pos: items_start
      repeat: expr
      repeat-expr: items_count
      type: import_descriptor
      size: sizeof<import_descriptor>
```
I'm using a similar variation of this workaround to answer the question "find the first element of an array that satisfies a condition", in order to find the PE section that contains the given virtual address.
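The same two-pass trick (scan to the sentinel to learn the count, then rewind and re-read) can be expressed in plain Python; the record size and all-zero sentinel here are made up to stand in for `import_descriptor`:

```python
import io

RECORD_SIZE = 8                   # stand-in for sizeof<import_descriptor>
SENTINEL = b"\x00" * RECORD_SIZE  # all-zero terminating record

def read_until_sentinel(stream: io.BytesIO):
    """Two-pass read of fixed-size records: first scan until the all-zero
    sentinel (or end of stream) to learn the count, then rewind and re-read
    exactly that many records, leaving the sentinel consumed but excluded."""
    start = stream.tell()
    count = 0
    while True:
        chunk = stream.read(RECORD_SIZE)
        if len(chunk) < RECORD_SIZE or chunk == SENTINEL:
            break
        count += 1
    stream.seek(start)                        # second pass: re-read items
    records = [stream.read(RECORD_SIZE) for _ in range(count)]
    stream.read(RECORD_SIZE)                  # skip over the sentinel
    return records
```

The ksy workaround above does the same thing with `items_start`/`items_end` value instances playing the role of `tell()` and the `items` instance playing the role of the second pass.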