kaitai_struct_compiler icon indicating copy to clipboard operation
kaitai_struct_compiler copied to clipboard

Fixes/standardizes the behavior of a repetition '_index' within a 'repeat-until' expression

Open transverberate opened this issue 2 years ago • 7 comments

Defines/Standardizes the Behavior of a Repetition _index within a repeat-until Expression

NOTE: This PR standardizes the previously-undefined behavior of _index within repeat_until. Fixes issue #958. Because this standardizes something that was previously undefined, this PR may warrant further discussion beyond the scope of a typical code review.

Standardizing this Behavior

Need

Issue #958 presents a basic scenario which warrants using (and therefore defining the behavior of) _index within a repeat-until expression. Generally speaking, any data structure which contains a list of entries of a variable-size N that is capable of terminating early (before N items have been processed), benefits from this functionality.

Possible Behaviors

Inclusive _index ~~(Implemented in this PR)~~

'Inclusive' means that _index gives the total number of elements processed plus the current entry. Thus, for the first item processed in a repeat-until block, _index will yield a value of 1. This is in contrast to the typical evaluation of _index outside a repeat-until expression. ~~However, I believe this treatment of _index within a repeat-until expression is justified (and, in-general, the expected behavior) - see the Reasoning for Inclusive Evaluation section.~~

Exclusive _index

'Exclusive' means that _index gives the total number of elements processed without counting the current entry. Thus, for the first item processed in a repeat-until block, _index will yield a value of 0. This is the typical behavior of _index in expressions outside repeat-until blocks.

Current Behavior

Currently, _index within repeat-until blocks is evaluated differently depending on the language target, see issue #958.

  • C++: _index is inclusive

  • C#: _index is inclusive

  • Go: _index is inclusive both within and outside repeat-until expressions. This is unlike every other language target where expressions (outside repeat-until) treat _index as exclusive (as they should).

  • Java: _index is inclusive

  • Javascript: _index is inclusive

  • Lua: _index is exclusive

  • Nim: _index is exclusive

  • Perl: Use of _index within repeat-until expressions results in code that does not compile

  • PHP: _index is inclusive

  • Python: _index is exclusive

  • Ruby: _index is inclusive

  • Rust: Use of _index within repeat-until expressions results in code that does not compile

~~Reasoning for Inclusive Evaluation~~

NOTE: This PR now implements exclusive evaluation. See the discussion in the comments.

Typically, _index is evaluated exclusively. This way, _index gives the current index of the entry being processed. My argument, however, is that by the time the condition within a repeat-until expression is being evaluated, the current item is already considered to be processed.

This is supported by the fact that the contents of the current item can be referred to by the _ reference within a repeat-until expression - further reinforcing that the current item is already be considered to have been 'processed' when evaluating repeat-until expression.

Inclusive evaluation is also consistent with most implementations of a 'do while' loop, a structure to which a repeat-until block is roughly analogous.

Finally, inclusive evaluation will provide coverage for the most common use case: where the maximum number of repetitions (N) is known, but has the potential to terminate early. Exclusive evaluation would require that the condition _index >= N - 1 is tested, rather than the cleaner _index >= N.

transverberate avatar Apr 04 '22 08:04 transverberate

Hey, thanks for looking into this and writing a detailed analysis and fix! Officially supporting _index inside repeat-until makes sense, especially because in kaitai_struct_formats we already have two specs that use this feature. In fact, these two specs apparently already expect two different meanings for _index, so there's definitely a need to standardize this.

I don't quite follow the argument for the "inclusive" _index behavior though. You are right that

by the time the condition within a repeat-until expression is being evaluated, the current item is already considered to be processed

But the repeat-until condition is still evaluated in the context of that already processed item. In particular, the "current" variable _ refers to the item that was just parsed, and not the upcoming item that has not been parsed yet (obviously that would be quite useless). So I'd argue that _index should match that and should be the index of the just parsed item, not the next one. IMO it would be counterintuitive to have _index be something different than the index of _ in the array.

I do see the advantage that the "inclusive" behavior matches what you would get in a C-style loop, where the index is incremented before the loop condition is evaluated. This indeed also makes it easier to write a maximum array size in a repeat-until condition. But I'm not sure if it's worth making _ and _index inconsistent to achieve this.

dgelessus avatar Apr 04 '22 23:04 dgelessus

Great find on the two specs which employ this already!

I've been thinking about this some more... The repeat functionality is quite limited currently. If this functionality is improved in the future such that its possible to specify whether a repeat block is inclusive/exclusive and/or when the loop condition is evaluated (perhaps some sort of repeat-while?) there may be new, cleaner ways which achieve the desired behavior (that being, terminating a variable-length list early).

If this is the case, perhaps _index should remain exclusive to keep behavior consistent across expressions. This way, current developers have a (rather hacky) means of accomplishing this behavior, while leaving room for more elegant solutions to be delivered (perhaps ones that allow the user to specify exclusive/inclusive evaluation) in the future without breaking backward compatibility.

TLDR: Maybe _index should be exclusive. Whatever the result, the behavior of _index within repeat-until should be documented in the user guide (possibly alongside an example).

transverberate avatar Apr 05 '22 03:04 transverberate

I think that _index should be exclusive, and if one needs _index + 1 as a count of already processed elements, he should be able to use _count, and if that _count is used, KSC should emit the code computing it.

KOLANICH avatar Apr 05 '22 04:04 KOLANICH

I think that _index should be exclusive, and if one needs _index + 1 as a count of already processed elements, he should be able to use _count, and if that _count is used, KSC should emit the code computing it.

Introduction of a further parameter _count is beyond the scope of this PR (at least, in my opinion). In general, I'm skeptical of introducing another parameter whose definition is simply _index + 1. However, this is just my initial impression.

My suspicion is that many of the problems of the current representation could be addressed by introducing needed functionality into a repeat block (perhaps something as you have mentioned here).

I would love to see a discussion between the senior maintainers and the user-base addressing the limitations of the repeat functionality sometime in the future.

transverberate avatar Apr 07 '22 02:04 transverberate

@dgelessus:

But the repeat-until condition is still evaluated in the context of that already processed item. In particular, the "current" variable _ refers to the item that was just parsed, and not the upcoming item that has not been parsed yet (obviously that would be quite useless). So I'd argue that _index should match that and should be the index of the just parsed item, not the next one. IMO it would be counterintuitive to have _index be something different than the index of _ in the array.

100% agree. I also think that _index should be exclusive in repeat-until - in other words, _ should be the same as items[_index]. If the repeat-until condition also depends on a previous element for example, that will be items[_index - 1] (of course this can be used only for _index >= 1, because there is no previous element for _index == 0). (Needing previous item in repeat-until appeared on Gitter and I came up with the solution where items[_index - 2] is the previous element and items[_index - 1] the current one because that's how the JavaScript code backing the Web IDE behaved, but it's certainly confusing and counter-intuitive.)

With the "inclusive" _index, items[_index] is undefined, because it would reference the future element - if we choose this behavior, there's a question what _index actually stands for when it can't be used as a subscript of the array. Oh, it makes emulating repeat-expr more "elegant" just by using repeat-until: _index == num_items (instead of repeat-until: _index == num_items - 1 for exclusive _index)? Once again, this looks weird to me too, because intuitively _index is an index to the array (if _index isn't an index, I don't know what it is) - it should be always in range 0 <= _index <= num_items - 1, so _index == num_items doesn't make sense and should never happen.

If you want to emulate repeat-expr using repeat-until (which may be useful if the repetition can be terminated before reaching the desired count when a special condition is fulfilled, as you describe in https://github.com/kaitai-io/kaitai_struct/issues/958), you'll want to finish on the last element, which has the index num_items - 1, so repeat-until: _index >= num_items - 1 (or _index == num_items - 1 if you're sure that you won't somehow miss the specific index, which can be safely assumed here, because _index is always increased by 1 and the end condition is checked in each iteration).

Looking at your list in https://github.com/kaitai-io/kaitai_struct_compiler/pull/245#issue-1191448714, this decision may break existing specs relying on the inclusive behavior, but I don't think there are that many specs using _index in repeat-until at all, so it won't matter so much. And we inevitably break some compatibility just by standardizing previously various/undefined behavior, so why not choose the one that makes more sense and will be less confusing to users (in my opinion).

generalmimon avatar Apr 07 '22 08:04 generalmimon

@generalmimon

100% agree. I also think that _index should be exclusive in repeat-until - in other words, _ should be the same as items[_index]. If the repeat-until condition also depends on a previous element for example, that will be items[_index - 1] (of course this can be used only for _index >= 1, because there is no previous element for _index == 0). (Needing previous item in repeat-until appeared on Gitter and I came up with the solution where items[_index - 2] is the previous element and items[_index - 1] the current one because that's how the JavaScript code backing the Web IDE behaved, but it's certainly confusing and counter-intuitive.)

I failed to consider the more common use of _index within a repeat-until block (use as an actual index e.g., items[_index - 1]) in my initial reasoning.

I now agree exclusive evaluation makes more sense and modified the PR to implement this. More considerable alterations were needed to achieve this result so I'm in the process of setting up the test suite and confirming proper operation.

transverberate avatar Apr 08 '22 16:04 transverberate

#787 is related documenting issue

Mingun avatar Apr 16 '22 07:04 Mingun