ion-docs icon indicating copy to clipboard operation
ion-docs copied to clipboard

Clarify status of repeated fields in sorted binary structs

Open PeytonT opened this issue 4 years ago • 2 comments

Bear with me, I'm probably almost out of questions about specification ambiguities.

The specification describes that binary structs can be ordered.

When L is 1, the struct has at least one symbol/value pair, the length field exists, and the field name integers are sorted in increasing order. http://amzn.github.io/ion-docs/docs/binary.html#13-struct

This is not in any way optional, so it should be an error if L is 1 and the field name integers are not sorted in increasing order.

The spec says "increasing order", as opposed to "non-decreasing" order. But it does not say "monotonically increasing order". It's not clear if "increasing" should be interpreted as non-decreasing or as monotonically increasing. The distinction determines whether repeated fields prohibited in sorted structs.

My impression is that the specification of the struct data model hints that they are prohibited, but without clearly stating it.

When two fields in the same struct have the same name we say there are “repeated names” or (somewhat misleadingly) “repeated fields”. Implementations must preserve all such fields, i.e., they may not discard fields that have repeated names. However, implementations may reorder fields (the binary format identifies structs that are sorted by symbolID), so certain operations may lead to nondeterministic behavior. http://amzn.github.io/ion-docs/docs/spec.html#struct

The parenthetical appearing where it does suggests that it is an exception to "However, implementations may reorder fields". But if sorted structs can have repeated fields, repeated fields can be reordered while preserving that the field name integers are sorted. Thus, certain operations on sorted structs can still lead to nondeterministic behavior, and the parenthetical feels misleading.

But the recommendation to use symbol ID zero for NOP padding does not indicate that including multiple NOP pads with symbol ID zero in a sorted struct would be an error, which suggests otherwise. (Though this raises the question of whether NOP padding field names should be ignored before or after asserting that the field name integers are sorted in increasing order.)

NOP Padding in struct values requires additional consideration of the field name element. If the “value” of a struct field is the NOP pad, then the field name is ignored. This means that it is not possible to encode padding in a struct value that is less than two bytes. Implementations should use symbol ID zero as the field name to emphasize the lack of meaning of the field name. http://amzn.github.io/ion-docs/docs/binary.html#13-struct

What was the intended behavior?

PeytonT avatar Mar 10 '20 05:03 PeytonT

I don't think the spec intends to prohibit repeated fields in ordered structs. I believe the second passage you quoted can be summarized as "Repeated fields must not be discarded, but they may be reordered."

It also doesn't seem like we need any special cases for NOP padding in sorted structs.

Do you have a suggestion for how to make the spec language regarding sorted structs more clear?

tgregg avatar Mar 11 '20 19:03 tgregg

I suggest removing the parenthetical, since it does not qualify or clarify any of the statements in the section.

I think the spec would be perfectly clear were it written:

When two fields in the same struct have the same name we say there are “repeated names” or (somewhat misleadingly) “repeated fields”. Implementations must preserve all such fields, i.e., they may not discard fields that have repeated names. However, implementations may reorder fields, so certain operations may lead to nondeterministic behavior.

Do you have a suggestion for how to make the spec language regarding sorted structs more clear?

With this cleared up I think the spec language about the behavior of sorted structs is clear. But a justification for sorted structs would be helpful, since as far as I can tell they do not serve any purpose.

The specification begins by establishing that

The Amazon Ion specification has three parts:

  • A set of data types
  • A textual notation for values of those types
  • A binary notation for values of those types

All three views are semantically isomorphic, meaning they can represent exactly the same data structures, and an Ion processor can transcode between the formats without loss of data.

In the data model structs are specifically unordered, and since it is impossible for the binary representation to encode semantic information not present in the data model, a compliant parser is free to reorder the fields of a sorted binary struct when reading it into the data model. Moreover, an implementation should not contain any mechanism within the data model representation to distinguish a sorted from an unsorted struct, since no distinction exists.

So why do binary sorted structs exist at all? My best guess is that ensuring a linear memory access pattern during symbol resolution would give improved cache performance due to prefetching, but this would only be relevant for relatively small stride sizes on a local symbol table with a massive number of symbols.

PeytonT avatar Mar 12 '20 03:03 PeytonT