ion-docs icon indicating copy to clipboard operation
ion-docs copied to clipboard

Specify handling of repeated field names in system Structs

Open PeytonT opened this issue 4 years ago • 8 comments

This came up while reviewing a pull request on serde_ion.

Background

Ion specifically allows for Struct values to contain duplicate field names. [1, 2]

But throughout the specification, sections are written as though this were not the case. There are a number of references to "the {field_name} field", and the specification provides for the scenarios where no field with a particular name exists, but not for the scenario where multiple exist.

The section for local symbol tables encountered in an Ion stream is one example.

  • The imports field should be the symbol $ion_symbol_table or a list as specified in the following section.
  • The symbols field should be a list of strings. If the field is missing or has any other type, it is treated as if it were an empty list.

As is the section for import structs in a local symbol table.

  • If no name field is defined, or if it is not a non-empty string, the import clause is ignored.
  • If the name field is "$ion", the import clause is ignored.
  • If no version field is defined, or if it is null, not an int, or less than 1, act as if it is 1.
  • If a max_id field is defined but is null, not an int, or less than zero, act as if it is undefined.

This issue also exists for shared symbol tables, but due to the opening disclaimer this may not actually be a flaw in the specification. A higher-level convention could exist for how repeated field names are handled. But I think that the standard should define the handling.

  • This section defines the serialized form of shared symbol tables. Unlike local symbol tables, the Ion parser does not intrinsically recognize or process this data; it is up to higher-level specifications or conventions to define how shared symbol tables are communicated.

There are tests in ion-tests/bad that cover a subset of these scenarios, specifically repeated imports and symbols fields in local symbol tables. Strictly speaking, I don't think the behavior of those tests is supported by the specification.

Updating the specification to fix this was mentioned in a related issue but doesn't seem to have happened.

Possible fixes

  1. Following the JSON philosophy: Leave it up to the implementation to select a matching field or to error.
  2. Handle this on a case-by-case basis, updating the spec with required behavior in the same way that the spec currently has case-by-case handling for what to do if a field is absent or holds an unusual value.
  3. Define a convention that when the specification refers to "the {field_name} field" it is referring to the first (alternately, the last) field with that name.
  4. Define a convention that whenever the specification refers to a field by name, it is an error for multiple fields to exist with that name.

Considerations

Inferring from the tests in ion-tests, option 4 seems to be the option that would specify the behavior of the existing implementations (though of course option 1 would as well). This would likely be the easiest to move forward with, requiring only updating the spec and adding some more tests in ion-tests to cover all of the other named fields in system structs.

But option 4 is also the worst option from a technical standpoint, for the same reason given in the FAQ for why duplicate fields are permitted in the first place.

It rules out both the high-performance approach of doing a skip-search scan across field names in the binary encoding, as well as the developer-friendly approach of just deserializing all of the fields into a clobbering hash table and then looking up the required names.

Recommendation

Option 1 would be my recommendation, given that Ion is a superset of JSON. I think this lines up best with the fact that names should be unique. People should not construct weird system Structs, and if they do, they might get weird behavior. In practice no behavior should change, since no implementations should be writing repeated fields in system structs, and existing implementations would have rejected (at least some of) them.

The work required for this change would only be to add a second caveat to the data model definition of Structures and remove tests in ion-tests.

When two fields in the same struct have the same name we say there are “repeated names” or (somewhat misleadingly) “repeated fields”. Implementations must preserve all such fields, i.e., they may not discard fields that have repeated names. However, implementations may reorder fields (the binary format identifies structs that are sorted by symbolID), so certain operations may lead to nondeterministic behavior.

would become something like

When two fields in the same struct have the same name we say there are “repeated names” or (somewhat misleadingly) “repeated fields”. Implementations must preserve all such fields, i.e., they may not discard fields that have repeated names. However, implementations may reorder fields (the binary format identifies structs that are sorted by symbolID) and may error or return any matching field when reading a repeated name from system structs, so certain operations may lead to nondeterministic behavior.

PeytonT avatar Mar 01 '20 22:03 PeytonT

Thanks for the detailed issue. I agree that the spec can use clarification here, and we should have done that when the test cases were added that prohibit repeated field names in symbol table structs.

We will need to opt for a solution like your option 4 above. In other words, we must prohibit repeated field names for system fields in system structs, and we can probably address all cases with a single statement at the top of the Symbols page of the spec. With that in mind, let me explain why by addressing some of your points.

Our decision about what to do in this case is guided by two philosophies, which we follow whenever practical: a) Ion data must be equivalent when read by any Ion implementation. If this cannot be guaranteed, parsing should fail as early as possible. Violating this rule leads to data that can only be read by certain implementations, a problem common with JSON and its variants. Option 1 violates this rule. b) An Ion writer must not allow the user to write data that violates (a). Violating this rule can lead to the need for expensive backfilling to try to salvage the data.

This issue also exists for shared symbol tables, but due to the opening disclaimer this may not actually be a flaw in the specification.

The same rules described above should apply to shared symbol tables. Some Ion libraries provide utilities that write and parse shared symbol tables via in-memory representations that can be added to a catalog for use by Ion writers and readers. The constraint is needed so that each implementation of these utilities handles shared symbol tables in the same way, since the contents of the shared symbol table affects the meaning of data.

But option 4 is also the worst option from a technical standpoint, for the same reason given in the FAQ for why duplicate fields are permitted in the first place.

It rules out both the high-performance approach of doing a skip-search scan across field names in the binary encoding, as well as the developer-friendly approach of just deserializing all of the fields into a clobbering hash table and then looking up the required names.

Prohibiting duplicate field names in symbol tables does not suffer from the O(N) space problem described in the linked FAQ because there is a known, fixed number of system fields that need to be checked. We don't need to specify that all field names in symbol tables must be unique (i.e. even open content), just symbols and imports (and, within import descriptor structs, name, version, and max_id).

For the same reason, it doesn't affect skip-scanning over open content; in structs, the field name symbol always precedes the value, so the value can be skipped without parsing once the field name is determined to not match one of the system fields. This field name check would need to be done regardless of which option from above is chosen.

As you state, it may prevent you from being able to read the struct into a map whose implementation you don't control, but it's not clear to me that that's something that would actually be desirable to do.

People should not construct weird system Structs, and if they do, they might get weird behavior.

Following rule (b) stated above, prohibiting duplicate field names on read also implies that they should be prohibited on write. So by going with Option 4, we will help users avoid getting in this situation that can be a large headache once data is already serialized. I think we can give implementations freedom about how to handle this case on write, as long as no implementation writes a symbol table with repeated system fields in the same struct.

For example, an writer may fail, or it may intercept the symbol table being written and combine repeated fields in memory in a well-defined way before writing out a valid symbol table that does not violate the rules stated above.


Let me know if anything is unclear. I'm happy to continue the discussion.

tgregg avatar Mar 04 '20 22:03 tgregg

We will need to opt for a solution like your option 4 above. In other words, we must prohibit repeated field names for system fields in system structs, and we can probably address all cases with a single statement at the top of the Symbols page of the spec.

I agree that this would be sufficient. Moreover I think the prohibition has to be restricted to system fields, else the change would be backwards-incompatible, as the spec currently says that non-system fields should be ignored.

a) Ion data must be equivalent when read by any Ion implementation. If this cannot be guaranteed, parsing should fail as early as possible.

This seems reasonable. As a corollary, I think this means that parsers can't have any limitations that can't be detected by the parser at runtime.

b) An Ion writer must not allow the user to write date that violates (a)

It's not clear to me what this means. If this this intended to mean that an Ion writer should never write output that is not Ion, that seems like it goes without saying.

Alternatively this could mean that there exists valid Ion that when read by a correct Ion parser has multiple valid results. I don't think any such valid Ion exists (though option 1 proposes that it could). Regardless, I think this is a good principle. Even if ambiguous Ion does exist, Ion writers should not write it if there is a non-ambiguous alternative.

For the same reason, it doesn't affect skip-scanning over open content; in structs, the field name symbol always precedes the value, so the value can be skipped without parsing once the field name is determined to not match one of the system fields. This field name check would need to be done regardless of which option from above is chosen.

I think my concern here has been misunderstood. I had intended that "scanning" convey stopping once the required values are found. The performance impact is not that field values would need to be parsed deeper than is required to determine their byte length, the issue is that the once symbols and imports are found, all remaining field names still need to be checked, despite being ignored, because they might be repeats of imports/symbols.

Following rule (b) stated above, prohibiting duplicate field names on read also implies that they should be prohibited on write.

Oddly enough, prohibiting duplicate field names on write intrinsically ensures that they will not be present on read, as a representation that contains them would then not be Ion. This still achieves the goal in (a), since "Ion data must be equivalent when read by any Ion implementation" is also satisfied by specifying that representations containing duplicate field names are not Ion.

But I think it's more valuable to preserve the invariant that an Ion parser should always fail if used to parse a representation that is anything other than valid Ion.


All these points being considered, it seems like option 3 (with the "first" variant) fits best. Parses are always unambiguous, (a) and (b) are achieved, and early-terminating a field name scan once (imports + symbols) / (name + version + max_id) are located is valid. Ion writers get flexibility on whether or not to write repeated system fields, since in either case they will be ignored.

No existing valid Ion would become invalid, and while some existing invalid Ion would become valid, I doubt much exists.

PeytonT avatar Mar 05 '20 05:03 PeytonT

I will argue that the case in which symbol tables have extra non-system fields is so rare that it is not worth optimizing for. That's because no existing Ion writer implementation that I know of will actually write open content within symbol table structs, and I don't think there is really a use case for that.

For that reason, I'm still leaning toward Option 4, but I am open to additional opinions.

tgregg avatar Mar 05 '20 21:03 tgregg

That's because no existing Ion writer implementation that I know of will actually write open content within symbol table structs, and I don't think there is really a use case for that.

I assumed that the reason why extra fields in a symbol table are ignored instead of prohibited is to support optional extensions, both official and user-defined. I don't see why the spec would not have prohibited them if it did not intend to support writing to them. In fact, this has led me to thinking about proposing an optional extension, where a binary local symbol table could present a max_subfield_magnitude field to inform the parser that it has the option to switch to a more efficient codepath until the next IVM is encountered.

If no writer will write them, and you don't think there is a use case for them, I'm inclined to propose the more radical fix that this problem be solved by specifying that a local symbol table contains only and exactly one imports and one symbols, with the existing logic for how to derive their values if they are absent or malformed, and likewise for name/version/max_id.

For that reason, I'm still leaning toward Option 4, but I am open to additional opinions.

Anyhow, I don't think there are much in the way of additional options. Faced with the issue of multiple of something which there should only be one of, the options are really only

  • error
  • select one in a deterministic way
  • select one in an arbitrary way
  • somehow merge them

The only remaining plausible thing I can think of that a parser could do is somehow merge duplicate fields. But what would that mean for system fields like max_id that are not collections? It wouldn't make sense without a lot of superstructure do define what "merging" means. That seems like an unreasonably complicated approach to solving this problem.

PeytonT avatar Mar 07 '20 19:03 PeytonT

It sounds like we agree on the approach: to raise an error if more than one of the same system field is defined in a system struct. Is that accurate?

tgregg avatar Mar 09 '20 19:03 tgregg

I'm not opposed to that solution. Is there a reason not to go further, and reject all unexpected fields in system structs?

PeytonT avatar Mar 10 '20 05:03 PeytonT

Is there a reason not to go further, and reject all unexpected fields in system structs?

I'm not sure we'd really gain anything by making this change. Historically we've wanted to support open content in system structs, but the reality is that full support has always fallen below the line when it comes to prioritization. Right now, we at least support reading such symbol structs with correct system semantics, even if the open content would be lost on round trip.

tgregg avatar Mar 11 '20 19:03 tgregg

Right now, we at least support reading such symbol structs with correct system semantics, even if the open content would be lost on round trip.

It seems like this is not the case, since open content in a system struct won't make it into the stream of values read from the byte stream. Since the fields are ignored the content is never read.

I'm not sure we'd really gain anything by making this change.

It preserves (actually, enhances) the optimization potential that inclined me toward options 1 and 3. If both repeated and unexpected fields were rejected, a parser would never have to read more than 2 fields to determine the validity of a system symbol table (or 3 for an import). Either there are fewer than 2 fields, or if there are two fields, they must be symbols and imports and the end of the second field value must coincide with the end of the struct. This technically breaks down in the face of NOP padding, but prohibiting NOP padding in system structs seems like an easy addition if everything that the NOP pad could be converted into is already prohibited. If the NOP padding is for alignment purposes, the writer can just pad the symbols or imports instead.

PeytonT avatar Mar 12 '20 02:03 PeytonT