Ambiguity of Duplicate Symbols in the Specification
Summary
The rules for how duplicate symbol definitions are to be treated are ambiguous. We propose that duplicate definitions (imported or locally defined) of symbols that are defined in the system symbol table always be converted to $0, and that all other duplicates be defined as-is (meaning that symbols not defined in the system symbol table may be duplicated).
The rationale is two-fold:
- There can only be one SID for each such symbol, which makes parsing system values easier.
- In the presence of symbol table imports, only the system symbols are ever deterministically known. While we could special-case this substitution when no imports are available, this would add complexity that I don't believe is warranted (i.e., sometimes locally defined symbol duplicates become $0 and other times they do not).
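The proposed rule can be sketched in a few lines of Python. This is a minimal model, not code from any Ion implementation: `build_table_proposed` is a hypothetical helper, symbol tables are modelled as plain lists of text, and `None` stands for $0 (no text).

```python
# Ion 1.0 system symbols, occupying SIDs 1 through 9.
SYSTEM_SYMBOLS = [
    "$ion", "$ion_1_0", "$ion_symbol_table", "name", "version",
    "imports", "symbols", "max_id", "$ion_shared_symbol_table",
]


def build_table_proposed(imported, local):
    """Return a list indexed by SID; index 0 is $0 (no text).

    Hypothetical helper. `imported` and `local` are lists of symbol text,
    with None marking a slot whose text is unknown. Duplicates of system
    symbols become None ($0); every other duplicate is kept as-is.
    """
    table = [None] + SYSTEM_SYMBOLS
    for text in list(imported) + list(local):
        table.append(None if text in SYSTEM_SYMBOLS else text)
    return table


table = build_table_proposed(imported=[], local=["$ion_symbol_table", "foo", "foo"])
print(table[10], table[11], table[12])  # None foo foo
```

The substitution depends only on the system symbol table, which every reader and writer has, so the result does not change with the availability of imports.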
Background
During a work session with @tgregg, @zslayton, and @toddjonker, we ran into an ambiguous part of the Ion 1.0 specification around duplicate symbol definitions that is probably not implemented consistently by any implementation of Ion.
Specifically, the Local Symbol Table Semantics indicates (emphasis mine):
When mapping from string to symbol ID, there may be multiple assigned IDs; implementations MUST select the lowest known ID. If an imported table is unavailable, this may cause selection of a greater ID than would be the case otherwise. This restriction ensures that symbols defined by system symbol tables can never be mapped to other IDs.
And the Shared Symbol Table Semantics indicates:
When mapping from string to symbol ID, there may be multiple associated IDs (the same string could appear twice as children of the symbols field). Implementations MUST select the lowest known ID, and all other associated IDs MUST be handled as if undefined.
This implies that a symbol definition that is repeated must choose the lowest ID, and any other symbol definition of that text must be treated as $0. There are distinct limitations with this proposition, but let's consider the following Ion text:
$ion_1_0
$ion_symbol_table::{symbols:["$ion_symbol_table"]}
// Exhibit A
$10
// Exhibit B
$10::{symbols:["foo"]}
// Exhibit C
$10
Under the current specification, exhibit A would mean $0, exhibit B would mean $0::{symbols:["foo"]}, and exhibit C would mean $0. I was surprised by this interpretation; Ion Java, Ion Python, and Ion C do not implement this behavior.
The ambiguity arises when we have something like the following:
$ion_1_0
$ion_symbol_table::{
  imports:[{name: "tricky", version: 1, max_id: 1}],
  symbols:["foo"]
}
// Exhibit D
$11
Let's assume that tricky is defined as follows:
$ion_shared_symbol_table::{
  name: "tricky",
  version: 1,
  symbols: ["foo"]
}
Now let's assume a writer of the data in exhibit D does not have access to tricky. What does $11 mean? The specification says that the "lowest known" ID must be selected, but since a reader can never know what a writer's "lowest known" ID was when imports are used, the substitution to $0 for such encodings can never be done reliably.
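To make the ambiguity concrete, here is a small Python model of the strict "lowest ID wins, all others undefined" reading. It is a sketch under stated assumptions rather than the behavior of any Ion library: `build_table_strict` is a hypothetical helper, tables are plain lists of text, and `None` stands for $0 or unknown text.

```python
# Ion 1.0 system symbols, occupying SIDs 1 through 9.
SYSTEM_SYMBOLS = [
    "$ion", "$ion_1_0", "$ion_symbol_table", "name", "version",
    "imports", "symbols", "max_id", "$ion_shared_symbol_table",
]


def build_table_strict(imported, local):
    """Return a list indexed by SID under the strict reading.

    Hypothetical helper: the lowest SID for a given text wins and every
    later duplicate of that text is treated as $0 (None).
    """
    table = [None] + SYSTEM_SYMBOLS
    seen = set(SYSTEM_SYMBOLS)
    for text in list(imported) + list(local):
        if text is None or text in seen:
            table.append(None)
        else:
            seen.add(text)
            table.append(text)
    return table


# A party that can resolve "tricky": its single symbol is "foo".
with_tricky = build_table_strict(imported=["foo"], local=["foo"])
# A party that cannot: max_id still reserves SID 10, but its text is unknown.
without_tricky = build_table_strict(imported=[None], local=["foo"])

print(with_tricky[11])     # None, i.e. $0
print(without_tricky[11])  # foo
```

The same bytes resolve $11 to two different values depending on whether the import is available, which is the inconsistency described above.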
So this leaves us with the following question. Do we define the $0 substitution as conditional or do we eliminate this substitution in general (i.e., duplicates are allowed as-is because we cannot prevent them) or do we do something else with duplicate symbol definitions? As indicated in the summary, we believe it should only be substituted for system symbols, and allowed to be duplicated for all other symbols.
Examples
Let's consider the following repeated example:
$ion_1_0
$ion_symbol_table::{symbols:["$ion_symbol_table"]}
// Exhibit A
$10
// Exhibit B
$10::{symbols:["foo"]}
// Exhibit C
$10
This is equivalent to $ion_symbol_table::{symbols:[null]} and therefore $10 in A, B, and C are all $0.
Let's have a slightly modified example with the tricky symbol table (defined previously):
$ion_1_0
$ion_symbol_table::{
  imports:[{name: "tricky", version: 1, max_id: 1}],
  symbols:["foo"]
}
// Exhibit D
$10
// Exhibit E
$11
In this case $10 and $11 are both foo. The meaning of $11 is never ambiguous, regardless of the context of a reader or writer.
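The stability of $11 under the proposal can be checked with a short Python sketch. `resolve_proposed` is a hypothetical helper written for illustration, not an API of any Ion library; an import that cannot be resolved is modelled as a slot whose text is `None`.

```python
# Ion 1.0 system symbols occupy SIDs 1 through 9.
SYSTEM_SYMBOLS = frozenset({
    "$ion", "$ion_1_0", "$ion_symbol_table", "name", "version",
    "imports", "symbols", "max_id", "$ion_shared_symbol_table",
})
FIRST_USER_SID = 10


def resolve_proposed(sid, imported, local):
    """Text for a user-range SID under the proposed rule (None means $0
    or unknown text). Hypothetical helper for illustration only."""
    if sid < FIRST_USER_SID:
        raise ValueError("this sketch only resolves SIDs above the system range")
    text = (list(imported) + list(local))[sid - FIRST_USER_SID]
    # Only duplicates of system symbols are replaced with $0.
    return None if text in SYSTEM_SYMBOLS else text


# Reader with access to "tricky" versus reader without it.
sid11_with_import = resolve_proposed(11, imported=["foo"], local=["foo"])
sid11_without_import = resolve_proposed(11, imported=[None], local=["foo"])
print(sid11_with_import, sid11_without_import)  # foo foo
```

Only $10 depends on whether tricky can be resolved, as is the case for any imported symbol; the locally defined $11 is foo in both contexts.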
Let's consider the following shared symbol table definition:
$ion_shared_symbol_table::{
  name: "shadow",
  version: 1,
  symbols: ["$ion_symbol_table"]
}
And let's consider the following:
$ion_1_0
$ion_symbol_table::{imports:[{name: "shadow", version: 1, max_id: 1}]}
// Exhibit F
$10
// Exhibit G
$10::{symbols:["foo"]}
// Exhibit H
$10
In the above, $10 is likewise always $0: when importing shadow, the Ion processor effectively replaces the duplicate with null on load.
Appendix: Ion Python (via Ion C)
Ion Python, as an example, does not do what the specification says at all; it treats duplicate system symbols as synonyms.
>>> from amazon.ion.simpleion import *
>>> loads(
b'$ion_1_0 $ion_symbol_table::{symbols:["$ion_symbol_table"]} $10'
, single_value = False)
[IonPySymbol(text='$ion_symbol_table', sid=None, location=None)]
>>> loads(
b'$ion_1_0 $ion_symbol_table::{symbols:["$ion_symbol_table"]} $10 $10::{symbols:["foo"]} $10'
, single_value = False)
[IonPySymbol(text='$ion_symbol_table', sid=None, location=None), IonPySymbol(text='foo', sid=None, location=None)]
Appendix: Ion Java
Ion Java, as another example, also does not comply, and it behaves differently from Ion Python/Ion C.
Example in Kotlin:
import software.amazon.ion.system.IonSystemBuilder

fun main(args: Array<String>) {
    val sys = IonSystemBuilder.standard().build()
    // ${'$'} emits a literal dollar sign inside a Kotlin raw string
    val data = sys.loader.load("""
        ${'$'}ion_1_0
        ${'$'}ion_symbol_table::{symbols:["${'$'}ion_symbol_table"]}
        ${'$'}10
        ${'$'}10::{symbols:["foo"]}
        ${'$'}10
    """)
    println(data)
}
Prints out (newlines added for clarity):
$ion_1_0
$ion_symbol_table::{symbols:["$ion_symbol_table"]}
$ion_symbol_table $ion_symbol_table::{symbols:["foo"]}
$ion_symbol_table