Canonicalization: container types
Canonicalization problem statement: Given the logical value of an instance of a model class, define the single canonical serialized byte sequence that represents that instance.
We discussed container classes at the 24 Feb 2023 canonicalization meeting, and one of the goals was to make canonical processing both a) general purpose / flexible and b) simple. These are often conflicting goals, but it is possible to address both by defining an information model.
Sebastian presented a hypothetical example of serialized JSON containers:
[
  {
    "byteRangeStart": 23,
    "byteRangeEnd": 120
  },
  {
    "byteRangeStart": 312,
    "byteRangeEnd": 352
  }
]
The example has two container levels: a group of ranges, and start and end values of each range. We can see the JSON data types of each container (list and object, respectively) but we don't know anything about the semantics of each container as defined by a hypothetical logical model.
UML Section 7.8.8 defines MultiplicityElement attributes:
- lower (non-negative integer)
- upper (unlimited natural number)
- isUnique (boolean)
- isOrdered (boolean)
Logical models specify lower and upper cardinality bounds but generally not ordering or uniqueness constraints. The typical names used for containers with these constraints and the corresponding JSON types are:
| Container Type | isOrdered | isUnique | JSON type |
|---|---|---|---|
| List | true | false | Array |
| Set | false | true | Object (keys) |
| OrderedSet | true | true | none |
| Bag | false | false | none |
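As a rough illustration of what the two flags mean for equality, the following Python sketch names the container type for a pair of flags and compares two groups of values under them. The function names are invented for this example, and the values are assumed to be hashable scalars.

```python
from collections import Counter

# Container names from the table, keyed by (isOrdered, isUnique).
CONTAINER_NAMES = {
    (True, False): "List",
    (False, True): "Set",
    (True, True): "OrderedSet",
    (False, False): "Bag",
}

def logically_equal(a, b, is_ordered, is_unique):
    """Compare two groups of hashable values under the given constraints."""
    if is_unique and (len(set(a)) != len(a) or len(set(b)) != len(b)):
        raise ValueError("repeated value in a container declared isUnique")
    if is_ordered:
        return list(a) == list(b)
    # Unordered: compare as multisets, so repeats still count for a Bag.
    return Counter(a) == Counter(b)

print(CONTAINER_NAMES[(False, True)])                  # Set
print(logically_equal([1, 2], [2, 1], True, False))    # List: order matters
print(logically_equal([1, 2], [2, 1], False, True))    # Set: order ignored
print(logically_equal([1, 1, 2], [1, 2], False, False))  # Bag: counts matter
```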
Class: The logical value of a class with named properties is a map with a Set of keys where property order does not matter. But canonicalization requires a fixed ordering. This can usually be achieved by sorting property names into lexical order, but when serialized values are viewed by people it may be desirable to present them in a different order, such as start before end, or name first. An information model defines a fixed canonical order (OrderedSet) for properties of each class, both supporting a desired presentation order and avoiding runtime sorting.
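A minimal Python sketch of the difference between lexical order and a model-defined order. The `CANONICAL_ORDER` table and the class name `Range` are hypothetical stand-ins for what an information model would supply; they are not taken from any existing model.

```python
import json

# Hypothetical information-model entry: the canonical property order per class.
CANONICAL_ORDER = {"Range": ["byteRangeStart", "byteRangeEnd"]}

def canonical_bytes(class_name, instance):
    """Serialize a class instance with properties in the model-defined order."""
    order = CANONICAL_ORDER[class_name]
    unknown = set(instance) - set(order)
    if unknown:
        raise ValueError(f"properties not in the model: {sorted(unknown)}")
    # dicts preserve insertion order, so json.dumps emits keys in model order.
    ordered = {name: instance[name] for name in order if name in instance}
    return json.dumps(ordered, separators=(",", ":")).encode("utf-8")

value = {"byteRangeEnd": 120, "byteRangeStart": 23}
print(canonical_bytes("Range", value))                            # start before end
print(json.dumps(value, sort_keys=True, separators=(",", ":")))   # lexical: end before start
```

Both outputs are deterministic, so either could serve as a canonical form; the model-defined order simply matches the presentation order and needs no sort at runtime.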
For both efficient data formats (JSON data for web forms, CBOR, Thrift, Avro, Protobuf, ...) and table presentations (spreadsheets), defining a fixed property order allows classes to be serialized as a List of values without transmitting any keys.
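The keyless form can be sketched in a few lines of Python. `RANGE_FIELDS` is an assumed field order for the range example, and every field is assumed to be required; optional fields would need extra handling.

```python
import json

# Hypothetical fixed property order for the range class in the earlier example.
RANGE_FIELDS = ["byteRangeStart", "byteRangeEnd"]

def to_values(instance, fields):
    """Drop the keys: emit property values in the fixed field order."""
    return [instance[name] for name in fields]

def from_values(values, fields):
    """Restore the keys from position; all fields are assumed required."""
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} values, got {len(values)}")
    return dict(zip(fields, values))

ranges = [{"byteRangeStart": 23, "byteRangeEnd": 120},
          {"byteRangeStart": 312, "byteRangeEnd": 352}]
concise = json.dumps([to_values(r, RANGE_FIELDS) for r in ranges],
                     separators=(",", ":"))
print(concise)
# Decoding with the same field order recovers the original logical values.
restored = [from_values(v, RANGE_FIELDS) for v in json.loads(concise)]
print(restored == ranges)
```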
Multiplicity: The logical value of a group of values (maximum cardinality greater than one) can be any of the four container types, but the type is seldom declared. Canonicalizing an unordered container (Set or Bag) requires significant effort, both to define a hierarchical sorting algorithm and to perform it at runtime (analogous to creating a Merkle tree). Declaring groups to be logically order-preserving (List or OrderedSet) therefore makes canonicalization easier to define and perform, even when a specific order is not significant to applications.
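To give a feel for the runtime cost of unordered containers, here is one possible hierarchical sort in Python. It is an illustrative sketch, not a proposed algorithm: it assumes every JSON array is unordered, sorts object keys lexically, and orders array elements by their own canonical text, so nested containers are canonicalized bottom-up.

```python
import json

def canonicalize(value):
    """Return canonical JSON text, treating every array as an unordered group."""
    if isinstance(value, dict):
        members = (json.dumps(key) + ":" + canonicalize(value[key])
                   for key in sorted(value))
        return "{" + ",".join(members) + "}"
    if isinstance(value, list):
        # Each element is canonicalized first, then the results are sorted,
        # so the work grows with the depth and size of the whole structure.
        return "[" + ",".join(sorted(canonicalize(item) for item in value)) + "]"
    return json.dumps(value)

a = [{"byteRangeStart": 23, "byteRangeEnd": 120},
     {"byteRangeStart": 312, "byteRangeEnd": 352}]
b = [{"byteRangeEnd": 352, "byteRangeStart": 312},
     {"byteRangeEnd": 120, "byteRangeStart": 23}]
print(canonicalize(a) == canonicalize(b))
print(canonicalize(a))
```

If the group were declared order-preserving, the list branch would reduce to serializing elements in the order given, with no sorting step.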
Decisions
- Should the serialization/information model define a canonical order for class properties?
- Should the logical model require property values with cardinality>1 to be order-preserving?
For people who were not in the call and might be wondering, please note this is a hypothetical example of some data.
The data type looks like PositiveIntegerRange, but in the current model there is no property that has two such values.
Consider this as a generic example for a series of complex data type values.
Here are some random data that include nested structures. Remember: all arrays are (unordered) sets!
[
  {
    "isFake": true,
    "description": "Pariatur magna non cillum reprehenderit.",
    "created": "2017-03-09T09:04:35 -01:00",
    "profiles": [ "culpa", "laboris", "proident", "consequat" ],
    "people": [
      {
        "firstname": "Teri",
        "lastname": "Byers",
        "coordinates": [ 61.845994, 109.792315 ]
      },
      {
        "firstname": "Madden",
        "lastname": "Wilder",
        "coordinates": [ 42.087053, 156.674823 ]
      },
      {
        "firstname": "English",
        "lastname": "Pruitt",
        "coordinates": [ 42.488244, -153.815277 ]
      },
      {
        "firstname": "Carney",
        "lastname": "Herrera",
        "coordinates": [ 5.45032, -178.228057 ]
      }
    ]
  },
  {
    "isFake": true,
    "description": "Duis qui culpa fugiat esse cillum.",
    "profiles": [ "dolore", "irure", "aliquip" ],
    "people": [
      {
        "firstname": "Spears",
        "lastname": "Mooney",
        "coordinates": [ -77.444587, -127.122345 ]
      },
      {
        "firstname": "Juliet",
        "lastname": "Beck",
        "coordinates": [ -4.476736, -138.625264 ]
      }
    ]
  },
  {
    "description": "Aliquip aute nulla exercitation cupidatat.",
    "created": "2019-02-19T01:53:56 -01:00",
    "profiles": [ "ad", "sunt" ],
    "people": [
      {
        "firstname": "Lee",
        "lastname": "Schwartz",
        "coordinates": [ -65.795542, 130.787931 ]
      },
      {
        "firstname": "Haley",
        "lastname": "Fuentes",
        "coordinates": [ -70.789592, -69.978773 ]
      },
      {
        "firstname": "Moore",
        "lastname": "Ball",
        "coordinates": [ -85.688923, 123.084997 ]
      },
      {
        "firstname": "Hurst",
        "lastname": "Crawford",
        "coordinates": [ 23.385168, -82.923035 ]
      }
    ]
  }
]
or minimized:
[{"isFake":true,"description":"Pariatur magna non cillum reprehenderit.","created":"2017-03-09T09:04:35 -01:00","profiles":["culpa","laboris","proident","consequat"],"people":[{"firstname":"Teri","lastname":"Byers","coordinates":[61.845994,109.792315]},{"firstname":"Madden","lastname":"Wilder","coordinates":[42.087053,156.674823]},{"firstname":"English","lastname":"Pruitt","coordinates":[42.488244,-153.815277]},{"firstname":"Carney","lastname":"Herrera","coordinates":[5.45032,-178.228057]}]},{"isFake":true,"description":"Duis qui culpa fugiat esse cillum.","profiles":["dolore","irure","aliquip"],"people":[{"firstname":"Spears","lastname":"Mooney","coordinates":[-77.444587,-127.122345]},{"firstname":"Juliet","lastname":"Beck","coordinates":[-4.476736,-138.625264]}]},{"description":"Aliquip aute nulla exercitation cupidatat.","created":"2019-02-19T01:53:56 -01:00","profiles":["ad","sunt"],"people":[{"firstname":"Lee","lastname":"Schwartz","coordinates":[-65.795542,130.787931]},{"firstname":"Haley","lastname":"Fuentes","coordinates":[-70.789592,-69.978773]},{"firstname":"Moore","lastname":"Ball","coordinates":[-85.688923,123.084997]},{"firstname":"Hurst","lastname":"Crawford","coordinates":[23.385168,-82.923035]}]}]
Test data should include both valid and invalid examples to ensure that data corresponds to the logical model.
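A negative example with repeated values tests whether the people array is treated as a Set.
A negative example with out-of-order values tests whether the coordinates array is correctly converted to a property Set. The example below contains both: the first person is repeated, and one coordinates array has its two values swapped: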
{
  "people": [
    {
      "firstname": "Teri",
      "lastname": "Byers",
      "coordinates": [ 61.845994, 109.792315 ]
    },
    {
      "firstname": "Madden",
      "lastname": "Wilder",
      "coordinates": [ 156.674823, 42.087053 ]
    },
    {
      "firstname": "English",
      "lastname": "Pruitt",
      "coordinates": [ 42.488244, -153.815277 ]
    },
    {
      "firstname": "Teri",
      "lastname": "Byers",
      "coordinates": [ 61.845994, 109.792315 ]
    }
  ]
}
The coordinates value is an example of how an information model defines serialization capabilities that are impossible to express using JSON Schema:
- The logical value is a Set of two properties, assumed to be latitude and longitude, both required
- The JSON serialized value is an Array with no property names
An information model that validates these examples is:
Team = Record
   1 isFake       Boolean optional
   2 description  String
   3 created      DateTime optional
   4 profiles     ArrayOf(Profile){1..*} set
   5 people       ArrayOf(Person){2..4} set
Person = Record
   1 firstname    String
   2 lastname     String
   3 coordinates  Coordinate
Coordinate = Record array
   1 latitude     Latitude
   2 longitude    Longitude
   3 altitude     Altitude optional
- We guess that the top-level type might be something called a "Team", and from the data that a team conservatively has between two and four members (it is easier to guess unlimited size [1..*], but extrapolating from known data is more precise). Reverse-engineering a logical model from data involves many guesses and assumptions, some of which can only be resolved by knowing the logical model.
- We assume that the "Coordinate" type is defined in a logical model. An information model maps the Coordinate logical type to the pre-defined "Record" information type. A logical Coordinate instance is the same regardless of whether or how that instance is serialized.
- The "Record" information type is a container holding a set of fields, where each field has a name, an ordinal position, and a type. Field names and positions are local to the container.
- "set" is a type option that gives an ArrayOf information type the constraints of a Set: isOrdered is false and isUnique is true.
- "array" is a type option indicating that Record instances are always serialized as an ordered list of values in the specified order. Without the "array" option, Record instances are serialized either as a map of named properties in verbose data formats or an ordered list of values in concise data formats.
- Since we are not given a logical model, we assume from both the property name "coordinates" and the given array values that the array represents a GPS location.
- From this data sample we have no way of knowing if a logical Coordinate has additional optional properties, so we include an example "altitude" property to illustrate the possibility.
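The notes above can be made concrete with a small validator sketch in Python. Everything here is an assumption made for illustration: the tuple encoding of the model, the mapping of Profile, DateTime, Latitude, Longitude and Altitude to JSON primitives (the text does not define them), the degree ranges for Latitude and Longitude, and treating "created" as optional because one sample omits it.

```python
import json

# Hypothetical encoding of the model. Record fields are (name, type, optional);
# an ArrayOf type is ("ArrayOf", item type, lower, upper, unique), upper None = "*".
PRIMITIVES = {"Boolean": bool, "String": str, "DateTime": str, "Profile": str,
              "Latitude": float, "Longitude": float, "Altitude": float}
# Assumed ranges in degrees; the source does not define Latitude or Longitude.
RANGES = {"Latitude": (-90.0, 90.0), "Longitude": (-180.0, 180.0)}
RECORDS = {
    "Team": (False, [
        ("isFake", "Boolean", True),
        ("description", "String", False),
        ("created", "DateTime", True),  # assumed optional: one sample omits it
        ("profiles", ("ArrayOf", "Profile", 1, None, True), False),
        ("people", ("ArrayOf", "Person", 2, 4, True), False)]),
    "Person": (False, [
        ("firstname", "String", False),
        ("lastname", "String", False),
        ("coordinates", "Coordinate", False)]),
    "Coordinate": (True, [  # True = "array" option: serialized positionally
        ("latitude", "Latitude", False),
        ("longitude", "Longitude", False),
        ("altitude", "Altitude", True)]),
}

def check_primitive(type_name, value, path):
    expected = PRIMITIVES[type_name]
    if expected is float:  # accept JSON integers too, but not booleans
        ok = isinstance(value, (int, float)) and not isinstance(value, bool)
    else:
        ok = isinstance(value, expected)
    if not ok:
        return [f"{path}: expected {type_name}"]
    if type_name in RANGES:
        low, high = RANGES[type_name]
        if not low <= value <= high:
            return [f"{path}: {type_name} out of range"]
    return []

def validate(type_ref, value, path="$"):
    """Return a list of error strings; an empty list means the value conforms."""
    if isinstance(type_ref, tuple):  # ArrayOf
        _, item_type, lower, upper, unique = type_ref
        if not isinstance(value, list):
            return [f"{path}: expected an array"]
        errors = []
        if len(value) < lower or (upper is not None and len(value) > upper):
            errors.append(f"{path}: {len(value)} items violates multiplicity")
        keys = [json.dumps(item, sort_keys=True) for item in value]
        if unique and len(set(keys)) < len(keys):
            errors.append(f"{path}: repeated value in a set")
        for index, item in enumerate(value):
            errors += validate(item_type, item, f"{path}[{index}]")
        return errors
    if type_ref in PRIMITIVES:
        return check_primitive(type_ref, value, path)
    as_array, fields = RECORDS[type_ref]
    if as_array:  # map positional values back onto the named logical fields
        if not isinstance(value, list) or len(value) > len(fields):
            return [f"{path}: expected an array of at most {len(fields)} values"]
        value = {name: item for (name, _, _), item in zip(fields, value)}
    elif not isinstance(value, dict):
        return [f"{path}: expected an object"]
    names = {name for name, _, _ in fields}
    errors = [f"{path}: unknown property {key}" for key in value if key not in names]
    for name, field_type, optional in fields:
        if name in value:
            errors += validate(field_type, value[name], f"{path}.{name}")
        elif not optional:
            errors.append(f"{path}: missing required property {name}")
    return errors

people = [
    {"firstname": "Teri", "lastname": "Byers", "coordinates": [61.845994, 109.792315]},
    {"firstname": "Madden", "lastname": "Wilder", "coordinates": [156.674823, 42.087053]},
    {"firstname": "Teri", "lastname": "Byers", "coordinates": [61.845994, 109.792315]},
]
for error in validate(("ArrayOf", "Person", 2, 4, True), people, "$.people"):
    print(error)
```

Under these assumptions the shortened negative sample produces two errors: one for the repeated person and one for the latitude that is out of range once the positional array is mapped to named fields. Swapped values that both happen to fall within the latitude range would pass unnoticed, which is a limit of range checking rather than of the model.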