concise-encoding
What would be the standard way to handle enumerated types?
I've looked at the format specification, and the main issue I saw is that there is no clear way to handle enumerated types. By enumerated types, I mean names/labels, normally scoped by some group/type (to eliminate clashes if the same name is used in different contexts), that map to an integer ID. Each name/label and each ID must be unique within its group/type, which enables unambiguous bi-directional mapping. In the absence of a group/type "context", a simple "prefix" on the name/label could represent that context.
The issue is that you want to specify your enumerations as strings (labels) in the text format, but as integers in the binary format, because that is far more compact. If the format does not support this directly, everyone will invent their own way of dealing with it. At the very least, it would be good to have the specification suggest a standard solution, to make the data more portable.
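To make the requirement concrete, here is a minimal Python sketch of one enumeration with a two-way mapping. The class name `EnumType`, its methods, and the customer IDs are hypothetical illustrations, not anything defined by the format specification.

```python
class EnumType:
    """One named enumeration with a unique, two-way label <-> ID mapping."""

    def __init__(self, name, members):
        self.name = name
        # Labels are unique by construction (dict keys).
        self.id_by_label = dict(members)
        # IDs must be checked explicitly, or the reverse mapping is ambiguous.
        self.label_by_id = {}
        for label, ident in self.id_by_label.items():
            if ident in self.label_by_id:
                raise ValueError(f"duplicate ID {ident} in enum {name!r}")
            self.label_by_id[ident] = label

    def to_id(self, label):
        """Label -> integer, as the binary format would store it."""
        return self.id_by_label[label]

    def to_label(self, ident):
        """Integer -> label, as the text format would show it."""
        return self.label_by_id[ident]


# Assumed example values; the IDs are arbitrary.
customer = EnumType("customer", {"RegularCustomer": 0, "PreferredCustomer": 1})
binary_value = customer.to_id("PreferredCustomer")
text_value = customer.to_label(binary_value)
```

The uniqueness check on IDs is what makes the round trip from label to integer and back safe.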
Hmm... Something like this would require a schema to specify what enumerations are available, and their mappings. I have a schema on my TODO list, but haven't delved too deeply into its design yet. I suppose we would also need a special syntax to differentiate enumerated values from regular strings... something like #RegularCustomer, #PreferredCustomer and such... maybe? Probably best to have that kind of annotation so that humans know it's an enum and not a string...
This will require some careful design, since there's also the issue of converting between binary and text while substituting enumerated values, in which case both formats would need to know which fields are enumerations and which are regular text or integer. Probably in the schema somehow...
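As a rough illustration of why the converter needs to know which fields are enumerations, here is a small Python sketch. The `SCHEMA` layout, the field name `customer-type`, and both function names are assumptions made up for this example; no schema format exists yet.

```python
# Hypothetical schema: field name -> enum mapping. Fields that are not
# listed here are treated as ordinary values and passed through untouched.
SCHEMA = {
    "customer-type": {"RegularCustomer": 0, "PreferredCustomer": 1},
}


def text_to_binary(doc):
    """Replace enum labels with their integer IDs, field by field."""
    out = {}
    for field, value in doc.items():
        enum = SCHEMA.get(field)
        out[field] = enum[value] if enum is not None else value
    return out


def binary_to_text(doc):
    """Replace integer IDs with their enum labels, field by field."""
    out = {}
    for field, value in doc.items():
        enum = SCHEMA.get(field)
        if enum is None:
            out[field] = value
        else:
            reverse = {ident: label for label, ident in enum.items()}
            out[field] = reverse[value]
    return out


text_doc = {"name": "Alice", "customer-type": "PreferredCustomer"}
binary_doc = text_to_binary(text_doc)
round_trip = binary_to_text(binary_doc)
```

Without the schema, neither direction can tell that `customer-type` holds an enumerated value while `name` holds a plain string.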
Hello Karl. I realised, after posting the issue, that a simple mapping of name to integer, using a name "prefix" for the enumeration type, would only work one way, from name to integer, but not the other way, since the integers would not be "unique". So as you say, this can only be solved, if we also encode that a specific integer value belongs to a specific enumeration. I can't see how to do this without a schema either.
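A short Python sketch of the problem, using a made-up flat table where the enumeration type is only a prefix on the label (the `size.` and `customer.` prefixes and the function names are hypothetical):

```python
# Hypothetical flat table: the enum type is encoded only as a label prefix.
PREFIXED = {
    "size.small": 0,
    "size.medium": 1,
    "size.large": 2,
    "customer.Regular": 0,
    "customer.Preferred": 1,
}


def labels_for(ident):
    """Return every prefixed label that maps to the given integer."""
    return sorted(label for label, value in PREFIXED.items() if value == ident)


def label_for(enum_name, ident):
    """Reverse lookup that only works once the enum type is known."""
    prefix = enum_name + "."
    matches = [label for label in labels_for(ident) if label.startswith(prefix)]
    if len(matches) != 1:
        raise KeyError(f"no unique label for {ident} in enum {enum_name!r}")
    return matches[0]


ambiguous = labels_for(0)         # more than one candidate label
resolved = label_for("size", 0)   # unique once the enum type is supplied
```

Name to integer is a plain lookup, but integer to name needs the enumeration type from somewhere else, which is the information a schema would carry.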
(The reply above was sent by email, in response to Karl Stenerud's comment of Thu, Oct 22, 2020, 12:16 PM.)
All the same, it's good to be thinking of this even now before the schema spec exists, because we will need some way to mark enumerated values in the text format. I think # would be a good candidate for enumerated values. However, using enumerated types would mean that without the schema, you cannot ingest the data, or convert it to the binary format.
A (probably bad) idea occurs: Allow a special notation format so that the integer value can be added so that recipients without the schema can still handle the data.
Example:
Schema specifies:
- enum size(small=0, medium=1, large=2)
When converting from CBE to CTE with a schema, use explicit format:
c1
{
    widget = {
        head-size = #large:2 // adding a :<number> to the end to specify what the enum actually maps to
        tail-size = #small:0
    }
}
Pros:
- With this format, a decoder without the schema could still read the document.
Cons:
- If a human tries to use this format and gets the mapping wrong, we're in trouble.
But then again, if a human messes up the spelling of an enum, we're in trouble as well. Not sure how much additional trouble it would cause to allow explicit mode...
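To show what a schema-less decoder could do with the explicit form, here is a minimal Python sketch. The token grammar in `ENUM_TOKEN` (which characters a label may contain, whether negative IDs are allowed) is an assumption for illustration only, and the function names are hypothetical.

```python
import re

# Assumed token shape: '#', a label, then an optional ':<integer>' suffix.
ENUM_TOKEN = re.compile(r"#(?P<label>[A-Za-z_][A-Za-z0-9_-]*)(?::(?P<id>-?\d+))?")


def parse_enum_token(token):
    """Split '#large:2' into ('large', 2); the ID is None when omitted."""
    match = ENUM_TOKEN.fullmatch(token)
    if match is None:
        raise ValueError(f"not an enum token: {token!r}")
    ident = match.group("id")
    return match.group("label"), (int(ident) if ident is not None else None)


def to_binary_value(token):
    """Pick the integer to store in binary, using only the document itself."""
    label, ident = parse_enum_token(token)
    if ident is None:
        raise LookupError(f"{token!r} carries no ID; a schema would be needed")
    return ident


head = parse_enum_token("#large:2")
tail_binary = to_binary_value("#small:0")
bare = parse_enum_token("#medium")
```

The bare form `#medium` still parses, but converting it to binary fails without a schema, which is exactly the gap the explicit form is meant to close.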
I don't remember seeing that notation (label:id in the document, rather than in the schema) elsewhere. Personally, I really like that idea. And while a human can mess things up when creating the document, we can still validate it once we have a schema. The important thing is that the binary format can be efficient. Also, once we have a schema, a code editor plugin could fill in the ID automatically while the human is typing.
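A small Python sketch of both ideas, validation and auto-completion, assuming the `size` enum from the example above is available as a plain dictionary. The names `check_explicit` and `complete` are hypothetical, and a real editor plugin would of course involve much more than this.

```python
# The enum from the example, as a schema might expose it (assumed layout).
SIZE = {"small": 0, "medium": 1, "large": 2}


def check_explicit(enum, label, ident):
    """Return None if label:id agrees with the enum, else a problem message."""
    if label not in enum:
        return f"unknown label {label!r}"
    if enum[label] != ident:
        return f"label {label!r} maps to {enum[label]}, not {ident}"
    return None


def complete(enum, label):
    """Build the explicit token an editor could insert for a typed label."""
    return f"#{label}:{enum[label]}"


ok = check_explicit(SIZE, "large", 2)
wrong_id = check_explicit(SIZE, "large", 1)
misspelled = check_explicit(SIZE, "lrage", 2)
suggestion = complete(SIZE, "small")
```

Both human mistakes mentioned in the thread, a wrong ID and a misspelled label, are caught by the same check once the schema is at hand.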
(The reply above was sent by email, in response to Karl Stenerud's comment of Sat, Oct 24, 2020, 4:47 PM.)
OK, here's a first stab at it:
- https://github.com/kstenerud/concise-encoding/blob/master/ce-structure.md#constant
- https://github.com/kstenerud/concise-encoding/blob/master/cte-specification.md#constant
Implementing them as constants gives more freedom of expression I think. Restricting a field to enumerated values (defined using enumerated type constants in the schema) will be the schema's job.
This might need some tweaking once there's an actual schema format, but for now I think it gets the idea across.
This will do nicely.
(The reply above was sent by email, in response to Karl Stenerud's comment of Mon, Oct 26, 2020, 8:47 AM.)