message-format-wg
[FEEDBACK] The data model could be simplified for literals
This was originally posted as a part of #716, but is broken out here as its own issue.
While working on some Python code, I ended up needing to put together a pythonic representation of the message data model. While doing so, I encountered a few places where I could apply some simplifications to the data model with no loss of fidelity.
I think we should consider applying this change to how literals are represented in the data model:
 interface Variant {
-  keys: Array<Literal | CatchallKey>;
+  keys: Array<string | CatchallKey>;
   value: Pattern;
 }

 interface LiteralExpression {
   type: "expression";
-  arg: Literal;
+  arg: string;
   annotation?: FunctionAnnotation | UnsupportedAnnotation;
   attributes: Attribute[];
 }

 interface Attribute {
   name: string;
-  value?: Literal | VariableRef;
+  value?: string | VariableRef;
 }

 interface Option {
   name: string;
-  value: Literal | VariableRef;
+  value: string | VariableRef;
 }

-interface Literal { type: "literal", value: string }
In each of the positions available for a Literal, the value may instead be a VariableRef or a CatchallKey. As we explicitly do not consider the quoting of the literal as significant, we don't need the object wrapping for its string value, and so we should get rid of it.
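To make the proposed shape concrete, here is a minimal TypeScript sketch of how a consumer might tell the possibilities apart once literals are plain strings. The `VariableRef` and `CatchallKey` shapes are assumptions modelled loosely on the data model (a `type` discriminator plus a `name` or an optional `value`), and `describe` is a hypothetical helper, not part of any existing implementation.

```typescript
// Assumed shapes, loosely following the MF2 data model; not authoritative.
interface VariableRef { type: "variable"; name: string }
interface CatchallKey { type: "*"; value?: string }

// Under the proposal, a literal is just a string in each of these unions.
type Key = string | CatchallKey;
type Value = string | VariableRef;

// Hypothetical helper: narrows a key or value without a Literal wrapper.
function describe(node: Key | Value): string {
  if (typeof node === "string") return `literal:${node}`;
  if (node.type === "variable") return `variable:${node.name}`;
  return "catchall";
}
```

The `typeof` check plays the role that `type === "literal"` plays today; the remaining object shapes are still told apart by their `type` field.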
I'd prefer to keep explicit Literal nodes in the data model.
- They make it trivial to query for all literals in a message.
- They make it convenient to extend the canonical data model and add information about line numbers or character offsets, for debugging and error display purposes.
> They make it trivial to query for all literals in a message.
The ease of writing that query is unaffected by the data type. Having written a visitor for the data model, here's what it looks like with the current data model:
const literals = [];
visit(msg, {
  key(k) { if (k.type === 'literal') literals.push(k) },
  value(v) { if (v.type === 'literal') literals.push(v) }
});
And here's what it would look like if Literals were strings instead:
const literals = [];
visit(msg, {
  key(k) { if (typeof k === 'string') literals.push(k) },
  value(v) { if (typeof v === 'string') literals.push(v) }
});
The visitor offers a number of methods for various parts of the data model; key() is called for each variant key, while value() is called for each operand, option value, and attribute value.
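For readers without such a visitor at hand, the following is a rough TypeScript sketch of what a `visit()` with `key()` and `value()` callbacks could look like against the current data model. All shapes are assumptions and heavily simplified: options are placed directly on the expression, attributes and declarations are omitted, and only select messages are handled. It is not the visitor referred to above.

```typescript
// Assumed, heavily simplified shapes; the real data model has more node types.
interface VariableRef { type: "variable"; name: string }
interface CatchallKey { type: "*" }
interface Literal { type: "literal"; value: string }
type Key = Literal | CatchallKey;
type Value = Literal | VariableRef;
interface Option { name: string; value: Value }
interface Expression { type: "expression"; arg?: Value; options: Option[] }
interface Variant { keys: Key[]; value: Array<string | Expression> }
interface SelectMessage { variants: Variant[] }

interface Visitor {
  key?(k: Key): void;
  value?(v: Value): void;
}

// Hypothetical visit(): calls key() for each variant key and value() for
// each operand and option value (attributes are left out of this sketch).
function visit(msg: SelectMessage, visitor: Visitor): void {
  for (const variant of msg.variants) {
    for (const k of variant.keys) visitor.key?.(k);
    for (const part of variant.value) {
      if (typeof part === "string") continue; // pattern text, not a node
      if (part.arg !== undefined) visitor.value?.(part.arg);
      for (const opt of part.options) visitor.value?.(opt.value);
    }
  }
}

// Usage: collect every Literal node from a small hand-built message.
const msg: SelectMessage = {
  variants: [
    {
      keys: [{ type: "literal", value: "one" }],
      value: ["You have ", {
        type: "expression",
        arg: { type: "variable", name: "count" },
        options: [{ name: "style", value: { type: "literal", value: "short" } }],
      }],
    },
    {
      keys: [{ type: "*" }],
      value: [{ type: "expression", arg: { type: "literal", value: "n/a" }, options: [] }],
    },
  ],
};

const literals: Literal[] = [];
visit(msg, {
  key(k) { if (k.type === "literal") literals.push(k); },
  value(v) { if (v.type === "literal") literals.push(v); },
});
```

Switching the two callbacks to `typeof k === "string"` checks would be the only change needed if `Literal` were replaced by `string` in the type aliases.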
> They make it convenient to extend the canonical data model and add information about line numbers or character offsets, for debugging and error display purposes.
I too started from this presumption, but in practice, when writing a parser and later a language server, I found that a separate CST is effectively required to represent malformed contents. For example, the data model offers no easy slot in which to represent whitespace, syntax characters like =, or partial structures during live input.
A valid literal cannot by itself be in error; it can only be so in the context of its expression or variant. This means that no matter what the literal representation is, the context in which its error is determined is an object which may either directly provide for its positioning, or refer to a corresponding CST node through which the literal's source position may be determined.
In other words, similarly to the visitor example above, my own experience would indicate that having a separate object wrapper for literals does not actually increase the convenience of finding its source position when it actually matters.
visit(msg, { literal(l) { literals.push(l) } }) is shorter, and more easily reproduced in a typed implementation language.
Sure, it's possible for an alternative visitor to provide an even terser expression. But the general point seems to be that a wrapper around the literal value does not make this operation easier.
The difference between these two visitors is that one requires the user to know MF2's semantics (that literals can be keys, option values, or attribute values, the last of which are not used right now), while the other just refers to the inherent type of the node in the data model.
The latter approach seems to be more explicit, versatile, and agnostic.
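As an illustration of dispatching on the node's own type, here is a small TypeScript sketch of a generic walker. It assumes only that nodes are objects carrying a string `type` field; `visitByType` and its handler map are hypothetical names and do not come from any existing MF2 implementation.

```typescript
// Handlers keyed by node type; each receives the matching node object.
type Handlers = Record<string, (node: Record<string, unknown>) => void>;

// Hypothetical generic walker: dispatches purely on node.type, with no
// knowledge of which slots (keys, operands, option values) hold literals.
function visitByType(node: unknown, handlers: Handlers): void {
  if (Array.isArray(node)) {
    for (const child of node) visitByType(child, handlers);
  } else if (node !== null && typeof node === "object") {
    const obj = node as Record<string, unknown>;
    const type = obj["type"];
    if (typeof type === "string" &&
        Object.prototype.hasOwnProperty.call(handlers, type)) {
      handlers[type](obj);
    }
    for (const child of Object.values(obj)) visitByType(child, handlers);
  }
}

// Usage: collect literal values from a hand-built variant-like object.
const found: string[] = [];
visitByType(
  {
    keys: [{ type: "literal", value: "one" }, { type: "*" }],
    value: [{ type: "expression", arg: { type: "literal", value: "n/a" } }],
  },
  { literal(l) { found.push(String(l["value"])); } },
);
```

This style of dispatch depends on literals being wrapped in typed objects. With plain strings, a generic walker could not tell a literal from pattern text or a name without knowing which slot it came from.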