message-format-wg
Are regexes good enough to validate literals?
A follow-up to #368.
The current draft of the registry uses named regex patterns to allow defining rules for validating literal arguments and option values.
<pattern id="positiveInteger" regex="[0-9]+"/>
And then:
<option name="minimumIntegerDigits" pattern="positiveInteger"/>
- Are regexes enough for this task?
- Should patterns be defined inline rather than referenced by id?
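To make the by-id option concrete, here is a minimal sketch of how an implementation might resolve a named pattern and apply it to an option literal. The two elements are taken from the example above; the `<registry>` wrapper and the helper names `load_validators` and `is_valid_option` are assumptions made for illustration, not part of the draft.

```python
import re
import xml.etree.ElementTree as ET

# The <pattern> and <option> elements come from the issue text; the
# <registry> wrapper is an assumption so that the fragment parses.
REGISTRY_XML = """
<registry>
  <pattern id="positiveInteger" regex="[0-9]+"/>
  <option name="minimumIntegerDigits" pattern="positiveInteger"/>
</registry>
"""

def load_validators(xml_text):
    """Map each option name to the compiled regex its pattern id refers to."""
    root = ET.fromstring(xml_text)
    patterns = {p.get("id"): re.compile(p.get("regex")) for p in root.iter("pattern")}
    return {o.get("name"): patterns[o.get("pattern")] for o in root.iter("option")}

def is_valid_option(validators, name, literal):
    """Return True if the whole literal matches the option's pattern."""
    return validators[name].fullmatch(literal) is not None

validators = load_validators(REGISTRY_XML)
print(is_valid_option(validators, "minimumIntegerDigits", "2"))    # True
print(is_valid_option(validators, "minimumIntegerDigits", "two"))  # False
```

Referencing by id lets several options share one rule; inlining would put the regex directly on `<option>` at the cost of repetition.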
Whether regexes are enough depends on the type language: what is the set of possible types that can appear in function signatures? If the type language includes numeric types, finite enumerations, and strings, then regexes are enough. If the set of possible types includes nested lists, for example, then regexes aren't enough to validate arguments.
I think regex pattern values are sufficient, given that they complement the explicit values list. As for their inlining vs. referentiality, I think I'd need to see what e.g. the JS Intl set of formatters would look like as a registry in order to really say.
However, I do have two other related thoughts on this:
- Right now, the pattern attribute of the <input> and <match> elements is an NMTOKEN, while <option> uses IDREF. Presumably they should all be IDREF values?
- We should ensure that there's a way to refer to an external source for match values. Specifically for plurals, it should be possible to refer to the CLDR supplemental/plurals.xml (https://github.com/unicode-org/cldr/blob/main/common/supplemental/plurals.xml) for locale-specific values.
Regexes are not sufficient for validity of all data types. Example: valid locale identifiers.
Well-formed locale IDs can be verified by a (horrendous) regex, but not valid ones.
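A rough sketch of the well-formed versus valid distinction follows. The pattern is deliberately simplified to language plus optional region (the real locale identifier grammar is much larger), and the sets `KNOWN_LANGUAGES` and `KNOWN_REGIONS` are hypothetical stand-ins for registry data that a regex cannot encode.

```python
import re

# Simplified assumption: language[-region] only, not the full grammar.
WELL_FORMED = re.compile(r"[a-z]{2,3}(-[A-Z]{2})?")

# Hypothetical stand-ins for data that lives outside any regex.
KNOWN_LANGUAGES = {"en", "fi", "de"}
KNOWN_REGIONS = {"US", "FI", "DE"}

def is_well_formed(tag):
    """Syntactic check: does the tag have the right shape?"""
    return WELL_FORMED.fullmatch(tag) is not None

def is_valid(tag):
    """Validity check: shape plus membership in the known subtag sets."""
    if not is_well_formed(tag):
        return False
    language, _, region = tag.partition("-")
    return language in KNOWN_LANGUAGES and (region == "" or region in KNOWN_REGIONS)

print(is_well_formed("xx-ZZ"))  # True: right shape
print(is_valid("xx-ZZ"))        # False: subtags are not in the known sets
print(is_valid("en-US"))        # True
```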
The specific question for LDML45 release is: do we need to define something other than regexes in order to deliver the default registry? If so, what?
I think it is ok to just have regexes, if we document that the regex match is necessary for well-formedness, but not sufficient. And definitely not for validity.
The unresolved tension here is that MF2.0 turned out to be untyped at the specification level. Implementations in strongly typed languages will want to make use of typing. Weakly or untyped implementations might want to use serializations common to their runtime for objects such as dates, numbers, etc. which are idiosyncratic and don't match the regex we provide in the default registry. A regex can really only describe a string serialization. What we do not want is for, say, :datetime to require all date and time values to be provided as some flavor of ISO8601/SEDATE/etc. (see this draft, https://w3c.github.io/timezone, for details). In (let's say) Java, we want to accept Date, Temporal, Calendar (and maybe long) for this function.
I completely agree. The regex can only constrain values if and when datatypes are serialized.
In addition, an implementation of MF2.0 must be allowed to convert the string format of a message into an equivalent internal structure that replaces values by native datatypes.
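One way to picture this split is a function that takes native values as they are and consults a regex only when the operand arrives as a string. This is a minimal sketch under assumptions: `resolve_datetime_operand` is a hypothetical helper, and `DATE_LITERAL` is a simplified date-only pattern rather than the registry's actual rule.

```python
import re
from datetime import date, datetime

# Simplified date-only literal pattern; an assumption for illustration.
DATE_LITERAL = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def resolve_datetime_operand(operand):
    """Hypothetical operand resolution for a :datetime-like function."""
    # Native types bypass the regex entirely. datetime is checked first
    # because it is a subclass of date.
    if isinstance(operand, datetime):
        return operand
    if isinstance(operand, date):
        return datetime(operand.year, operand.month, operand.day)
    # Only string serializations are constrained by the pattern.
    if isinstance(operand, str):
        if DATE_LITERAL.fullmatch(operand) is None:
            raise ValueError(f"not a recognized date literal: {operand!r}")
        return datetime.strptime(operand, "%Y-%m-%d")
    raise TypeError(f"unsupported operand type: {type(operand).__name__}")

print(resolve_datetime_operand(date(2024, 2, 17)))  # native value
print(resolve_datetime_operand("2024-02-17"))       # string literal
```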
At least in my mind, the regexes are intended to validate literal values that are already part of the placeholder, not input arguments.
Things like ... {|1234.56| :number} ... and ... {|2023-12-30T21:37| :datetime ...} ...
I think this is being addressed in the default registry work.
A regex is insufficient to describe the implementation-defined input types, so we use text for that.
What a regex is sufficient for is defining the literal values that can be used in lieu of the input type. Functions are not required to accept a literal value for an operand or option value, but if they do, they must define what the literal's format is. Since this has to be a sequence of characters, a regex can do the job. What's more, a regex is extremely useful in ensuring interoperability between platforms in terms of the literal syntax.
For example, in LDML45, we accept (among other things) the XMLSchema date syntax as a literal for date/time values. This means that the following expression is valid and has the same interpretation in every MF2 implementation, even though the implementation of the :datetime function is wholly different:
{|2024-02-17| :datetime}
If this is what we mean, let's answer the question this issue poses as "yes" and close this issue.
Okay @stasm?
As long as the regex is an 'outer bound'. That is, the function won't accept any literal that doesn't match the regex, but doesn't have to accept everything that does match.
@macchiati noted:
As long as the regex is an 'outer bound'. That is, the function won't accept any literal that doesn't match the regex, but doesn't have to accept everything that does match.
That's right. For example, this is "valid" but probably doesn't work for multiple reasons: {|2024-02-35| :date}
The problem is that the current wording in registry.md does not at all make that clear. If anything, the wording (4 sentences with "regex", below) makes it seem that all and only the matches to regex are valid. Until that wording is fixed, I think we should leave this issue open.
Named <validationRule> elements can optionally define regex validation rules for literals, option values, and variant keys.
...
<validationRule id="anyNumber" regex="-?[0-9]+(.[0-9]+)"/>
<validationRule id="positiveInteger" regex="[0-9]+"/>
<validationRule id="currencyCode" regex="[A-Z]{3}"/>
For example, this is "valid" but probably doesn't work for multiple reasons:
{|2024-02-35| :date}
I think you meant something like {|2024-02-31| :date}. The former is not syntactically valid, because dayFrag is constrained to 01 through 31.
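The difference between the two literals can be sketched as a two-stage check, with the regex acting as the outer bound and the function applying its own stricter test. `XSD_DATE` is a simplified approximation of the XML Schema date pattern (four-digit years, no timezone), and `check_date_literal` is a hypothetical helper, not anything defined by the registry.

```python
import re
from datetime import date

# Simplified approximation of the XML Schema date lexical form:
# four-digit year, month 01-12, day 01-31, no timezone. An assumption.
XSD_DATE = re.compile(r"[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])")

def check_date_literal(literal):
    """Classify a literal by which stage, if any, rejects it."""
    # Stage 1: the regex is an outer bound on what may be accepted.
    if XSD_DATE.fullmatch(literal) is None:
        return "rejected by regex"
    # Stage 2: the function may still refuse a literal that matched.
    try:
        date.fromisoformat(literal)
    except ValueError:
        return "rejected by function"
    return "accepted"

print(check_date_literal("2024-02-35"))  # rejected by regex
print(check_date_literal("2024-02-31"))  # rejected by function
print(check_date_literal("2024-02-17"))  # accepted
```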
Moved to v46