message-format-wg

Are regexes good enough to validate literals?

Open stasm opened this issue 1 year ago • 14 comments

A follow-up to #368.

The current draft of the registry uses named regex patterns to allow defining rules for validating literal arguments and option values.

<pattern id="positiveInteger" regex="[0-9]+"/>

And then:

<option name="minimumIntegerDigits" pattern="positiveInteger"/>

  • Are regexes enough for this task?
  • Should patterns be defined inline rather than referenced by id?
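
A minimal sketch of how an implementation might apply such a named pattern, assuming the regex has to match the whole literal (the draft snippet above does not state this). The `PATTERNS` and `OPTIONS` tables and the `is_valid_option_literal` helper are hypothetical names used only for illustration, not part of any registry API.

```python
import re

# Hypothetical in-memory view of the two registry elements shown above.
PATTERNS = {"positiveInteger": r"[0-9]+"}
OPTIONS = {"minimumIntegerDigits": "positiveInteger"}


def is_valid_option_literal(option: str, literal: str) -> bool:
    """Return True if the literal matches the named pattern for the option."""
    pattern_id = OPTIONS[option]
    # Assumption: the pattern is anchored, so it must match the entire literal.
    return re.fullmatch(PATTERNS[pattern_id], literal) is not None
```

Note that `[0-9]+` also matches `0` and `007`, so the pattern name promises more than the regex checks.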

stasm avatar Jul 03 '23 16:07 stasm

Whether regexes are enough depends on the type language: what is the set of possible types that can appear in function signatures? If the type language includes numeric types, finite enumerations, and strings, then regexes are enough. If the set of possible types includes nested lists, for example, then regexes aren't enough to validate arguments.

catamorphism avatar Jul 05 '23 02:07 catamorphism

I think regex pattern values are sufficient, given that they complement the explicit values list. As for their inlining vs. referentiality, I think I'd need to see what e.g. the JS Intl set of formatters would look like as a registry in order to really say.

However, I do have two other related thoughts on this:

  1. Right now, the pattern attribute of the <input> and <match> elements is an NMTOKEN, while <option> uses IDREF. Presumably they should all be IDREF values?
  2. We should ensure that there's a way to refer to an external source for match values. Specifically for plurals, it should be possible to refer to the CLDR supplemental/plurals.xml (https://github.com/unicode-org/cldr/blob/main/common/supplemental/plurals.xml) for locale-specific values.

eemeli avatar Jul 05 '23 11:07 eemeli

Regexes are not sufficient for validating all data types. Example: valid locale identifiers.

A (horrendous) regex can verify that a locale ID is well-formed, but not that it is valid.
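
To illustrate the distinction, here is a rough sketch in which well-formedness is a regex check and validity additionally needs a data lookup. The regex covers only a simplified `language[-REGION]` shape, not the full locale identifier grammar, and the subtag sets are toy stand-ins; all names are hypothetical.

```python
import re

# Assumption: a deliberately simplified shape (language[-REGION]),
# far short of the real locale identifier grammar.
WELL_FORMED = re.compile(r"[a-z]{2,3}(-[A-Z]{2})?")

# Toy stand-ins for a real subtag registry; illustrative data only.
KNOWN_LANGUAGES = {"en", "fi", "pl"}
KNOWN_REGIONS = {"US", "FI", "PL"}


def is_well_formed(tag: str) -> bool:
    """Shape check only: this is all a regex can express."""
    return WELL_FORMED.fullmatch(tag) is not None


def is_valid(tag: str) -> bool:
    """Shape check plus a lookup against known subtags."""
    if not is_well_formed(tag):
        return False
    language, _, region = tag.partition("-")
    if language not in KNOWN_LANGUAGES:
        return False
    return region == "" or region in KNOWN_REGIONS
```

In this sketch a tag such as `xq-QX` passes `is_well_formed` but fails `is_valid`, because no regex of reasonable size can encode the registry contents.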

macchiati avatar Jul 05 '23 19:07 macchiati

The specific question for LDML45 release is: do we need to define something other than regexes in order to deliver the default registry? If so, what?

aphillips avatar Jan 11 '24 16:01 aphillips

I think it is ok to just have regexes, if we document that the regex match is necessary for well-formedness, but not sufficient. And definitely not for validity.

macchiati avatar Jan 11 '24 20:01 macchiati

The unresolved tension here is that MF2.0 turned out to be untyped at the specification level. Implementations in strongly typed languages will want to make use of typing. Weakly typed or untyped implementations might want to use serializations common to their runtime for objects such as dates and numbers, which are idiosyncratic and don't match the regex we provide in the default registry. A regex can really only describe a string serialization. What we do not want is for, say, :datetime to require all date and time values to be provided as some flavor of ISO8601/SEDATE/etc. (see this draft, https://w3c.github.io/timezone, for details). In (let's say) Java, we want to accept Date, Temporal, Calendar (and maybe long) for this function.

aphillips avatar Jan 12 '24 01:01 aphillips

I completely agree. The regex can only constrain values if and when datatypes are serialized.

In addition, an implementation of MF2.0 must be allowed to convert the string format of a message into an equivalent internal structure that replaces values by native datatypes.

macchiati avatar Jan 15 '24 18:01 macchiati

At least in my mind, the regexes are intended to validate literal values that are already part of the placeholder, not input arguments. Things like ... {|1234.56| :number} ... and ... {|2023-12-30T21:37| :datetime ...} ...

mihnita avatar Jan 29 '24 18:01 mihnita

I think this is being addressed in the default registry work.

A regex is insufficient to describe the implementation-defined input types. So we use text for that.

What a regex is sufficient for is defining the literal values that can be used in lieu of the input type. Functions are not required to accept a literal value for an operand or option value, but if they do, they must define what the literal's format is. Since this has to be a sequence of characters, a regex can do the job. What's more, a regex is extremely useful in ensuring interoperability between platforms in terms of the literal syntax.

For example, in LDML45, we accept (among other things) the XMLSchema date syntax as a literal for date/time values. This means that the following expression is valid and has the same interpretation in every MF2 implementation, even though the implementations of the :datetime function are wholly different:

{|2024-02-17| :datetime}
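
A rough sketch of that split, assuming a Python implementation: native date values are accepted as they are, and the regex is consulted only when the operand arrives as a literal string. The `resolve_datetime_operand` helper is hypothetical, and `DATE_LITERAL` is a simplified stand-in for whatever pattern the registry actually defines.

```python
import re
from datetime import date, datetime

# Assumption: simplified date-only literal, not the registry's real pattern.
DATE_LITERAL = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def resolve_datetime_operand(operand):
    """Accept a native date/datetime, or a string literal in the agreed format."""
    if isinstance(operand, (date, datetime)):
        # Implementation-defined input type: the regex does not apply here.
        return operand
    if isinstance(operand, str) and DATE_LITERAL.fullmatch(operand):
        # Literal: the regex fixes the syntax; parsing can still reject it.
        return date.fromisoformat(operand)
    raise ValueError(f"unsupported :datetime operand: {operand!r}")
```

The same literal would then resolve to the same date in any implementation that follows the pattern, while each remains free to accept its own native types.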

If this is what we mean, let's answer the question this issue poses as "yes" and close this issue.

Okay @stasm?

aphillips avatar Feb 18 '24 19:02 aphillips

As long as the regex is an 'outer bound'. That is, the function won't accept any literal that doesn't match the regex, but doesn't have to accept everything that does match.

macchiati avatar Feb 18 '24 19:02 macchiati

@macchiati noted:

As long as the regex is an 'outer bound'. That is, the function won't accept any literal that doesn't match the regex, but doesn't have to accept everything that does match.

That's right. For example, this is "valid" but probably doesn't work for multiple reasons: {|2024-02-35| :date}

aphillips avatar Feb 18 '24 19:02 aphillips

The problem is that the current wording in registry.md does not at all make that clear. If anything, the wording (4 sentences with "regex", below) makes it seem that all and only the matches to the regex are valid. Until that wording is fixed, I think we should leave this issue open.

Named <validationRule> elements can optionally define regex validation rules for literals, option values, and variant keys. ...

<validationRule id="anyNumber" regex="-?[0-9]+(.[0-9]+)"/>
<validationRule id="positiveInteger" regex="[0-9]+"/>
<validationRule id="currencyCode" regex="[A-Z]{3}"/>

macchiati avatar Feb 18 '24 22:02 macchiati

For example, this is "valid" but probably doesn't work for multiple reasons: {|2024-02-35| :date}

I think you meant something like {|2024-02-31| :date}. The former is not syntactically valid, because dayFrag is constrained to 01 through 31.
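
The "outer bound" reading can be sketched as two separate checks: the regex decides whether a literal is syntactically acceptable, and the function may still reject it afterwards. The pattern below is a simplification (four-digit year, no time zone) rather than the full XML Schema date grammar, and the helper names are hypothetical.

```python
import re
from datetime import date

# Assumption: simplified date literal with monthFrag 01-12 and dayFrag 01-31.
DATE_LITERAL = re.compile(
    r"([0-9]{4})-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])"
)


def matches_pattern(literal: str) -> bool:
    """Outer bound: a literal that fails here is never accepted."""
    return DATE_LITERAL.fullmatch(literal) is not None


def is_accepted(literal: str) -> bool:
    """The function's own check, which is stricter than the regex."""
    match = DATE_LITERAL.fullmatch(literal)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # rejects dates such as February 31
    except ValueError:
        return False
    return True
```

Under these assumptions `2024-02-35` fails the pattern outright, while `2024-02-31` matches the pattern and is rejected only by the calendar check.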

gibson042 avatar Feb 19 '24 20:02 gibson042

Moved to v46

aphillips avatar Apr 13 '24 20:04 aphillips