
`:unit` needs to support data slicing

Open sffc opened this issue 5 months ago • 4 comments

Currently, :unit suffers from many of the same problems as, for example, u:locale, explained in #1006.

Since the unit being formatted is provided by the Measure input type (which I think is a good decision), the message formatter cannot know ahead of time which units are being formatted, so it needs to pessimistically link the data for all possible units. Unit display names are among the largest chunks of data, so requiring this behavior is a non-starter for ICU4X.

A few ways to mitigate this:

  1. Add the quantity to the unit function, such as :unit:length, as suggested by @eemeli. This would allow ICU4X to link only a smaller subset of units, which should be generally small enough to be manageable.
  2. Require the quantity or the unit to be specified in the message. For example, :unit unit=meter with a Number input is OK, as is :unit quantity=length with a Measure unit.
  3. Require a unit to be specified in the message, but allow the unit to be overridden by Measure so long as the quantity matches. For example, :unit unit=meter can be formatted with a Measure carrying foot but not a Measure carrying gallon.
  4. Require the context to be specified in the message, and tweak how context works. :unit context=length would not do conversion, but conversion would happen with :unit context=length-road.
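As a rough illustration of option 3, the check below accepts a Measure only when its unit shares a quantity with the unit named in the message. The `QUANTITY_OF` table and the `can_override` helper are hypothetical names invented for this sketch; a real implementation would derive the mapping from CLDR data.

```python
# Hypothetical unit-to-quantity table (an assumption for illustration only);
# a real implementation would derive this from CLDR unit data.
QUANTITY_OF = {
    "meter": "length",
    "foot": "length",
    "mile": "length",
    "liter": "volume",
    "gallon": "volume",
}


def can_override(message_unit: str, measure_unit: str) -> bool:
    """Return True when the Measure's unit has the same quantity as the message's unit."""
    expected = QUANTITY_OF.get(message_unit)
    # Unknown units are rejected rather than guessed at.
    return expected is not None and QUANTITY_OF.get(measure_unit) == expected
```

With this table, `can_override("meter", "foot")` is true and `can_override("meter", "gallon")` is false, matching the example in option 3.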

CC @macchiati @manishearth

sffc avatar Jun 09 '25 18:06 sffc

Responding to the following comment from @macchiati

What might work is for :unit to only require a small subset of unit ids to be supported. Then ICU4X and other memory-limited implementations could choose to only support the required set.

I don't think this is a good long-term solution. In ECMA-402, we struggled to come up with a subset list of units, and we landed on about 40, but users keep coming to us asking for more units. I would like to not repeat this in MF2.

sffc avatar Jun 09 '25 18:06 sffc

I'm a bit puzzled, because each of your solutions requires the message to be parsed by something in order to find out the quantity. But if the parser sees :unit unit=meter, then it knows (with a small table of information) that the quantity is length. So you can load the units with quantity=length.

[More precisely, in the general case it needs to load the units that are convertible to meter, which also includes inverse quantities. This happens, for example, with mph and liters-per-100-kilometers.]

Why would we need to force message writers to supply that information?
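A minimal sketch of that idea, assuming a small hypothetical `QUANTITY_OF` table and a regex scan rather than a real MF2 parser: it finds `:unit` expressions with a literal `unit` option and returns every unit sharing that quantity. It ignores inverse quantities, and it returns nothing for a bare `{$x :unit}`, which is the case the issue is about.

```python
import re

# Hypothetical table for illustration; real data would come from CLDR.
QUANTITY_OF = {
    "meter": "length",
    "foot": "length",
    "mile": "length",
    "liter": "volume",
    "gallon": "volume",
}

# Simplified patterns; this is not a conforming MF2 parser.
UNIT_EXPR = re.compile(r"\{[^{}]*:unit\b([^{}]*)\}")
OPTION = re.compile(r"([\w-]+)=([\w-]+)")


def units_to_load(message: str) -> set[str]:
    """Collect the units whose data a message statically requires."""
    needed: set[str] = set()
    for match in UNIT_EXPR.finditer(message):
        options = dict(OPTION.findall(match.group(1)))
        quantity = QUANTITY_OF.get(options.get("unit", ""))
        if quantity is None:
            # No literal unit, or an unknown one: nothing can be deduced here.
            continue
        needed |= {u for u, q in QUANTITY_OF.items() if q == quantity}
    return needed
```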

macchiati avatar Jun 09 '25 19:06 macchiati

:unit unit=meter is fine. Where it breaks is if the unit is specified by a Measure input (which is a good choice, but it makes data slicing harder).

sffc avatar Jun 09 '25 19:06 sffc

:unit unit=inputUnit would also be at issue. Perhaps something like an optional convertible=meter,… would allow the message writer to supply the information you need. I could certainly see something like that in the localization chain as a requirement on localization vendors for messages for a given project, say, a watch UI.

macchiati avatar Jun 09 '25 20:06 macchiati

@sffc Could you provide a link to some reference for the (proposed?) ICU4X slices for unit data? I can't find one, and it looks like at least in CLDR "percent" has a full name of concentr-percent, and is not under any "portion" category, as was mentioned on our 2025-06-23 call.

eemeli avatar Jul 06 '25 13:07 eemeli

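@eemeli I think the groupings are sometimes called 'types'. TR35 describes/demonstrates them here (https://www.unicode.org/reports/tr35/tr35-general.html#Example_Units) with a data file here (https://github.com/unicode-org/cldr/blob/main/common/validity/unit.xml). The prefix on each unit appears to be its type. Note that the data in the XML file is not sorted, so concentr- makes a reappearance further down in the file.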


I think percent is interesting because it's relatively common, has historically been a formatting "special case", and requires fairly minimal data. Unit formatting that considers the data slicing problem is more complicated, since at least some implementations will want to use values that include measures (this being a Good Thing), introducing conversion/agreement issues. Compound units are also an issue (furlongs-per-fortnight). Maybe we should special-case percents in order to do a better job with units?

aphillips avatar Jul 06 '25 16:07 aphillips

@aphillips My question is in fact arising from the disparity between what's in CLDR, and the "portion" name that was mentioned to be one of the ICU4X slices two weeks ago. I'm looking to find out if that was a mistake or misunderstanding, or if there is a novel ICU4X split of the unit space.

Either way, I'm starting to think that we probably ought to not split :unit up into multiple functions, but to split using a type option, so we'd allow for e.g.

{$x :unit type=volume}

to format any volume units. This would mean that at least one of type or unit options would need to always be set.

eemeli avatar Jul 06 '25 17:07 eemeli

Please do not use the type. It is outmoded, and replaced by the quantity. However, the quantity is not part of the identifier.

Use instead just the short ID.

macchiati avatar Jul 06 '25 22:07 macchiati

@macchiati The use case we're trying to solve for here is messages for which the units are not known until formatting time; for example something like:

The destination is {$dist :unit type=length} away.

The above is localizable without reference to whether the units (carried here within $dist) are meters, miles, or solar radii.

The case being made by @sffc is that we need to include at least some option with a literal value in the expression to reduce the amount of data ICU4X needs to link or load, so that it does not need to account for $dist carrying any possible unit. Optimally, we'd also like to help the translator of the message have a better idea of what the placeholder will be replaced by.

If we should not use type as that option, what's the alternative that you propose?

eemeli avatar Jul 07 '25 08:07 eemeli

One can use the quantity or the base unit.

macchiati avatar Jul 07 '25 14:07 macchiati

Ah! The context I was missing was that "quantity" means a specific thing here, and that it's not e.g. an attribute of a Measure object when one is used as an operand. This also seems to be where the ICU4X "portion" is coming from.

FYI, the LDML section on Unit Identifiers only mentions "type", and not "quantity" at all. To an uninitiated reader, it's not really clear that the unit categorization presented in LDML Part 1 is deprecated.

eemeli avatar Jul 08 '25 06:07 eemeli

The full list of quantities is found in units.xml.

ICU4X unit formatting is very much under construction, but you can follow along at https://github.com/unicode-org/icu4x/issues/6161

We know that percent is in the portion quantity because

        <unitQuantity baseUnit='part' quantity='portion' status='simple'/>

defines the quantity and

        <convertUnit source='percent' baseUnit='part' factor='1/100' systems="metric_adjacent ussystem uksystem"/>

maps the percent unit to the base unit of the quantity.
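The two-step lookup can be sketched as follows, using only the two elements quoted above. The `<root>` wrapper and the `quantity_of` helper are assumptions made for this example; the real units.xml nests these elements differently and contains many more entries.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# The two elements quoted above, wrapped in a hypothetical <root> element
# so that the fragment parses on its own.
FRAGMENT = """
<root>
  <unitQuantity baseUnit='part' quantity='portion' status='simple'/>
  <convertUnit source='percent' baseUnit='part' factor='1/100' systems="metric_adjacent ussystem uksystem"/>
</root>
"""


def quantity_of(unit: str, xml_text: str = FRAGMENT) -> Optional[str]:
    """Resolve unit -> base unit -> quantity; return None if unknown."""
    root = ET.fromstring(xml_text.strip())
    base_to_quantity = {
        e.get("baseUnit"): e.get("quantity") for e in root.iter("unitQuantity")
    }
    unit_to_base = {
        e.get("source"): e.get("baseUnit") for e in root.iter("convertUnit")
    }
    # A base unit has no convertUnit entry of its own, so fall back to the unit itself.
    base = unit_to_base.get(unit, unit)
    return base_to_quantity.get(base)
```

Here `quantity_of("percent")` resolves through `part` to `portion`, while a unit absent from the fragment yields `None`.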

sffc avatar Jul 08 '25 07:07 sffc

In the ICU4X TC, we've been working in the direction of slicing units by their quantity because:

  1. The number of quantities is much smaller than the number of units, so it is easier for us to handle in code
  2. A quantity-based formatter can link data for all units that are mutually convertible, making unit preferences a natural extension

We've discussed other ways of slicing data, such as coarse "core" and "extended" buckets, or slicing by every individual unit. The quantity-based ontology isn't finalized yet but it seems like the most promising route.

sffc avatar Jul 08 '25 07:07 sffc

Let's look at these four ways to use :unit in terms of inputs:

  1. The message specifies the unit, and the input is Number-like
  2. The input is Measure-like, and it should be formatted directly without conversion
  3. The input is Measure-like, and it should be converted to an explicit unit in the message before formatting
  4. The input is Measure-like, and it should be converted to a locale preference unit before formatting

Cases 1 and 3 are not a problem for data loading since the output unit is specified in the message. I also believe that they are indistinguishable in the message syntax. Case 4 is tractable with ICU4X's proposed quantity-based slicing, so long as the unit context is associated with a quantity. So it is really only case 2 that is challenging.

# Case 1: OK
{$unit :unit unit=meter}

# Case 2 alternative A: OK
{$unit :unit quantity=length}

# Case 2 alternative B: Bad; not enough information to deduce data loading
{$unit :unit}

# Case 3: OK for display names, but would be nice to know that we need to load conversion data
{$unit :unit unit=meter}

# Case 4 alternative A: OK
{$unit :unit context=temperature-weather}

# Case 4 alternative B: OK
{$unit :unit quantity=temperature context=weather}

# Case 4 alternative C: Bad; not enough information to deduce data loading. Could be temperature, barometric pressure, wind speed, …
{$unit :unit context=weather}
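The OK/Bad classification above can be sketched as a function over the literal options of a `:unit` expression. The `data_slice` name, the returned string format, and the assumption that a combined context has the shape `quantity-usage` (as in `temperature-weather`) are all hypothetical and only serve this illustration.

```python
from typing import Optional


def data_slice(options: dict[str, str]) -> Optional[str]:
    """Return the data slice implied by the literal options, or None if undeducible."""
    if "unit" in options:
        # Cases 1 and 3: the output unit is named in the message.
        return "unit:" + options["unit"]
    if "quantity" in options:
        # Case 2 alternative A and case 4 alternative B.
        return "quantity:" + options["quantity"]
    context = options.get("context", "")
    if "-" in context:
        # Case 4 alternative A, assuming the context is written as quantity-usage.
        return "quantity:" + context.split("-", 1)[0]
    # Case 2 alternative B and case 4 alternative C: not enough information.
    return None
```

An implementation could reject messages for which this returns `None`, or fall back to linking data for all units.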

sffc avatar Jul 08 '25 08:07 sffc

Based on unicode-org/icu4x#6535 and other docs linked above, it looks like ICU4X is using "category" as a synonym for the CLDR "quantity". I don't see "quantity" with this meaning anywhere in ICU4C or ICU4J (QuantityFormatter uses a different meaning of the word).

We should probably follow ICU4X naming here, as the meaning of their term is more implicitly clear.

eemeli avatar Jul 08 '25 08:07 eemeli