
Implement MessageFormat

Open sffc opened this issue 2 years ago • 25 comments

I couldn't find an issue to track the ICU4X implementation of MessageFormat, so I'm making one now.

We should implement MF2 syntax, of course, but we also will likely want to support MF1 syntax, at least as a conversion step.

@zbraniecki @mihnita

sffc avatar Jan 25 '23 21:01 sffc

Hello @sffc , is there a recent plan to implement this?

ghost avatar Jul 10 '24 06:07 ghost

@echeran is going to look into this in Q3.

sffc avatar Jul 10 '24 17:07 sffc

@sffc @echeran Thank you! Please keep me in the loop since this is the main feature we are looking for from your library!

ghost avatar Jul 12 '24 05:07 ghost

Btw I deleted my work account and merged it to my personal account, so now the author of my past comments is ghost. I'll follow up using this new account (we still care about this feature!), and you can reach out to me via chat, email, anything just like before.

jiyuntu avatar Oct 04 '24 03:10 jiyuntu

@echeran design doc: https://docs.google.com/document/d/1X5qiEK1swGYMblwbBcOu1HkiZUbL4yhnekVyzPPQo-o/edit?tab=t.0#heading=h.r76nofm7sof

WG discussion:

  • @sffc As a general framework, enums are smaller and faster than trait objects.
  • @zbraniecki I think the MF data model is a perfect example of the architecture I've been pursuing in ICU4X for a long time, which is that you have a reference AST and a runtime AST, and those ASTs are different. I think we're in a position to add a DSL for MF tooling in the future. You're instead looking at a data model for the runtime execution of MessageFormat. But, neither of these examples are optimal for "serialize, deserialize" which is what tools want to do. So "format" and "edit" are different use cases. But, there is likely some overlap between the two data models.
  • @zbraniecki Have you looked at fluent-rs?
  • @echeran Not completely
  • @zbraniecki Ok, because fluent-rs is an implementation of a syntax similar to MF2, including a serializer, etc.

https://github.com/projectfluent/fluent-rs/blob/main/fluent-syntax/src/ast/mod.rs

https://github.com/projectfluent/fluent-rs/tree/main/fluent-syntax/src/parser

https://github.com/projectfluent/fluent-rs/blob/main/fluent-syntax/src/serializer.rs

https://github.com/projectfluent/fluent-rs/tree/main/fluent-bundle/src/resolver

  • @zbraniecki We spent multiple years with Manish and Stas making sure this is performant. So I think we should look at something like that. What you might find is that Manish designed a model in it that accommodates what I think you will encounter in MF2: a customer may want to add custom functions.
  • @Manishearth About enums vs traits: enums are generally better, but traits let people implement their own thing. So we could potentially make an enum where one variant is a trait object; you don't hit the branch unless you are actually using this (see the sketch after this list). And it's best with a trait object if you can borrow and avoid the destructor (dyn Drop). We could even feature-gate.
  • @hsivonen We should probably invite Eemeli to review this. Do you have an idea of when this would be in a stage for Eemeli to review this?
  • @echeran Maybe another week so I can think about APIs for it.
  • @zbraniecki If you want to look at the lessons learned from writing the Fluent parser and resolver, so that it handles both roles of reference and runtime, the trick Manish came up with is a slice: you generalize everything over a slice.

https://github.com/projectfluent/fluent-rs/blob/main/fluent-syntax/src/parser/slice.rs

Here's how we do customer attached functions - https://github.com/projectfluent/fluent-rs/blob/main/fluent-bundle/src/entry.rs#L13

  • @Manishearth It's been a while since I worked on this, like 7 years. We could apply some of these principles, but things have changed and we might tailor the choices to what works best for ICU4X.
  • @zbraniecki I agree
  • @zbraniecki My main point is that I think you should design this so that it works well for the tooling ecosystem.
  • @echeran It makes sense that in Java, you'd use interfaces with dynamic dispatch, but that's not what Rust is optimized for. And the line of "is the user customizing this or not" seems good.
  • @echeran As far as the Fluent way of doing things: I don't know if we need an enum variant for custom items in general; we need them for functions. It seems you don't need the enum if you have the trait.
  • @echeran I'm focused on the runtime part of it. Manipulating the AST is nice to have. I have questions about that, which veers off the main focus.
  • @Manishearth Functions are the main point of customizability. I don't think AST manipulation is that good... the performance... the ease of working with enums in Rust ends up better here. About the functions, you could keep a global function ID in the enum and avoid the dynamic dispatch / trait object entirely.
  • @sffc There are a few ways we can do custom functions. All trait, enum with trait, global registry, function pointer. We should continue to iterate on that.
  • @zbraniecki I think it's worth looking at how Fluent solved custom functions. It is a good starting point. By my recollection, we spent the most time going back and forth on slices. There are questions about how to represent escaped Unicode sequences.
  • @sffc It may be worth exploring whether we can flatten the attributes and options into the main vec of parts. For example, pub enum Expression can get Option and Attribute variants that mean, "add this option or attribute to the previous function". Then we can use ZeroVec<Part>, and Vec<Part> has a very cheap Drop impl. These are concrete implementation benefits, but they should be weighed against the tradeoffs.
  • @zbraniecki I don't think we should complicate the model or be non-Rusty.
  • @sffc I push back on that. I don't think my flattened model is non-Rusty. It diverges from MF 2.0, so it is not useful as a reference AST.
  • @zbraniecki I think the trifecta is functionality, maintenance, and performance. We can make performance optimizations and measure them with the great tooling we have in Rust.
  • @sffc We have a lot of experience doing measurements over the last 5 years and we know the things that have the biggest impacts on performance. Those things include avoiding trait objects, avoiding drop impls, avoiding branching, avoiding runtime allocations. So I think we should just go straight to the solution rather than spending engineering cycles measuring every little thing.
  • @zbraniecki I disagree with that because I think the performance characteristics of MF are different than the performance characteristics of e.g. datetime data.
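As a concrete illustration of the "enum with a trait-object variant" idea mentioned above, here is a minimal, hypothetical sketch; the names CustomFunction and FunctionRef are invented for illustration and are not ICU4X API:

trait CustomFunction {
    fn format(&self, operand: &str) -> String;
}

enum FunctionRef<'a> {
    // Built-in functions: matched with a plain branch, no indirection, no Drop.
    Number,
    String,
    // Custom functions: a borrowed trait object, so dropping the enum still
    // runs no destructor; this branch is only reached if a message actually
    // uses a custom function.
    Custom(&'a dyn CustomFunction),
}

fn call(func: &FunctionRef<'_>, operand: &str) -> String {
    match func {
        FunctionRef::Number => format!("<number {operand}>"),
        FunctionRef::String => operand.to_string(),
        FunctionRef::Custom(f) => f.format(operand),
    }
}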

sffc avatar Dec 05 '24 20:12 sffc

I didn't say much during the actual discussion and had to drop at the end of the meeting, but on the topic of the AST structure in broad strokes I lean towards Zibi's position: the performance characteristics of MF are going to be different from things like datetime, and attribute-like things are not a thing we've had to deal with in datetime, we have lengths and those are stored with the fields.

I do not think it is a given that flattening attributes will be a win here, because it will complicate everything else. It's less about Rustiness and more just about what is most efficient for processing, and I'm not convinced that a flattened model will be efficient.

That said, flattening attributes simplifies a lot of things about writing easy zerovec impls.

Those things include avoiding trait objects, avoiding drop impls, avoiding branching, avoiding runtime allocations.

A non-flattened model does not imply any of these things IMO. Zerovec is powerful enough to support both a flattened and a non-flattened data model. The non-flattened one is a bit harder to do but possible.
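For context on what "arbitrarily nested ZeroVec structures still only have a single Drop call" can look like, here is a purely illustrative sketch (not a proposed ICU4X type), assuming zerovec's VarZeroVec and VarZeroSlice:

use zerovec::{VarZeroSlice, VarZeroVec};

// Illustration only: each message is a variable-length list of string parts,
// and the resource is a variable-length list of messages. Despite the logical
// nesting, everything lives in one backing buffer, so dropping the structure
// is at most one deallocation (and no work at all if the buffer is borrowed).
pub struct MessageResource<'data> {
    pub messages: VarZeroVec<'data, VarZeroSlice<str>>,
}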

Manishearth avatar Dec 06 '24 00:12 Manishearth

Flattening produces substantially smaller and faster Drop impls. With nesting, you have to iterate over the whole message every time you Drop. I've seen this before in other components.

This principle applies as much to MF as to any other component. I'm especially concerned about MF code size because it can bloat quickly.

It also makes ZeroVec impls easier, which may not be as applicable to MF.

sffc avatar Dec 06 '24 04:12 sffc

Evidence for my claim:

Program 1:

#![crate_type = "cdylib"]

#[no_mangle]
pub fn drop_vec_of_usize(v: Vec<usize>) -> usize {
    std::hint::black_box(v).capacity()
}

ASM: 18 instructions. No loop or branching except for a single je that checks if the vec is empty.

Program 2:

#![crate_type = "cdylib"]

use smallvec::SmallVec;

#[no_mangle]
pub fn drop_vec_of_smallvec(v: Vec<SmallVec<[u8; 4]>>) -> usize {
    std::hint::black_box(v).capacity()
}

ASM: 44 instructions. Includes a loop over every element. Lots of branching with je.

https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=33c414ac172d1a4826d249a73ea9899f

sffc avatar Dec 06 '24 07:12 sffc

That example seems irrelevant? I'm not debating the performance characteristics of nested Drop.

Flat vs non flat can both be done in a pure ZeroVec world (which is desired, yes?), and arbitrarily nested ZeroVec structures still only have a single Drop call.

Manishearth avatar Dec 06 '24 15:12 Manishearth

I'm assuming that most messages come in at runtime and will be parsed at runtime via Message::try_from_str, and I am coming into this with the frame of mind that constructing a VZV at runtime is a non-starter for normal usage of ICU4X.

That said, it would certainly be nice to have a way to make them zero-copy for clients who want that behavior. Making a message backed by a ZeroVec, which is cheaply runtime-constructed, seems like a reasonable way to do that.

sffc avatar Dec 06 '24 18:12 sffc

I also think there may be a difference in values.

What I see as goals and non-goals:

  • Goal: Produce the fastest MessageFormat 2.0 parser and formatter on the planet.
  • Goal: Produce the smallest MessageFormat 2.0 parser and formatter on the planet.
  • Goal: Behavior is fully conformant with the spec and passes all tests.
  • Nice-to-have: Support statically constructible messages.
  • Nice-to-have: Support use cases other than parsing and formatting.
  • Non-goal: Align the ICU4X implementation with the ICU4C or ICU4J implementation (though they can of course be used as inspiration).
  • Non-goal: Be the first to market. (This was initially a goal of mine, but ICU4C already beat us.)
  • Non-goal: Code avoids making assumptions that could change if the spec changes. (What I mean is, we should be able to do things like "assume there are only X possible expression types" and "assume variables cannot be re-defined" and so forth, if it helps us write faster or smaller code. If the spec does change, we may need to invest additional engineering cycles to adapt to those changes, but that's why we have a supposedly stable spec.)

sffc avatar Dec 06 '24 19:12 sffc

@sffc Ah, you are talking about a runtime parseable AST, I was under the impression you were talking about the data model for stored messages since you mentioned ZeroVec. We'd probably want both types eventually?

For runtime parsing there's a lot more in this design space available to us to hit the goals you mentioned, and the AST doesn't need to be flattened for that. For example Attributes could be interned into a hash set[^1], making the actual attribute type Copy, though dependent on a context object to be read. Could even use a precompiled perfectly hashed set of expected strings as supported by crates like string_cache.
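A rough sketch of the kind of interning described here, using a hypothetical hand-rolled interner rather than any particular crate: the AST stores a Copy AttrId, and the context object that owns the interner resolves it back to a string during resolution.

use std::collections::HashMap;

/// A Copy handle into an interner owned by the parsing/formatting context.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct AttrId(u32);

#[derive(Default)]
struct AttrInterner {
    ids: HashMap<String, AttrId>,
    strings: Vec<String>,
}

impl AttrInterner {
    /// Returns the existing id for `s`, or allocates a new one.
    fn intern(&mut self, s: &str) -> AttrId {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = AttrId(self.strings.len() as u32);
        self.strings.push(s.to_owned());
        self.ids.insert(s.to_owned(), id);
        id
    }

    /// Looks the id back up; only valid with the interner it came from.
    fn resolve(&self, id: AttrId) -> &str {
        &self.strings[id.0 as usize]
    }
}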

During yesterday's meeting I mentioned interning strategies as being likely to bring great wins for parsing performance and memory usage, I do think that's going to be the right route for the handful of Droppy things scattered throughout.

Moving Vec<Attribute> out and flattening doesn't make expressions Copy since they have other non-Copy fields. If we want the benefits of Copy vectors we're going to have to intern the names anyway, and if we're paying the complexity cost for that, interning more things is less of an issue. The evidence you supplied shows what happens when you take a nested vector and replace it with a vector of Copy types: something will need to be done here to make the vector element type Copy and handling attributes is only a small part of it. Flattening attributes doesn't even make it Copy since those attributes still have Drop!

So if we want to have performance discussions about flattening we should first align on what the rest of the type looks like.

I absolutely think that we should use what we've learned from ICU4X when it comes to Rust performance, but I'm not convinced that the techniques we've settled on for ICU4X data are the right ones for a runtime AST: flattening makes a lot of sense for datetime patterns for a whole bunch of reasons that are not applicable here.

[^1]: The trashmap crate which I maintain provides a quick way to do this, though there are a bunch of different ways to do this well.

Manishearth avatar Dec 06 '24 23:12 Manishearth

@echeran Can you help me understand which of these is expected to have the same value show up in the code multiple times, and which of these are not? Trying to understand the feasibility of interning strategies:

  • Literal strings (I imagine these are typically unique)
  • Variable references (probably normal to show up multiple times)
  • Attribute names (probably normal to show up multiple times, and may draw from a small, predefined set)
  • Attribute values (often going to be from a small, predefined set?)
  • Expression names (probably from a small set?)
  • Attribute lists (I'm not sure about this one: would one have the same attribute list show up multiple times in code or would it typically be differing values)?

Manishearth avatar Dec 06 '24 23:12 Manishearth

Moving Vec<Attribute> out and flattening doesn't make expressions Copy since they have other non-Copy fields.

No? Which fields? I'm assuming that all strings, including variable and attribute names, are borrowed from the message.

The rest of your post doesn't make sense to me because I don't understand why the Part can't be Copy if we eliminate the nested Vec.

interning strategies

Do you mean, like, storing attributes in a different field of the Message and pointing to them from the main message parts vec? That might be ok. But I want to avoid global caches.
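A minimal sketch of that "attributes in a side table" shape, assuming all strings are borrowed from the message; the names (AttrRange, Attribute, Part, Message) are illustrative, not ICU4X API. Parts stay Copy because they reference attributes by index range instead of owning a Vec, so Vec<Part> only has a trivial buffer Drop.

#[derive(Clone, Copy)]
struct AttrRange {
    start: u32,
    len: u32,
}

#[derive(Clone, Copy)]
struct Attribute<'a> {
    name: &'a str,
    value: &'a str,
}

#[derive(Clone, Copy)]
enum Part<'a> {
    Literal(&'a str),
    Variable { name: &'a str, attrs: AttrRange },
    Function { name: &'a str, attrs: AttrRange },
}

struct Message<'a> {
    parts: Vec<Part<'a>>,
    // Side table: parts point into this via AttrRange.
    attributes: Vec<Attribute<'a>>,
}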

sffc avatar Dec 07 '24 00:12 sffc

I'm assuming that most messages come in at runtime and will be parsed at runtime via Message::try_from_str, and I am coming into this with the frame of mind that constructing a VZV at runtime is a non-starter for normal usage of ICU4X.

Most messages will come as variable length lists of messages that have to be parsed at runtime, so just to start you have a variable length list of messages, and each message has a variable length list of parts. Then you go deeper.

We can reason that a large portion of messages are going to be a list of messages that are single part, so we can optimize the structures for that, but we need to be able, in this list, to branch off to messages that have a variable number of variants, with a variable number of parts inside each variant, and a variable number of attributes in some parts of each variant of some messages.

I'd like to avoid premature optimizations until we have the basic model working and I'd like to suggest that we do evaluate each optimization against benchmarks of a simple "hello-world" application using MF2 to localize 10-20 messages.

I'm bringing it up in writing because I think that some of the assumptions built into Shane's position come from using "one message" as a starting point, which I think is not representative of how localization lists work.

zbraniecki avatar Dec 07 '24 01:12 zbraniecki

Do you mean, like, storing attributes in a different field of the Message and pointing to them from the main message parts vec? That might be ok. But I want to avoid global caches.

Potentially. There are a lot of different ways to intern things, not all of them involve pointers, you can also have indexes, or hashes.

It wouldn't be a global cache, it would be a context object the message either borrows from or is expected to be used with during resolution.

(The former requires the use of elsa or other append-only collections, the latter is a bit more brittle but has the benefit of not having weird borrowing behavior)

No? Which fields? I'm assuming that all strings, including variable and attribute names, are borrowed from the message.

I think that's one way of doing things, yes, and that makes Copy easier, assuming there are no escape sequences. It's not the only way to do things. There's also a cost to keeping the source text around that may need to be reckoned with for large databases of strings. There are a bunch of tradeoffs here.

As a data point most compilers tend to not keep the source text around and instead use string interning strategies for stuff like this.

Manishearth avatar Dec 07 '24 01:12 Manishearth

The rest of your post doesn't make sense to me because I don't understand why the Part can't be Copy if we eliminate the nested Vec.

Oh, also, any strategy that unconditionally borrows from the original string will have issues with escapes in literal strings. Which can be handled, but then you need to do some parsing at print time. A lot of the optimizations you can do on the AST will have performance implications downstream. It becomes a matter of if it's parse-once format-once, or parse-once format-many-times for any given message (and I don't fully understand the usage patterns we can expect).

Manishearth avatar Dec 07 '24 01:12 Manishearth

Fair point about escape sequences. It seems like it's not too terribly expensive to decode those at format time, although it's certainly not as nice as a memcpy for a chunk of string. I can see now why you were thinking about interning the strings.

sffc avatar Dec 09 '24 17:12 sffc

Here's a fluent-rs unicode escaping function that we can look to micro-optimize further.

zbraniecki avatar Dec 09 '24 18:12 zbraniecki

MessageFormat deep dive 2024-12-27

... (missed the first 10 minutes)

https://docs.google.com/document/d/1X5qiEK1swGYMblwbBcOu1HkiZUbL4yhnekVyzPPQo-o/edit?tab=t.0#heading=h.5rulnmr2njpb

Escape sequences

  • @zbraniecki You need to think about how to handle Unicode escape sequences
  • @Manishearth When you borrow a string, you borrow the string exactly as in the source code, including the escape sequences. If you parse and allocate a string, you generally resolve escapes. There are a few architectures: unconditionally borrowed that defers escaping, unconditionally parsed that front-loads escaping, Cows that do one or the other, etc.
  • @zbraniecki I wouldn't recommend we only look at Fluent, but what the Fluent parser does well is it elegantly handles the problem of borrowed vs owned of strings. It allows you to avoid compromises. Originally you were asking about enums with manual dispatch vs dynamic dispatch via traits. What you can end up with is a zero-allocation parser, except for Unicode escapes.
  • @sffc For now, I think the main design consideration is that we should support borrowing strings. As for when and where we do the Unicode escape resolution, and how exactly that works, there are a few options on the table for that. For now, we should not design it in a way that we can't borrow strings for messages. I feel that Unicode escapes are designed to not need to be very common. The syntax was chosen to avoid escaping by using pipe delimiters, etc. Having escapes would be in the cold path, and not having any escapes would be the common path. We should have borrowing and a lifetime, and we should keep the option open for unescaping. All of the options involve having a lifetime. There are different ways to do this.
  • @Manishearth I was thinking of adding a flag. The way I see this, deferring escaping is fine. You parse a string/message and format it once. The main cost is, when you're parsing it, you need to go look for escapes, and then when you're formatting it, you need to look for escapes. A type called MaybeEscapedString that's a boolean choice between an escaped string and an unescaped one; then we already have a type that we can turn into a generic if/when we need to convert the AST into a modifiable one (see the sketch after this list).
  • @zbraniecki - Here we have a manually written escape->unescape and unescape->escape - https://github.com/projectfluent/fluent-rs/blob/main/fluent-syntax/src/unicode.rs
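A minimal sketch of that flag-carrying string type, assuming text is borrowed from the source; the name MaybeEscapedText and the escape handling are illustrative only, and a real implementation would follow the spec's exact escape rules.

/// Borrows the source text and only resolves escape sequences lazily,
/// at the point where the text is actually written out.
#[derive(Clone, Copy)]
struct MaybeEscapedText<'a> {
    raw: &'a str,
    /// Set by the parser if it saw a backslash while scanning this span.
    has_escapes: bool,
}

impl<'a> MaybeEscapedText<'a> {
    fn write_to(&self, out: &mut String) {
        if !self.has_escapes {
            // Common path: a straight copy of the borrowed slice.
            out.push_str(self.raw);
        } else {
            // Cold path: resolve simple character escapes (e.g. \{ \| \} \\)
            // while copying; handling of invalid escapes is omitted.
            let mut chars = self.raw.chars();
            while let Some(c) = chars.next() {
                if c == '\\' {
                    if let Some(next) = chars.next() {
                        out.push(next);
                    }
                } else {
                    out.push(c);
                }
            }
        }
    }
}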

Implementation options for the result of parsing a message

  • @echeran You could have a model that gives a "DOM-based" (nested) or "string-based" (linear token stream) structure. They have pros and cons. In MessageFormat, we could do either of those types of approaches. There are a variable number of branch arms and attributes and things, which require allocations in the nested approach. The token stream approach needs extra types.
  • @zbraniecki A few questions. I'm not familiar with Option 2 for the parser implementation that is stream-based. I see the argument that you could have a token stream out of the input. There is a concept of LL(1) and LL(2). Some languages require a different number of lookahead tokens to understand the meaning of a token. I'm not sure what MF2.0 ended up being. If you want to avoid any allocation, it means that you are not recording any state. If you want to view something twice, you have to parse it twice.
  • @Manishearth It's in-between. And we're avoiding nested allocations, not all allocations. The message is parsed and tokenized, but it is structured into a stream of things rather than a nested structure. It depends on how jumpy message formatting is. I believe message formatting is not very jumpy: you walk forward.
  • @zbraniecki If this is a simple message, it may be made of a few tokens. If there are 3 or 4 variants in a select message, you have to stream to the end of
  • @Manishearth Can you give an example?
.input {$count :number}
.match $count
one {{You have {$count} notification.}}
*   {{You have {$count} notifications.}}
  • @zbraniecki So, let's say you start at the top. You parse the .input and the .match. Let's say you have type "other". How do I know where I need to go next?
  • @Manishearth We had a little discussion about this before.
  • @zbraniecki So I have to stream all of the selector values and variants to know which variant branch to select, and then I have to restream the message to find and parse the selected variant pattern.
  • @Manishearth There are two ways to represent this:
# Nested:
 - input (count number)
 - match count
     - one
         - you have {count} notif
    - star
        - blah

# Manish Linear:
- INPUT (count, number)
- MATCH (count, 2)
- INDEX 0
- INDEX 4
- ARM one
- you have
- VAR count
- notif
- ARM star
- you have
- VAR count
- notifs


# Shane Linear:
- input ($count, :number)
- match ($count)
- arm ("one", Option<3>)
- literal
- placeholder ($count)
- literal
- arm ("*", Option<3>)
- literal
- placeholder ($count)
- literal
  • @Manishearth Producing the correct variant/arm for the linear model
  • @sffc My version of the linear model is pretty similar. If you're parsing directly from the message, then you don't have to calculate the number of tokens for each arm (variant). The only reason that I have the number 3 to indicate the number of tokens, which is optional, is that it allows you to skip ahead to the next arm, but it is not strictly necessary.
  • @Manishearth @sffc's model is similar to my model in terms of performance. The main difference between... The cost of processing a match:

Nested:

  1. load arm locations (vector allocation)
  2. load arm1's discriminant
  3. check
  4. load arm2's discriminant
  5. check

Linear (Shane / manish):

  1. load arm1's arm location
  2. load arm1's discriminant
  3. check
  4. load arm2's arm location
  5. load arm2's discriminant
  6. check

Linear without arm lengths (Shane with Option = None):

  1. load arm1's discriminant
  2. check
  3. loop a. load next entry b. check if arm
  4. load arm2's discriminant
  5. check
  6. ...
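As a purely illustrative reading of the "linear without arm lengths" walk above: check each arm marker's key and otherwise keep walking forward. LinearPart and select_arm are invented names, and the spec's full best-match rules (including the * catch-all) are omitted.

#[derive(Clone, Copy)]
enum LinearPart<'a> {
    Arm { key: &'a str },
    Literal(&'a str),
    Placeholder { var: &'a str },
}

/// Returns the index of the first part after the matching arm marker.
fn select_arm(parts: &[LinearPart<'_>], selected_key: &str) -> Option<usize> {
    let mut i = 0;
    while i < parts.len() {
        if let LinearPart::Arm { key } = parts[i] {
            if key == selected_key {
                return Some(i + 1);
            }
        }
        // Either not an arm marker, or not the arm we want: keep walking.
        i += 1;
    }
    None
}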

Discussion:

  • @Manishearth This cost is not a huge deal in the linear model. These values will be in cache anyways. The benefit of not allocating will outweigh the cost of the additional checks in the linear model (true in both the @Manishearth / @sffc versions of the linear model)
  • @zbraniecki Are the benefits (1) prematurely parsing variants that are not selected, (2) co-locating variants in memory?
  • @Manishearth We're still parsing everything.
// Zibi Variant

struct SelectMessage {
  meta: ..., // Vec
  variants: Vec<Variant>,
}

struct Variant {
  qualifier: Qualifier, // Vec<Argument>
  value: Value
}

enum Value {
  Parsed(ParsedValue), // Vec<ValueElements>
  UnparsedValue(&str/String)
}
  • @zbraniecki My argument here is that, unless we can measure and show performance differences, this type of message means that variants can be lazily parsed. I'm not sure that there will be a noticeable difference in runtime performance for a select message between the nested parser result model vs. the stream parser result model.
  • @Manishearth I think a core contention is, what we expect the most common path to be. Is it, you parse and format a message once, or you parse and format multiple times? Things like not prematurely parsing variants is a good optimization for parse-and-format-once, but not for formatting multiple times. I think we should talk about the usage patterns explicitly. If the design of MF2 means that avoiding premature parsing is easy, that's good and we could try that. We could possibly use multiple AST designs. We probably need 2 for zero-copy.
  • @Manishearth Earlier someone said matches cannot be nested. But they can be in series? Our problem is nested allocations. I was wondering if we could do an HIR-style design:
// Manish's "HIR" model with side tables
struct SelectMessage {
  inputs: Vec<Input> // Input is Copy. Maybe a Litemap?
  stuff: Vec<Value> // Stuff is Copy
  variants: Vec<Variant>,
    
  // Potentially?? And then attribute-having AST entries contain a Range<usize>.
  attributes: Vec<Attribute>?,
  // Potential zerovec-friendly linearized version:
  // variant_values: Vec<Value>,
  // variant_patterns: Vec<(&str, usize)>,
}

enum Value {
    Literal(&str), // Followed by attributes?
    Variable(&str), // Followed by attribute
    Match,
    Markup(MarkupKind), // Followed by attributes
    FunctionCall(FunctionCall), // Followed by attributes, then Value operand (NOT the reverse)
    Attribute(Attribute),
}

struct FunctionCall {
    name: &str,
    operand: &str,
}

struct Attribute {
    name: &str,
    value: &str,
    
}

struct Variant {
   match_values: Vec<Value>
   pattern: &str
}
  • @Manishearth The main thing we want to avoid is a vec of 30 elements, each of which has a destructor. But if we can flatten this a bit, we reduce that cost. I'm not sure about how to handle attributes.
  • @sffc Are there multiple .match in a message?
  • @zbraniecki No, there is a single match with multiple dimensions.
  • @sffc Going back to @zbraniecki's comments regarding the performance of deferred parsing, I'm not convinced that there is much benefit to deferring the parsing of a variant. You can't avoid tokenizing the message because you still have to know where the curly braces occur. Turning that into the linear model is very cheap. The linear model is a slightly more useful version of the raw token stream.
  • @Manishearth In general, parsing tends to be more expensive than tokenizing. There is value there (in parsing), but we shouldn't go down that route yet. The HIR model gives a clear place where unparsed values go.
  • @sffc To comment on the HIR variant, it alleviates my concerns about nested allocations. There are still some concerns about how it would be modeled in the ZeroVec universe. We can probably work on this a little bit more, and see if Variant can not be nested. If we can linearize match values... I'm still thinking about the question: is the usage pattern to parse and format a message at the same time, and is that going to be common? Or do we parse and store ahead of time, and then format later? I implemented the icu_pattern crate to have something similar to the linear model. Both the nested model and the HIR model require more manipulations and calculations during parsing, whereas the linear model is cheap to calculate from the raw token stream. Also, with the linear model, you only need one model for both parse-and-format and save-your-parsed-model-and-format. icu_datetime also has one model serving both use cases.
  • @Manishearth I think parsing is costly either way.
  • @zbraniecki We can really quickly generate a minimally viable AST for the different options that we're proposing. I think we're now past the point where debating further is worth it before we implement and compare. If we are successful with MF2.0, then the variety of use cases that it will be used for will exceed the use cases we're discussing here. For Firefox, we had 15000 messages. On some screens, we had 2000 messages. It depends on the container. My understanding of how Google handles localization is that there is a JSON container that stores messages keyed by an identifier, and you fetch the messages that you need using those identifiers. The use case for a message that gets reused is something that dynamically changes, ex: a dialog that indicates the remaining time to download. It constantly gets updated with new interpolated values. I think I can prove that there is a benefit to not parsing every variant of every message, similar to how there's a benefit to not parsing every message of a message resource file. If we can use memchr to scan the beginnings of every variant and ... , then that would prove the performance benefit. That can be benchmarked using criterion or iai.
  • @Manishearth We're not talking about the message bundle here. My general take for eagerly vs lazily parsed values, with the right design we can figure this out later. So far, all of the designs listed, you make an UnparsedValue variant. I don't think the HIR model requires parsing a message multiple times.
  • @sffc I think there are two main reasons for looking at the linear model over the nested model: 1) the nested Drop impls needed for allocations, 2) unifying ASTs between the different call patterns. With the comment about parsing being more expensive, you have to store the parsed result somewhere in memory, whereas the linear/streaming model doesn't require storing the result. But I think the HIR model is strictly better than the nested model.
  • @Manishearth With the HIR model, we could possibly pull attributes out.
  • @Manishearth The parser could be modeled to generate a stream of parser elements (higher level than tokens, but not an AST), and you could either convert that to an HIR or format it directly. So I would start with a parser that generates the parts that this thing uses.
enum ParseElement {
    Input(Input),
    Value(Value), // Ranges are all 0 in the streaming case, HIR will later fix up ranges
    Variant(Variant),
    Attribute(Attribute),
}
  • @Manishearth Streamed ParseElement is easily converted to HIR, but can also be streaming formatted, and that design flexibility lets us pick one or do both.

Parser implementation strategies

Option 1 - Write parser code by hand

Option 2 - Some type of Rust parser generator using the MF2.0 grammar in ABNF notation

  • @zbraniecki nom is a good option
  • @sffc Erik tried using nom before but it is a big dependency for the library code
  • @Manishearth The syntax seems simple enough that we should just parse it ourselves
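For a sense of scale, here is a deliberately tiny hand-written scanner covering only the simplest case (literal text plus {$variable} placeholders). It is an illustration of why Option 1 is feasible, not a real MF2 parser; it ignores quoting, escapes, functions, markup, and .match.

#[derive(Debug, PartialEq)]
enum SimpleToken<'a> {
    Text(&'a str),
    Variable(&'a str),
}

fn scan_simple_pattern(src: &str) -> Result<Vec<SimpleToken<'_>>, &'static str> {
    let mut out = Vec::new();
    let mut rest = src;
    while !rest.is_empty() {
        match rest.find('{') {
            None => {
                // No more placeholders: the rest is literal text.
                out.push(SimpleToken::Text(rest));
                break;
            }
            Some(open) => {
                if open > 0 {
                    out.push(SimpleToken::Text(&rest[..open]));
                }
                let after = &rest[open + 1..];
                let close = after.find('}').ok_or("unclosed placeholder")?;
                let name = after[..close]
                    .trim()
                    .strip_prefix('$')
                    .ok_or("only $variable placeholders are handled here")?;
                out.push(SimpleToken::Variable(name));
                rest = &after[close + 1..];
            }
        }
    }
    Ok(out)
}

// e.g. scan_simple_pattern("You have {$count} notifications.")
// => [Text("You have "), Variable("count"), Text(" notifications.")]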

Question on representing MF2.0 typeless function values in Rust

Proposed direction as of 2024-12: let's defer this design question for the future. In the meantime, we can work on the design and implementation of the currently clarified parts of the design.

  • @zbraniecki: here's how we did it in Fluent:
  • Custom Functions: https://github.com/projectfluent/fluent-rs/blob/main/fluent-bundle/src/entry.rs#L13-L14
  • Custom Types https://github.com/projectfluent/fluent-rs/blob/main/fluent-bundle/src/types/mod.rs#L79

Use cases other than formatting

  • @zbraniecki In pursuit of runtime performance, we're considering introducing the structure a parser produces that is not suitable for anything other than runtime parsing. Right? Like, none of these things support other use cases like serializing, syntax highlighting in CAT tools, analysis of messages.
  • @echeran I imagine it would be possible to create transformation code to go to different models.
  • @zbraniecki Are there other use cases?
  • @echeran What I'm trying to say is, it's not binary. It doesn't need to be completely separate from other use cases. This structure could be convertible to something that is better suited for that.
  • @Manishearth I think our primary use case has been formatting. For other use cases, like dev tools, I am less interested. I would love our solution to support both. But I don't want them to be our primary case. Serializing is generally not something we try to optimize in ICU4X.
  • @zbraniecki One of these models can be extended to parse, modify, serialize. Others can handle partially parsed messages.
  • @echeran We can't solve a use case like modification in the same way as formatting.
  • @zbraniecki If we can get a model that is equally fast for formatting as for other use cases, that seems like it would be the better model.
  • @sffc Regarding @zbraniecki's comment, I would not be surprised if a fully nested structure model were on par in performance with an HIR model. What I wanted to see is whether a streaming parser would reduce code size, and whether it would improve performance by not having the Drop impls for allocations. It's possible that other use cases can be handled by the HIR or streaming model, and that's fine. But it's good to talk about such use cases concretely, and not in the abstract.

Meeting chat log

Zibi Braniecki 8:49 AM

Here's an example of Fluent AST - https://github.com/projectfluent/fluent-rs/blob/main/fluent-syntax/src/ast/mod.rs#L1010 notice the "S" instead of &str or String and then we have : https://github.com/projectfluent/fluent-rs/blob/main/fluent-syntax/src/parser/slice.rs

which implements Slice for &str and String that means you can have AST<String> or AST<&str> - both work

Zibi Braniecki 8:50 AM

Here we have a manually written escape->unescape and unescape->escape - https://github.com/projectfluent/fluent-rs/blob/main/fluent-syntax/src/unicode.rs

Zibi Braniecki 9:49 AM

https://github.com/BurntSushi/memchr

Zibi Braniecki 9:56 AM

https://crates.io/crates/nom

Zibi Braniecki 9:59 AM

here - https://github.com/duesee/abnf let's not use it, but it exists

Zibi Braniecki 10:01 AM

here's how we did it in Fluent:

  • Custom Functions: https://github.com/projectfluent/fluent-rs/blob/main/fluent-bundle/src/entry.rs#L13-L14
  • Custom Types https://github.com/projectfluent/fluent-rs/blob/main/fluent-bundle/src/types/mod.rs#L79

Manish Goregaokar 10:18 AM

by the way, I figured out how to do mutation in the HIR model there are some tradeoffs but it's doable via some different methods

Zibi Braniecki 10:21 AM

SIMD powered lexer - https://alic.dev/blog/fast-lexing

sffc avatar Dec 27 '24 18:12 sffc

I've got a bit more fleshed out version of the Streaming-vs-Structured multimodal design in https://docs.google.com/document/d/1VaP5ZMcQLsyxRzSJI-016-e4YEN-2IPS44bwFPgxpJM/edit?tab=t.0

Manishearth avatar Jan 01 '25 02:01 Manishearth

Detailed notes not taken. Summary:

  • @robertbastian suggests a static approach:
    • Such as what is used in Dart Intl
    • So that it works nicely with data slicing by default
    • Such as by having a language-agnostic structure; relationships between original and translations of the same message are strong, and less structure needs to be dynamically loaded
      • @Manishearth says that cross-language structure is represented elsewhere
    • Encodes that translators can't change formatting options
  • @sffc thinks it is okay to add a proc macro later to help with data slicing
  • @sffc thinks that an HIR-based approach is likely to have similar performance characteristics as other approaches such as codegen-based
  • @echeran wants a staged approach to implementation
    • With things that we already know that we need implemented, it makes sense to start there first

.input {$rank :number select=ordinal}
.input {$count :number}
.input {$count :number style=currency currency=USD}
.input {$count :number style=currency currency=$currency}

Rob's mental model:

Raw en:

.input {$count :number style=currency currency=$currency}
.local $total  = {42 :number style=decimal}
.match $count
one {{You have {$count} money out of {$total}.}}
*   {{You have {$count} moneys out of {$total}.}}

Raw fr:

.input {$count :number style=currency currency=$currency}
.local $total  = {42 :number style=decimal}
.match $count
0   {{Vous n'avez pas d'argent.}}
one {{Vous avez {$count} argent sur {$total}.}}
*   {{Vous avez {$count} argents sur {$total}.}}

==>

Skeleton (can be available at compile time):

<has inputs called $currency, $count>
count = CurrencyFormat(<locale>, count, currency)
total = DecimalFormat(<locale>, 42)
.match $count
< <locale> pattern keyed by plural category with 2 placeholders> 

en (loaded at runtime):

one {{You have {$0} money out of {$1}.}}
*   {{You have {$0} moneys.}}

fr (loaded at runtime):

0   {{Vous n'avez pas d'argent.}}
one {{Vous avez {$0} argent sur {$1}.}}
*   {{Vous avez {$0} argents sur {$1}.}}

@manishearth - this design would need to be extended to support different locals in different languages

Rob:

make_skeleton!(MyMessageSkeleton, "message format code for english goes here", my_custom_fn(...) {});

struct MyMessageSkeleton {
    CurrencyFormatter,
    DecimalFormatter,
    input information
}
impl MyMessageSkeleton {
    type DataPack: Deserialize;
}

let fr_pack: MyMessageSkeleton::DataPack = ...;

let skeleton = MyMessageSkeleton::try_new_unstable(
    &currency_provider,
    &decimal_provider,
    fr_pack,
    "fr"
)?;
println!("{}", skeleton.format("CAD", 7));

Shane:

// Different ways of making a function registry:
// 1. Compiled data:
let message_specific_function_registry = icu::message::pre_process!("<message in any language, known at compile time>");
// 2. Buffer data:
let message_specific_function_registry = icu::message::pre_process_with_buffer_provider!("<message in any language, known at compile time>", &buffer_provider);
// 3. Not specific to a single message:
let omni_function_registry = icu::message::DefaultRegistry::new();
// 4. Not specific to a single message with buffer provider:
let omni_function_registry = icu::message::DefaultRegistry::try_new_with_buffer_provider(&buffer_provider);

struct AnyMessageFormatter {
    hir: MessageHir,
    formatters: Map<PlaceholderId, Box<dyn Formatter>>,
}

// Built message formatter (allocates the formatters):
let fr_message = icu::message::Message::try_from_str("<message in french>")?;
let fr_message_formatter = icu::message::AnyMessageFormatter::try_new(locale!("fr"), &fr_message, &message_specific_function_registry)?;
println!("{}", fr_message.format(options));

// Streaming message formatter (no allocations):
let fr_message = icu::message::Message::try_from_str("<message in french>")?;
println!("{}", icu::message::try_format_now(locale!("fr"), &fr_message, &message_specific_function_registry, options)?);

// Built message formatter: (Manish)
let fr_message = icu::message::Message::try_from_str("<message in french>")?; // or deserialize
let fr_message_formatter = SpecificMessageFormatter::try_new(locale!("fr"), &fr_message, &message_specific_function_registry)?; // SpecificMessageFormatter is generated by the proc macro
println!("{}", fr_message_formatter.format(options));

Manishearth avatar Feb 11 '25 12:02 Manishearth

Discussion with Eemeli on ICU4X-TC approving the spec , at a WG meeting.

  • @eemeli Hi, I've been working on the MF2 spec under CLDR. That is now at a stage where it is a Final Candidate, and we would like to officially finalize it in the CLDR/LDML 47 release that is upcoming. Effectively, there's been a concern raised from the ICU-TC about wanting to have more time to get their API for this implemented and reviewed. To accommodate that, the MF WG is proposing (final decision from CLDR TC)... we think that everything is pretty much ready, and if any parts need to be held back at Final Candidate, the easiest way to do this would be for the set of functions... to mark most of them as Proposed rather than Final, and then to finalize the spec with only the functions :string, :number, and :integer as finalized, and later on, to hopefully publish the rest of the functions as final. As far as I know, I'm here to look for approval from this TC for this plan of action.
  • @echeran I've been in touch about the concerns from ICU-TC. They were largely about, what happens when you implement the spec in a statically-typed language, and the technical requirements to implement these types of things. There are differences on how the C++ implementation works vs Java. The C++ implementation in particular is a bit more complicated. The other aspect here is, I'm also in the ICU4X WG, and I'm working on implementing a design that we're coming up with together for MF2 in ICU4X. So I don't know yet how these types of concerns will play out, but I predict that the C++ complexities required to implement MF2 might be part of the coding work in ICU4X. So, the idea that we're at the finish line and we're done, we just rubber stamp, ... I think that's glossing over a lot of things.
  • @Manishearth I wanted to understand, is the goal of this discussion to go through the technical details and say what we like and don't like about them? I've looked at the MF spec from an implementation perspective. I definitely have questions about some things. I don't know if that rabbit hole is useful here. I also recognize that we're late in the process, and I don't want to open something up late.
  • @eemeli If there is implementer feedback regarding the spec, or questions I might be able to clarify, that would be a useful use of my time. I'm not aware of the MFWG having received feedback from the ICU4X TC thus far. If there are concerns from here, that would be best heard ASAP, because we are close to the point where we think we are done.
  • @echeran It's good to clarify what the concerns were. @eemeli's position originally was that we just go ahead and declare the API versions as final, and we weren't ready to do that, right?
  • @eemeli We're in a situation where we are... the CLDR TC is making the call here... I'm not sure if any of us here are on that TC, but they are still taking input from within the Unicode organization. The MFWG is saying this thing is done, and we've had it open for review for a while, and I'm not aware of the MFWG or the CLDR-TC having heard anything from the ICU4X-TC. So if there are concerns that this TC has regarding the MF spec, they would like to hear about that.
  • @echeran An important caveat for what you just said is, yes, the spec has been available for review for a while, but it's also been a moving target. Being in the MFWG, I can attest to that.
  • @sffc It would be good if we could have had feedback written up and posted asynchronously. I'm happy about the changes that you've made to the function registry, which addressed a lot of concerns. We have not formally given a position on the ICU4X-TC's concerns. We would like to get out our concerns at a high level, and then give a more detailed recommendation. I would like to do that in this meeting. Also, we are in the early stages of implementing MF2. @Manishearth will share some of the early feedback. I don't know what else we will encounter during implementation. A position we could state, whether or not CLDR-TC will follow, is that ICU4X-TC would like to see an implementation finished before the MF2 spec is finalized.
  • @Manishearth First thing I'll do... I have a design doc for how the MF pipeline could be represented in ICU4X. Note: terminology in the doc comes from the website https://messageformat.dev rather than using the MF2 spec terminology.
  • @Manishearth The concrete feedback was: attributes seem to be a bit confusing, and from an implementation perspective, it would be useful to know how often we expect them to be used. It would be good to know the use cases, so we can know how likely users are to use them. Maybe they are supposed to be a general-purpose extension mechanism. The attributes complicate the AST in a way that introduces performance tradeoffs we need to make. This is more a request for clarity than feedback, though our feedback would be that it would be easier if attributes were accepted in multiple places.
  • @Manishearth The document lists all of the requirements: parse-once-format-once and parse-once-format-many. @zbraniecki has mentioned being able to modify the AST as something he's interested in. We're thinking about that but not seriously unless/until a use case is there. As far as developer tools, that is not a use case we're considering in scope.
  • @eemeli You should be able to discard attributes during formatting because they are not to have any impact on formatting. They are intended to communicate information to translators, etc.
  • @Manishearth Including can copy on formatting?
  • @eemeli Yes.
  • @Manishearth So then the only variable-length thing in the AST are function arguments?
  • @eemeli Function options.
  • @Manishearth So overall, the amount of nesting in this syntax is, functions can have options, and if you wish to treat markup as nesting, you can, but you don't have to. So it's a very flat level of nesting. And for formatting, you don't need to stick attributes everywhere.
  • @Manishearth My other feedback was, it seems like you only support one match in the whole thing. Is that a concrete desire?
  • @eemeli There will only be up to one match statement, but that statement might have multiple variable references in it.
  • @sffc There is the u:locale attribute. I commented that it might be problematic.
  • @eemeli It is not an attribute. It is an option that can apply to every function; rather than being passed on to every function handler, it can be used to change the function locale (and similarly to change the bidi text direction).
  • @sffc I opened an issue on whether that complicates our implementation. Without the u:locale attribute, there is one type of locale that is required to format a message. But now with it, we are required to parse the message at runtime to know the locale to format.
  • @Manishearth The way that ICU4X handles things is that you create an object with a locale, and that object informs how you format everything. That allows you, at compile time, to only load data for the locales of interest. For MF, you would create a MessageFormatter object, and then you would format it with runtime args, etc. But if now there is a requirement for multiple locales specified in this way, that puts a wrench in the works of our design and the way things work for ICU4X.
  • @Manishearth In general, there seem to be 2 ways we could fail. It's possible a message has a custom function that wasn't in the registry. But it's also possible that the message has this u:locale thing going on. We have this concept of a "message skeleton" that describes what data a message needs.
  • @eemeli First of all, I'm noting that u:locale and u:dir allow the locale to be a variable reference, which means you don't know the result until formatting time. Regarding some of the rest of what you were saying, I'd like to mention that the MF specification does not have a concept of a source message matching a target message. It seems likely that there will be cases where the source message has a single pattern being formatted, but a target language will depend on some property to have variant patterns in it. I'd encourage you to not build in those types of expectations.
  • @Manishearth This goes into something that we discussed a little bit, although not fully. The current design, the doc linked above, is about runtime message formatting. And that would not have problems with plural selection messages because you would preload the maximal set of data you need for your locale. What it is a problem for is supporting the use case where you know at compile time that the locale is en, so you only load this data at compile time. We do this a lot in ICU4X to only inline such data. If you have a message that may need plural rules, even if the source message doesn't need it, the source message needs to be formatted as a one-variant plural selection message in order to allow translators/translations to use plural rules in the target language message. It's really good to understand how this u:locale works because of the effect on how ICU4X does data loading and optimization.
  • @sffc On u:locale, I'm still not convinced that it is a very well-motivated feature, and it complicates the design of data loading. The fact that the locale can depend on an input variable completely invalidates our approach to determine the locale needed for formatting at compile time. My recommendation would be, first, to not use u:locale, but if you insist on using it, then to require the value of u:locale to be fixed.
  • @eemeli Are you talking about a limitation you'd communicate to ICU4X users, or feedback for MF WG?
  • @sffc I'm talking about feedback to the MF2 spec. The only purpose of a spec is to ensure the interoperable behavior of all of the implementations. I am making this feedback so that we would have consistent implementations across the board without making ICU4X an exception.
  • @Manishearth Ideally, we would be able to throw an error if you have ___ . We don't like workflows where users have to retry their attempts all over in the case of errors. We also don't like workflows in which data is loaded on demand. It really complicates lifetimes. ICU4C just loads compiled data, so they don't care.
  • @sffc This is a concern for ICU4X that doesn't exist for ICU4C, so it is more important for ICU4X to speak loudly about this.
  • @eemeli I want to re-iterate that this is an optional part of the spec, and it is perfectly valid and fine for the ICU4X implementation to put limitations on it. What you mentioned about throwing an exception... the vast majority of cases will do the thing you said of having source and target match in their requirements. But do note that all of the options that we support, support variable options. So when doing unit formatting, or currency formatting, all of those, you don't necessarily know at parse time all of the ways in which... all of the patterns you might need when formatting, since they depend on runtime variable values.
  • @Manishearth I can understand from a syntax point of view... that options being runtime variables is a desired feature. For the functions currently supported, is there value to options being runtime? Is that value mostly just, so that we can write the options in variables so we don't have to type them again and again? Is there value in the options changing from locale to locale?
  • @eemeli There may be. When doing number or datetime formatting, one thing is to adjust the mapping of what comes out when the options bag uses the source locale. Introducing a dependency on the target locale comes from the formatting locale being a variable reference. One use case for using variable references is to allow user preferences, ex: having 24-hour time even when the locale is en-US.
  • @sffc For a currency formatter, you don't want the locale controlled by an input variable, because that can result in invalid formatting. That could be controlled by the currency amount. That would be an okay requirement because that is a sensible thing. I'm coming around to thinking that there are certain options in certain positions that should & shouldn't be allowed. What I mean by options are the external variables that are passed in at runtime. When a thing is an external variable, it is infectious. There should be restrictions on which options should be allowed to be external variables.
  • @hsivonen There was a mention that a feature was optional. Is an optional feature one that would be exposed to the web, so, if ICU4X were used as a backend for a web-exposed MF2 feature, for example as a backend for Intl.MessageFormat, would the optional feature effectively become required because of that?
  • @eemeli I would be okay accepting limitations on the u:locale option at least on the first pass if that is the only restriction. But yes, if there are concerns like this, the place to make sure they are enforced is not the MF2 spec but instead the TC39 proposal.
  • @sffc A general comment is that it continues to blow my mind that MFWG thinks it acceptable that different implementations can have different behaviors.
  • @eemeli If there is feedback that you would like to give that options ought to only have literal values, or that there should be limits on those, this is feedback that is potentially coming in rather late. And this topic was discussed in the WG about 3 years ago. You should sync up internally at Google with Mihai regarding the decision to go in this direction. One alternative direction was one that allowed variable references in an operand, but also allowed multiple operands to a function, where the function options only hold literal values. But I believe feedback that options should only allow literal values would be a fairly big change that the MFWG might not have appetite for.
  • @Manishearth I think we're talking about two extremes, and I don't think either is necessary. I don't think that "options are a free-for-all" and "options are literals" are the only two options. The thing we are concerned about is data-affecting options being runtime-dependent. Data, locale, ..., for example, in decimals, whether you want to affect the grouping, at least for some of these APIs, we can say that data loading needs to be runtime dependent. It may be possible to tweak the MF spec without much work to affect the definition of standard functions, even if not custom functions, to establish this requirement. I don't see how it's useful to allow arbitrary data loading at runtime.
  • @sffc I want to +1 what @Manishearth said. I'm not saying we should disallow all variables in options. All I'm saying is that we should disallow external variables being allowed in options. I don't think that was understood previously.
  • @eemeli Then I think the one action arising from here on you is to go through all the options on the functions in the spec and come up ASAP with a specific list of the options that you think ought to have this restriction put on them.
  • @Manishearth Aside from hour cycle and time zone, probably everything on datetime.
  • @sffc Can we do this only for the 3 functions that are proposed as finalized: :string, :number, :integer? Let's not discuss for :datetime since it's not proposed to be final yet.
  • @eemeli The thing I was going to say is noting explicitly that if you do need to leave out support for u:locale, it doesn't necessarily mean you need to leave out u:dir and u:id since the restrictions there seem different.
  • @Manishearth I'm proposing this ontology of data-loading-affecting options; u:locale is definitely one.
  • @hsivonen Do we have a characterization of what users of ICU4X are sensitive to... this dynamic data loading, and what users of ICU4X are expected to have baked data for all locales anyway? For example, if you're supporting datetime formatting with baked data, does that change the concerns here, and is there a correlation or anti-correlation between the sort of scenarios that would want to use a MF2 backend and have data-loading-sensitive options being dynamic? Currently Firefox for datetime formatting has all the data baked in.
  • @Manishearth This might be one of the cases where u:locale is different than currency or notation or something. In baked data, you can pre-bake what you need. However, if you start loading new kinds of data, it starts impacting your binary size. That's a cost borne by all users. Right now in ICU4X, you can choose to be runtime-dependent. Whether our users who use MF care about binary size enough to do that, our stance has generally been that the value proposition of ICU4X supports these things well.
  • @eemeli The question about currency and unit formatting: the MF2 spec is encouraging usage patterns where the currency value given at formatting time is a blob containing the currency amount and the currency code. Is that problematic from a data loading point of view?
  • @robertbastian No.
  • @Manishearth Currency is a weird thing. This is the ideal use case for currencies. Currencies and units share that property. This is why we should go through and audit the APIs. The goal is not to prevent useful patterns. The goal is to prevent patterns that don't have much use in the first place.
  • @Manishearth The last question was about locals. Locals are fine... they seem like a convenience transform. Inputs are... I have some struggle understanding them. What .input does is it takes an external input and it tags it with a formatted value, so you can still do a match on the original value, but then you can use it in a string as a formatted value. It is a funky variable shadow. Is this a correct characterization?
  • @eemeli The JS TC39 proposal for pattern extractors is kind of part of the .match protocol. The .input defines a function call that is applied to the named variable that comes from an external source. .input {$foo :number}
  • @eemeli The return value of the function handler :number is expected to be in MF terms a "resolved value", which is easiest to understand as an object that has capabilities like being formattable into a string or another representation, or being selectable-on.
  • @Manishearth When you call these things, you end up with 2 values. One is how it works in selection, and the other is how it works in formatting.
  • @eemeli A third is if you have .input {$foo :number} and later .local $bar = {$operand :number minimumSignificantDigits=$foo}. Variables can be used as operands of later functions. There is an example in the spec of what this could look like.
  • @Manishearth I will need to figure out how the third value affects our design. The fact that there are multiple values will complicate the design, but not too much.
  • @eemeli Note for the third value, there is a bag of resolved option values. So if you .input {$foo :number minimumSignificantDigits=2} and then you do a .local $bar = {$foo :number maximumSignificantDigits=5} then bar has minSignificantDigits = 2.
  • @robertbastian Is it possible for different implementations to emit different errors?
  • @eemeli Yes. We were not able, in the MFWG, to determine which programming model to use for returning errors.
  • @Manishearth And yes, we may end up in ICU4X using multiple programming models for our implementations.
  • @Manishearth Overall, this is a decent design from what I've seen, but I'll need to take a further look.
  • @sffc We've had a lot of questions. In terms of feedback, the one thing is the impact on our data loading framework.

Conclusion:

  • @Manishearth to make a writeup about the data loading architecture and restriction we may like to put on u:locale and other options, with a focus on the functions being proposed as final in CLDR 47. The ICU4X TC will sign-off and send this as official feedback.

Manishearth avatar Feb 13 '25 14:02 Manishearth

Summary so far:

  • Initial design doc: https://docs.google.com/document/d/1X5qiEK1swGYMblwbBcOu1HkiZUbL4yhnekVyzPPQo-o/edit?tab=t.0#heading=h.5rulnmr2njpb
  • Design doc for AST representation concerns per use case: https://docs.google.com/document/d/1VaP5ZMcQLsyxRzSJI-016-e4YEN-2IPS44bwFPgxpJM/edit?tab=t.0#heading=h.mwn95cq0jjht
  • @robertbastian suggests a static approach:
    • Such as what is used in Dart Intl
    • So that it works nicely with data slicing by default
    • Such as by having a language-agnostic structure; relationships between original and translations of the same message are strong, and less structure needs to be dynamically loaded
      • @Manishearth says that cross-language structure is represented elsewhere
    • Encodes that translators can't change formatting options
  • @sffc thinks it is okay to add a proc macro later to help with data slicing
  • @sffc thinks that an HIR-based approach is likely to have similar performance characteristics as other approaches such as codegen-based
  • @echeran wants a staged approach to implementation
    • With things that we already know that we need implemented, it makes sense to start there first

Summary:

  • The design in the MF2 design doc for AST representation per use case is sufficient for beginning initial implementation work
    • The parsing of the canonical syntax is a known requirement, and thus necessary to implement even if not sufficient
    • Any solution will require the parsing of a message into something structured
  • In parallel, we can work on the design in order to satisfy requirements for compile time optimizations

LGTM: @sffc @robertbastian @echeran @Manishearth @younies

Manishearth avatar Feb 14 '25 09:02 Manishearth

Note that @zbraniecki has a 3-year-old draft PR: https://github.com/unicode-org/icu4x/pull/2272

sffc avatar May 06 '25 20:05 sffc

@sffc and @robertbastian and I discussed the path forward around compiletime APIs.

Shane noted that the MF2.0 group has acknowledged a distinction between "link time" data loading and "runtime" data loading. There is not tons of alignment in the MF2.0 group that pre-declaring a maximal set of needed data at link time is an example of best practices, but they are on board with the general goal of optimizing runtime data loading.

We are wary of having a primary compiled data design that makes assumptions not well aligned with the MF2.0 team.

As such, our first pass at an implementation will have compiled APIs that link in all data, but selectively load data based on the parsed message. We may figure out some solution for preloading data or caching loaded data.

Manishearth avatar Aug 20 '25 23:08 Manishearth