Consider Fluent for localization

Open tv42 opened this issue 2 years ago • 2 comments

Right now, typst seems to contain hardcoded localization for English and German:

https://github.com/typst/typst/blob/045a109600fa9127d22259287bbde62234cadb44/library/src/meta/figure.rs#L89
https://github.com/typst/typst/blob/045a109600fa9127d22259287bbde62234cadb44/library/src/meta/heading.rs#L143
https://github.com/typst/typst/blob/045a109600fa9127d22259287bbde62234cadb44/library/src/meta/outline.rs#L179

and so on. This doesn't seem like a great way forward. And I'm saying this as a person whose native language is something different & rare, and whose local paper requirements, at least in the 90s, were... particular... about the localization.

I wanted to bring this project into your awareness: https://projectfluent.org/ & https://github.com/projectfluent/fluent-rs

With Fluent, you could have Typst users provide a mylanguage.fluent file that would allow them to localize these keywords. Fluent files for the common languages could be embedded in Typst, and behave like it does today. And Fluent would take care of any weird edge cases some languages might have, you'd just ask to evaluate template for label-figure, with figure number 21, and it would give you the right text, even if the language were to insist on e.g. funny number formatting. I don't know if Typst has any need to localize counting expressions, but you really don't want to figure out the rules of pluralizing things in various languages, it's not pretty.

tv42 avatar Mar 23 '23 16:03 tv42

Yes, the current thing is obviously not going to cut it. We'll look into it!

laurmaedje avatar Mar 23 '23 20:03 laurmaedje

Just to save you the trouble of research here (having implemented Fluent localizations for SILE and others), Fluent is the only really good way to go right now. In the medium-term future MessageFormat2 might be an alternative after it comes out of draft status, but even if you want to migrate to that later migrating from Fluent will be easier than any other current system because they are so similar. For the time being nothing even holds a candle to it as far as expressiveness and allowing the right balance between what bits programmers control and what bits language experts control. As a typesetting engine there may be relatively few places where this would matter, but do yourself a favor and use the best tools out of the gate when you do tackle localization.

There are several crate options with higher- and lower-level tooling. I don't know this project well enough to suggest what route to take, only that as far as localization ecosystems go, Fluent is better than the alternatives if you want the best results.

alerque avatar Mar 23 '23 21:03 alerque

I believe the number of language support PRs is increasing each passing day[^prs], so maybe we should really start discussing this.

I would volunteer to implement this, but I have no experience with adding i18n or even working with Fluent, so maybe I'm not the right person for it. But I could help in other ways.

What do you think, @laurmaedje and @reknih? If we do, maybe it would be best to postpone merging these kinds of PRs.

[^prs]: #565, #548, #525, #483, #481, #467, #413 ...

ararunaufc avatar Apr 04 '23 13:04 ararunaufc

IMHO merging the assorted language PRs is fine; when you do add support you can refactor all the strings collected from contributors to FTL files at that point. But yes this should be addressed soon because it will add tools that make it possible to do more than drop-in-replacement strings per language.

Since this app probably doesn't want to ship FTL files, we'll need some way to embed resources. Does the project already have a Rust crate or hand-rolled system in use for embedding asset files in the binary? The answer to that question will affect which Fluent libraries it makes sense to start with.

alerque avatar Apr 04 '23 13:04 alerque

I also think we should more seriously consider a way for users to localize the strings, but I have some concerns about complicating our lives with fluent. Current use case of localized strings with the LocalName trait doesn't use or need pluralization. Furthermore I don't think that pluralization will be needed in the foreseeable future, since googling latex pluralization (or something similar) doesn't return anything useful. If LaTeX doesn't need an advanced localization system like Fluent, I don't think that any typesetting system needs it.

Additionally I would like to raise an ease of use concern. A common use case of custom localization strings in LaTeX is to change how the caption is displayed on figure, table and similar elements. For instance a user wants to display the figure caption as Fig. 1 instead of Figure 1. In LaTeX the user would only need to redefine the \figurename. If we decide to use an advanced localization system like fluent, the user would need to learn a whole new syntax and work with additional files. This makes something as simple as changing the caption of a figure a much more difficult task in Typst than in LaTeX, which should probably raise some red flags.

I propose that we define a LocalNameKey enum that lists all the keys used for translations (figure, table, bibliography, heading, ...). In addition to that we should replace the LocalName trait with a Translator trait which provides a method that can return a local name when given a LocalNameKey and Lang. Inside the code we can define a default translator implementation that works similarly to the current implementations of LocalName. Translator should always be available on the TextElem, the same way as Lang is currently available. Lastly we should define a Typst function that a user can call to override a translation inside the default translator. The usage of the function should be something like setlocalname('sl', 'figure', 'Slika'). This way we would add a simple way to add new language support in Typst, since all the translation strings would be in one location and not scattered across all the elements that implement LocalName. We would also add a simple way for users to translate the strings without editing the code, and to change the captions in their document to be something different from the default ones.
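A minimal sketch of what that proposal could look like in Rust. Everything here is an assumption made for illustration: the enum variants, the `Lang` stand-in, the `local_name` method name, and the example strings are hypothetical and not taken from Typst's code.

```rust
/// Keys for the built-in strings that need a translation.
/// Hypothetical: the proposal names the enum but not its exact variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum LocalNameKey {
    Figure,
    Table,
    Bibliography,
    Heading,
}

/// Stand-in for Typst's language type; here just a language code.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Lang(pub &'static str);

/// Hypothetical replacement for the `LocalName` trait.
pub trait Translator {
    /// Returns the localized name for `key` in `lang`, if one is known.
    fn local_name(&self, key: LocalNameKey, lang: Lang) -> Option<String>;
}

/// Default translator: all built-in strings live in one place.
pub struct DefaultTranslator;

impl Translator for DefaultTranslator {
    fn local_name(&self, key: LocalNameKey, lang: Lang) -> Option<String> {
        // Illustrative strings only, not a copy of Typst's actual tables.
        let name = match (lang.0, key) {
            ("en", LocalNameKey::Figure) => "Figure",
            ("en", LocalNameKey::Table) => "Table",
            ("en", LocalNameKey::Bibliography) => "Bibliography",
            ("en", LocalNameKey::Heading) => "Section",
            ("de", LocalNameKey::Figure) => "Abbildung",
            ("de", LocalNameKey::Table) => "Tabelle",
            // Unknown language or key: the caller decides on a fallback.
            _ => return None,
        };
        Some(name.to_string())
    }
}
```

Returning `Option<String>` rather than a bare string leaves the fallback policy (for example, falling back to English) to the caller.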

viddrobnic avatar Apr 04 '23 18:04 viddrobnic

The caption part is being worked on (seems like a large update regarding that was merged recently), but I agree one should be able to change all other local strings from within typst, as that would enable anyone to adapt each string to their needs.

PgBiel avatar Apr 04 '23 18:04 PgBiel

@viddrobnic Just because Fluent has advanced features as far as giving potential control to translators doesn't make "advanced" some derogatory term. It doesn't have to be hard for end users to use! In fact I'd wager defining Fluent messages is way easier for most use cases than redefining a LaTeX command. And believe me, I've done lots of both. I've been typesetting multi-lingual documents in LaTeX for over two decades, and still do it every week. I'm also the maintainer of a typesetting engine and have written a Fluent library from scratch for it.

We use Fluent in SILE for localizing all strings being typeset and it works great. Want to change the automatic header in the ToC from "Table of Contents" to something else? Just set the Fluent message inline somewhere in your document before the ToC:

\ftl{tableofcontents-title = Map of Chapters}

...or alternatively provide an FTL file with your project that redefines any/all of the strings you want to override.

This can be used to support a language we don't have defined or just to do silly things like use different words than we picked by default.

Pluralization isn't the only use case for needing actual localization tools. The use case can be as simple as needing to reverse the chapter number string. For example the default localization in SILE for chapter headings in Turkish is:

book-chapter-title = Bölüm { $number }

... but it would be just as valid to use a different arrangement and some projects may want to specify it like this:

book-chapter-title = { $number }. Bölüm 

Since both ways are equally valid, whatever way we default to, somebody will want it different. A simple key/value map makes this cumbersome (whether in an input data table or setting variables or whatever), and the LaTeX approach of redefining commands is much more complicated than giving people a true localization function like Fluent, however many times you call the latter "advanced".
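To make the point about placeholders concrete, here is a toy sketch in Rust of substituting a `{ $number }` placeable. It is not the fluent-rs API and handles only a tiny subset of the syntax (no selectors, terms, or escapes); the function name `format_message` is made up for this example.

```rust
use std::collections::HashMap;

/// Replaces each `{ $name }` placeable in `pattern` with its argument.
/// Toy stand-in for a real Fluent formatter.
pub fn format_message(pattern: &str, args: &HashMap<&str, String>) -> String {
    let mut out = String::new();
    let mut rest = pattern;
    while let Some(start) = rest.find('{') {
        // Copy the literal text before the placeable.
        out.push_str(&rest[..start]);
        let after = &rest[start + 1..];
        match after.find('}') {
            Some(end) => {
                // " $number " -> "number"
                let name = after[..end].trim().trim_start_matches('$');
                match args.get(name) {
                    Some(value) => out.push_str(value),
                    None => {
                        // Unknown variable: keep the placeable visible.
                        out.push('{');
                        out.push_str(&after[..end]);
                        out.push('}');
                    }
                }
                rest = &after[end + 1..];
            }
            None => {
                // Unclosed brace: emit the remainder verbatim.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

With this, the code that typesets a chapter heading only passes `number`; whether the number comes before or after the word is decided entirely by the message string.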

You will also run into more and more hard cases as language support grows. Right now this project has been very western-language centric and picking off low-hanging fruit with the localizations. That's not meant as a criticism of the project as it has other strengths, but I'm just stepping in to suggest that trying to implement something "simple" for the sake of avoiding "advanced" tools will come around to haunt you later, when it is hard to change to the right tooling you should have used in the first place.

alerque avatar Apr 04 '23 20:04 alerque

@alerque I completely agree that advanced is not a bad thing. I also agree with you that there might be time when the more advanced features are needed and that we should prepare for that. However I disagree that the right way of future proofing is to go with the most advanced option right out of the gate. In my experience the saying "premature optimization leads to hell" is true.

Before continuing I would just like to comment on my simplicity concern. The original comment might be poorly worded, so let me try again. One of the biggest reasons why I am excited about Typst is the simplicity of it. It is the first alternative to Word that I came across and can recommend to my friends and family (who are not programmers) without getting the "I am not a programmer, I can't use this" response. So I strongly believe that a simple way of doing things must be kept. In the case of localization there should be a way to provide a custom translation of a string without ever hearing about Fluent. I think that providing a simple function that I have proposed is one way of doing it. It is also very similar to the table of contents example that you have given.

Let's get back to the "premature optimization leads to hell". From the open issues it seems to me that most of the current localization needs of the users can be satisfied with a simple key/value map. During my time of working with startups I have learned that it's usually better if first a simple feature is added, and only if there is still a need for a more advanced feature, that feature is added as well.

As stated before, future compatibility is a must. I think that my proposal is very much future compatible. From the user's side, they would first get a function that allows them to provide a custom translation of a string. If Fluent is added in the future, this function should still be kept. This way nothing breaks with an addition of Fluent support, and there is still a way for users to provide custom translations without knowing about Fluent.

From the programming side my implementation proposal is also future compatible:

  • Translator trait stays the same, as well as all the elements that use it.
  • Default implementation of Translator stays the same (if pluralization is still not needed for default strings).
  • The "override with simple function" implementation of Translator can stay the same and can be based on a key/value implementation.
  • An additional implementation of Translator which uses Fluent has to be added. This implementation should also include a Typst function that allows users to import custom Fluent files.

In this proposal, implementations of Translator should work with each other by embedding one another. For instance the override translator embeds the default translator. If the override translator doesn't contain a queried key, it returns what the default translator that it embeds returns. This way we can nest the translators as needed.
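The nesting described above could be sketched roughly as follows. This is an illustration under assumptions, not Typst code: string keys are used instead of the proposed enum to keep it short, and `OverrideTranslator` and `set_local_name` are hypothetical names.

```rust
use std::collections::HashMap;

pub trait Translator {
    fn local_name(&self, key: &str, lang: &str) -> Option<String>;
}

/// Built-in strings (illustrative values only).
pub struct DefaultTranslator;

impl Translator for DefaultTranslator {
    fn local_name(&self, key: &str, lang: &str) -> Option<String> {
        match (lang, key) {
            ("en", "figure") => Some("Figure".to_string()),
            ("en", "table") => Some("Table".to_string()),
            _ => None,
        }
    }
}

/// Wraps another translator and consults user overrides first.
pub struct OverrideTranslator<T: Translator> {
    inner: T,
    overrides: HashMap<(String, String), String>,
}

impl<T: Translator> OverrideTranslator<T> {
    pub fn new(inner: T) -> Self {
        Self { inner, overrides: HashMap::new() }
    }

    /// Mirrors the proposed `setlocalname('sl', 'figure', 'Slika')` call.
    pub fn set_local_name(&mut self, lang: &str, key: &str, value: &str) {
        self.overrides
            .insert((lang.to_string(), key.to_string()), value.to_string());
    }
}

impl<T: Translator> Translator for OverrideTranslator<T> {
    fn local_name(&self, key: &str, lang: &str) -> Option<String> {
        self.overrides
            .get(&(lang.to_string(), key.to_string()))
            .cloned()
            // Not overridden: defer to the embedded translator.
            .or_else(|| self.inner.local_name(key, lang))
    }
}
```

Because the wrapper is generic over `Translator`, a Fluent-backed implementation could later be slotted in as the inner translator without changing callers.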

In summary we can satisfy most of the users with a simple implementation that can be done in a single afternoon, and add Fluent in the future when needed, without breaking anything for the user or the programmers.

PS. My comments might come across as me not wanting Fluent in the project. This is definitely not the case and I know that some users might need it. I just believe that these sorts of features should start off simple and go from there when and if needed.

viddrobnic avatar Apr 05 '23 05:04 viddrobnic

I'm sorry, I hear what you are saying, but it sure sounds like you haven't actually used Fluent. It too can be implemented in an afternoon, or less depending on who does it.

What I hear you suggesting is that you want to write a custom function that your "non programmer" friends can call to inject translation strings. I fail to understand how rolling your own solution and exposing it in a new custom-built function that people have to learn would be easier on either the people implementing it or the people using it. What I am suggesting is using an existing library exactly tailored to the job and using it to expose something much closer to a key-value store: fill in the id you want to change and the new message you want to change it to. Fluent messages are just keys and message strings. You can use it as simply as a JSON lookup table if you want, but when you do need placeholders or context-sensitive variations the tooling will be there. You won't be able to bolt that onto a custom-rolled function easily at all.

As I see it the up-front cost is roughly equivalent, the end user experience can be just as simple, and the foundation for future expansion is much sounder.

alerque avatar Apr 05 '23 08:04 alerque

You are right, I haven't implemented Fluent in any of the projects. If it can be implemented just as quickly as a key/value store, then of course we should use Fluent.

Regarding a function for basic users: I never said that it must be based on a key/value store, it can definitely use Fluent in the background. The feature that I would like to see is a helper/shortcut function that allows a user to change a single translation string without having to know they are using Fluent in the background. I am pointing this out because the initial request for Fluent only mentions specifying translations in a separate file. This probably also comes down to how the docs for localization will be written. They should start with using a simple helper/shortcut function and only in the advanced section mention that localization is actually done with Fluent and how to take full advantage of it by creating a separate translations file.

viddrobnic avatar Apr 05 '23 15:04 viddrobnic

I think your concerns are completely orthogonal to Fluent vs. something else as an i18n system. There should be no reason you can't have what you want with Fluent being used under the hood.

> The feature that I would like to see is a helper/shortcut function that allows a user to change a single translation string without having to know they are using Fluent in the background.

Sure, you want a language function that can be used inside documents to reset a translation. No problem.

At the end of the day Fluent files are just a key-value map: they consist of a message ID (key) and the message (value). Both are just strings. No matter how simple or complex the message is, it can always be expressed as just a string, so you can easily come up with a function for the typst input language that sets these key/value pairs. Under the hood the terminology will be "loading a message into the resource"—a fluent resource being a collection of messages for a given locale.
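As a rough illustration of "loading a message into the resource", the sketch below treats a resource as a plain map from message ID to message string. It is a toy and makes several assumptions: it only understands single-line `id = value` entries and `#` comments, whereas real FTL also has multi-line patterns, attributes, and terms, and `load_messages` is a made-up name rather than a fluent-rs function.

```rust
use std::collections::HashMap;

/// Loads single-line `id = value` messages into `resource`.
/// A message loaded later replaces an earlier one with the same ID.
pub fn load_messages(resource: &mut HashMap<String, String>, source: &str) {
    for line in source.lines() {
        let line = line.trim();
        // Skip blank lines and comments.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // Split at the first `=`; the value may itself contain `=`.
        if let Some((id, value)) = line.split_once('=') {
            resource.insert(id.trim().to_string(), value.trim().to_string());
        }
    }
}
```

A document-level helper would then amount to calling `load_messages` with one line, such as `tableofcontents-title = Map of Chapters`, on top of the built-in defaults.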

The message string can optionally include special markup like placeholders or include variants for different contexts. Part of the reason Fluent is better than other choices in this domain is that it is up to the translator not the programmer to figure out how to handle variations between languages. In regard to your concern about a document needing a simple way to set a new message for a given ID, this means that the person providing the new string can decide whether just a hard coded output is sufficient or whether to make use of more advanced features to give different strings for different situations.

This is different from most localization systems where it is up to the programmer writing the code to "pick" the right translation/variant/etc. This makes it incumbent on the translators to always provide exactly the same "shape" of message(s): if the programmer that wrote the code expected a singular variant and a plural variant, a language that doesn't need variants at all is still forced to give both strings and a language that needs different variants for 1-2, 3-4, and 5+ is just out of luck.
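The difference can be sketched in Rust as follows. This is a deliberately simplified model, not how Fluent or CLDR plural rules actually work: variants are plain numeric ranges, the `Message` and `Variant` types are invented for the example, and the strings for the three-way language are placeholders rather than real translations.

```rust
/// One variant of a message, used when `count` falls in `min..=max`.
pub struct Variant {
    pub min: u64,
    pub max: u64,
    pub text: &'static str,
}

/// A message owns its variants, so each locale chooses how many it needs.
pub struct Message {
    pub variants: Vec<Variant>,
}

impl Message {
    /// The calling code only passes `count`; it never picks a variant.
    pub fn format(&self, count: u64) -> Option<String> {
        self.variants
            .iter()
            .find(|v| (v.min..=v.max).contains(&count))
            .map(|v| v.text.replace("{ $count }", &count.to_string()))
    }
}

/// English-like: singular for 1, plural otherwise (first match wins).
pub fn english() -> Message {
    Message {
        variants: vec![
            Variant { min: 1, max: 1, text: "{ $count } figure" },
            Variant { min: 0, max: u64::MAX, text: "{ $count } figures" },
        ],
    }
}

/// A language that needs no variants supplies a single string.
pub fn no_variants() -> Message {
    Message {
        variants: vec![Variant { min: 0, max: u64::MAX, text: "figure { $count }" }],
    }
}

/// Hypothetical language with forms for 1-2, 3-4, and 5+.
pub fn three_way() -> Message {
    Message {
        variants: vec![
            Variant { min: 1, max: 2, text: "{ $count } form-a" },
            Variant { min: 3, max: 4, text: "{ $count } form-b" },
            Variant { min: 5, max: u64::MAX, text: "{ $count } form-c" },
        ],
    }
}
```

The call site is the same `format(count)` for all three locales, which is the property the paragraph above attributes to Fluent: the shape of the message is the translator's decision.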

Of course the normal way to provide a localization for a whole app would be to have one FTL resource file per locale. For typst's use case I would expect these data files to live as separate files in the source repository but be compiled into the binary (embedded) at build time, so the resources are simply available with no support files. Documents would then also be able to use whatever helper function is provided in the markup language to load new messages replacing the default messages if they want to, or optionally provide an FTL file of their own to be included as a resource.

Right now when somebody comes along and wants to contribute a new language upstream, they have to sort through half a dozen completely different places buried in Rust code to do so! In the future just copying the most similar FTL message resource file to a new language name and filling in the strings would suffice. Hence it isn't just document authors but potential contributors that would find an "advanced" localization system easier to use than a home-grown "simple" solution.

alerque avatar Apr 05 '23 18:04 alerque

@viddrobnic , I understand your concerns with making the tool hard to use, but you mentioning that Fluent shouldn't be considered yet is what was causing a dispute.

Please notice that we don't just need pluralization. The i18n needs go from this to proper formatting for numbers, currencies, dates, etc., collation info for properly ordering strings (needed for bibliography listing), hyphenation rules, case-folding, and some more. (Some of it is already dealt with.)

The actual user-facing side of this being simple or not is a matter of proper planning and a good design. We don't have to make users go all the way down into the weeds for every usage.

From my point of view it makes a lot of sense to make the simple case simple and leave an escape hatch for the complicated or intricate cases. So, we might as well just use Fluent behind both options.

I don't know if the "premature optimization" reasoning applies easily in this case... It usually has to do with situations where you don't know or can't know if they'll be needed, but in our case it will definitely be needed. It's a matter of when, not if.

ararunaufc avatar Apr 06 '23 05:04 ararunaufc

Also, "pluralization" in LaTeX is usually dealt with ad hoc, exactly because there is no good way of handling that in it. But also, it's usually fine because when you're making cross references (for instance) you actually know what you're referring to and how many of them there are.

cleveref was (is?) a package made to deal with automatic "naming" and "pluralizing" cross references, but it just deals with the "one/many" kind of pluralization, making it of no use for some languages.

ararunaufc avatar Apr 06 '23 05:04 ararunaufc

I think I was a little misunderstood. My intention was not to suggest that Fluent should not be considered. Rather, I was wondering if there might be a more efficient way to achieve the current user requirements and then expand it later with Fluent. If implementing Fluent is just as easy as any other method, then of course there is no reason not to consider it.

Additionally, I think I may have been misunderstood in regards to my concern for ease of use. As you mentioned, it is important to make the simple case simple. I never said that a helper function shouldn't use Fluent under the hood. I have actually stated the opposite:

> Regarding a function for basic users: I never said that it must be based on a key/value store, it can definitely use Fluent in the background.

My main concern is with how the implementation is presented to the user. I have already said that this probably just comes down to how the localization documentation is written. For most users, the basic case should be shown as setting a custom translation without mentioning pluralization and Fluent. For more advanced users, there should be a section explaining that Fluent is being used under the hood and how to take full advantage of it.

I believe we all agree on what should be done. I propose that we conclude this unproductive discussion and instead focus on how to implement this feature effectively.

viddrobnic avatar Apr 06 '23 18:04 viddrobnic

We don't currently have the need for fluent, so I'll close this. If we do need more capability later, we can consider it.

laurmaedje avatar Jul 15 '24 19:07 laurmaedje