How do people check in CI that en.ftl is complete?
Unlike gettext messages, fluent messages don't have a guaranteed fallback (I'm not counting the fluent ID).
fish shell wants to have all messages available in English (which will be the unconditional fallback).
Their CI will probably check this by
- extracting the set of used fluent IDs from Rust sources via a proc-macro
- parsing `en.ftl` to get the set of fluent IDs with English translations
- checking that both sets of fluent IDs are equal
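The last two steps could be sketched roughly as below. This is an illustration, not fish's actual tooling: the ID extraction is a naive line-based scan (a robust check would use the `fluent-syntax` parser from fluent-rs), and the function names are made up.

```rust
use std::collections::HashSet;

// Naive ID extraction from FTL source: every top-level line of the form
// `some-id = ...` contributes `some-id`. Comment lines (starting with '#')
// and indented attribute/continuation lines are skipped. A real check
// would use the fluent-syntax parser instead of this line-based scan.
fn ftl_ids(ftl: &str) -> HashSet<&str> {
    ftl.lines()
        .filter(|l| !l.starts_with(char::is_whitespace) && !l.starts_with('#'))
        .filter_map(|l| l.split_once('=').map(|(id, _)| id.trim()))
        .filter(|id| !id.is_empty())
        .collect()
}

// CI check: the set of IDs used in the sources must equal the set of IDs
// defined in en.ftl; report both directions of the difference.
fn check(used: &HashSet<&str>, en: &HashSet<&str>) -> Result<(), String> {
    let missing: Vec<_> = used.difference(en).collect();
    let unused: Vec<_> = en.difference(used).collect();
    if missing.is_empty() && unused.is_empty() {
        Ok(())
    } else {
        Err(format!("missing from en.ftl: {missing:?}; unused in sources: {unused:?}"))
    }
}

fn main() {
    let en = ftl_ids("greeting = Hello\n# a comment\nfarewell = Bye\n");
    let used: HashSet<&str> = ["greeting", "farewell"].into_iter().collect();
    assert!(check(&used, &en).is_ok());
}
```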
Step 1 is a bit unorthodox, since there doesn't seem to be a clean way of extracting used IDs, especially in parallel compilation scenarios.
I suppose they could instead have en.ftl be the single source of fluent IDs,
and guard every use of fluent IDs from Rust with a macro that checks (at compile time) that the given message ID is actually contained in en.ftl.
This can be done with `include_str!("en.ftl")`, like done here, but it would probably be better to build a static set of fluent IDs (using the `phf` crate).
The downside of this alternative is that it doesn't allow finding unused entries in en.ftl.
I guess that could be mostly remedied with grep...
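For illustration, the compile-time guard could look like the following sketch. Everything here is hypothetical: an inline string stands in for `include_str!("en.ftl")`, the lookup is a naive `const fn` substring scan rather than a `phf` set, and `localize!` is a made-up macro name.

```rust
// Stand-in for include_str!("en.ftl"); in a real project this would be
// const EN_FTL: &str = include_str!("en.ftl");
const EN_FTL: &str = "greeting = Hello\nfarewell = Bye\n";

// Naive const-evaluable check: does the FTL source contain a line that
// starts with `id`, followed by optional spaces and '='? Not a real FTL
// parse, just enough for a compile-time containment check.
const fn ftl_contains_id(ftl: &str, id: &str) -> bool {
    let f = ftl.as_bytes();
    let m = id.as_bytes();
    let mut i = 0;
    while i + m.len() <= f.len() {
        // the candidate must start at the beginning of a line
        if i == 0 || f[i - 1] == b'\n' {
            let mut j = 0;
            while j < m.len() && f[i + j] == m[j] {
                j += 1;
            }
            if j == m.len() {
                // skip spaces, then require '=' so prefixes don't match
                let mut k = i + j;
                while k < f.len() && f[k] == b' ' {
                    k += 1;
                }
                if k < f.len() && f[k] == b'=' {
                    return true;
                }
            }
        }
        i += 1;
    }
    false
}

// Hypothetical wrapper: fails the build if the ID is missing from en.ftl.
macro_rules! localize {
    ($id:literal) => {{
        const _: () = assert!(ftl_contains_id(EN_FTL, $id), "message ID missing from en.ftl");
        $id // a real implementation would look the message up via fluent-rs here
    }};
}

fn main() {
    let msg = localize!("greeting");
    assert_eq!(msg, "greeting");
}
```

Since the check runs in a `const` context, a typo in a message ID becomes a compile error instead of a runtime fallback.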
I'm curious what other people use, since this seems like a common requirement.
cc @danielrainer
The paradigm that we use at Mozilla (stale info tho) is that you don't want your source code to be the source of truth. You want your source ftl files to be that.
Yes. It creates friction by requiring developers to put on the localizer hat and populate some ftl file rather than plastering English strings in source code, but it is exactly what you'd do in Web UI design with CSS classes: you'd go to your CSS file, add a class, and then bind it to elements in your HTML. We suggest the same. This model makes sure you don't have to scan sources, extract strings from code, etc. Of course you can still use some extractor to populate your ssot and then commit it, but your ssot should be ftl files, not `_("my English strings")`. That also incentivizes intelligible and humane l10n IDs rather than autogenerated slugs.
> your ssot should be ftl files
Yes, that's clearly how Fluent is designed and we plan to use it this way. At runtime, we use the functionality provided by fluent-rs to parse ftl files and to use them for localization. This works well and does not require extracting anything. However, there are no static checks. If we use a message ID which does not exist in the ftl file, or forget to specify a variable, localization will unexpectedly fail at runtime. This issue is about how best to avoid this using static checks that spot such problems ahead of time.
At the moment, we have a Rust macro which takes a message ID, as well as an arbitrary number of key-value pairs specifying variables and their values. When compiling normally, all this macro does is build FluentArgs from the key-value pairs, which are then used to call FluentBundle::format_pattern together with the message ID, for a bundle we select depending on the user's language settings. That's the boring part, which is not relevant for this issue.

When we add an extra Cargo feature flag, the message IDs are passed to a proc-macro, which allows us to automatically build a set of all message IDs we use. Then, we compare this set to the set of message IDs in our English ftl file and require that the sets be equal. We could expand this approach to also take the variables into account, but that would require a more complicated proc-macro and more complex ftl parse tree analysis, which we have not implemented yet.
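Stripped of the proc-macro extraction, the normal-compilation shape of such a macro might look like this sketch. The names (`localize!`, `format_message`) are hypothetical, and a plain `HashMap` stands in for fluent-rs's FluentArgs and the bundle lookup via FluentBundle::format_pattern.

```rust
// Hypothetical macro: takes a message ID plus key-value pairs for variables.
// In the real macro, the args would be a fluent_bundle::FluentArgs and the
// lookup would call FluentBundle::format_pattern on the bundle selected for
// the user's language.
macro_rules! localize {
    ($id:literal $(, $key:ident = $val:expr)* $(,)?) => {{
        let mut args: std::collections::HashMap<String, String> =
            std::collections::HashMap::new();
        $( args.insert(stringify!($key).to_string(), $val.to_string()); )*
        format_message($id, &args)
    }};
}

// Hypothetical stand-in for the fluent-rs lookup; just echoes its inputs.
fn format_message(id: &str, args: &std::collections::HashMap<String, String>) -> String {
    format!("{id} [{} variable(s)]", args.len())
}

fn main() {
    let s = localize!("remove-file-question", filename = "notes.txt");
    assert_eq!(s, "remove-file-question [1 variable(s)]");
}
```

With this shape, the feature-gated variant only needs to additionally forward each `$id` to a proc-macro that records it for the later set comparison.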
Our approach works, but it's not very elegant, especially because we currently write one file per message ID in our source code during ID extraction to avoid race conditions from multiple writers to the same file. It also requires significant custom tooling, which is impractical to re-implement for every user of Fluent.
For these reasons, we are checking here if other Fluent users have implemented similar functionality, since it would be beneficial to share this effort, instead of every Fluent user having to decide between not having these static checks in their project or reinventing/reimplementing them for that project.
Tangentially related, there are several other static checks which can be done just on ftl files, such as enforcing certain formatting, checking whether message IDs appearing in one ftl file also appear in another (we want the message IDs of every ftl file to be a subset of the ones of the English ftl file), or showing which message IDs are not yet present in a file (useful for letting translators know what to work on). It would also be nice for developers to have a way of renaming message IDs or variables, with the updates automatically being applied to the ftl files for all languages.

For a project like fish, which is transitioning from gettext, automating the conversion from PO to ftl files would also save a lot of effort and reduce the potential for mistakes. IDs and variable names would of course have to be chosen manually, but the rest can be automated. I started building several tools aiming to address these issues, discussed in https://github.com/fish-shell/fish-shell/discussions/12123. Here too, most of what we want for fish seems relevant for other projects using Fluent as well, so sharing ideas and implementation effort seems prudent.
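The subset check between locales is cheap once the ID sets are extracted; a sketch (the function name is made up, and the sets are assumed to have been extracted beforehand, e.g. with the `fluent-syntax` parser):

```rust
use std::collections::HashSet;

// Check that every message ID in a translation also exists in en.ftl.
// IDs with no English counterpart are an error (Err); otherwise the Ok
// value lists IDs still missing from the translation, which is exactly
// the "what should translators work on next" report.
fn compare(en: &HashSet<&str>, translated: &HashSet<&str>) -> Result<Vec<String>, Vec<String>> {
    let stray: Vec<String> = translated.difference(en).map(|s| s.to_string()).collect();
    if !stray.is_empty() {
        return Err(stray);
    }
    Ok(en.difference(translated).map(|s| s.to_string()).collect())
}

fn main() {
    let en: HashSet<&str> = ["greeting", "farewell"].into_iter().collect();
    let de: HashSet<&str> = ["greeting"].into_iter().collect();
    // de.ftl is a valid subset; "farewell" is still untranslated
    assert_eq!(compare(&en, &de), Ok(vec!["farewell".to_string()]));
}
```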
You could wrap your retrieval function in some wrapper/macro that in debug mode checks if a given l10n-id is available in registered FTL files.
Doing this statically, as suggested in the issue description, would work without having to extract anything from our sources, but has the problem that we won't be able to detect messages which are no longer in use, in addition to losing the ability to check for duplicate IDs. (We don't do that at the moment, but might want to in the future.)
If we do it dynamically, as in checks only happen at runtime when a particular message is retrieved, we'd need tests covering every possible message retrieval, which is impractical and wouldn't give us anything we wouldn't get via the static variant.
Maybe I misunderstand your comment and you have something else in mind.
That's a good point.
My intuition is that to capture the whole delta you need to scan for all l10n-id uses, which should still be much more scalable than trying to scan for all _("English String") uses (well, assuming you'll go for regex; you could also parse the Rust AST, I guess).
At runtime I recommend assuming that missing strings are not errors and the fallback offered by your l10n resource manager is correctly executing the fallback.
I can't find the reference discussion, but generally I think you're touching on a design weakness of Fluent - the separation between a missing l10n identifier as an error vs. as acceptable UX degradation is not well defined. Fluent assumes that fallback is preferred and executes it lazily until it runs out of locales (sync or async, your call). You can instrument error reporting to catch fallback instances, but that results in each missing string in any locale being reported as an error.
Since partial locales are supported and are encouraged (for example, es-MX should have only strings different from es and fallback to it) what you likely really want is to filter for source locale (likely en-001 or en in your case) and catch errors in that, but this is, as you pointed out, more costly, and happens at runtime.
To statically catch delta between FTL files and source code, you likely need to follow what other linters do for CSS or even JS - list of available ids, list of requested ids in source code, delta. There's no free lunch I guess :(
> for example, es-MX should have only strings different from es and fallback to it
That sounds nice. We've had a discussion about computing a distance metric between the specified language and the available translations so we can determine the fallback order, but we'd still need to decide on a cutoff distance, so it doesn't seem trivial.
Putting this heuristic in a shared crate should make it much easier. Not sure if this is in scope for fluent-fallback
Yeah, I'm currently working on icu4x locale negotiation, which is the last piece before I can move fluent-rs to the icu4x stack for locale negotiation. The new logic will allow for distance calculation and optionally use CLDR distance weights.
> At runtime I recommend assuming that missing strings are not errors and the fallback offered by your l10n resource manager is correctly executing the fallback.
Yes, the way we implement it is that we allow users to specify a precedence list of languages. At runtime, we then go through this precedence list, returning the message in the first language where localization succeeds. Implicitly, English is always last in this precedence list. For English, localization should always succeed, which is why we want static checks to ensure that this is actually the case, protecting us from runtime panics or broken/useless fallback messages when localization fails.
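A minimal sketch of that precedence walk, with nested HashMaps standing in for the per-language FluentBundles (the names are hypothetical):

```rust
use std::collections::HashMap;

// Stand-in for per-language FluentBundles: language tag -> (message ID -> text).
type Catalog<'a> = HashMap<&'a str, HashMap<&'a str, &'a str>>;

// Walk the user's precedence list and return the message from the first
// language in which it resolves; English is implicitly tried last. A miss
// even in English is the case the static checks are meant to rule out.
fn lookup<'a>(catalog: &Catalog<'a>, precedence: &[&str], id: &str) -> Option<&'a str> {
    precedence
        .iter()
        .copied()
        .chain(std::iter::once("en"))
        .find_map(|lang| catalog.get(lang).and_then(|msgs| msgs.get(id)).copied())
}

fn main() {
    let mut catalog: Catalog = Catalog::new();
    catalog.insert("en", HashMap::from([("greeting", "Hello")]));
    catalog.insert("de", HashMap::from([("greeting", "Hallo")]));
    assert_eq!(lookup(&catalog, &["de", "fr"], "greeting"), Some("Hallo"));
    assert_eq!(lookup(&catalog, &["fr"], "greeting"), Some("Hello")); // fell back to en
}
```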
> To statically catch delta between FTL files and source code, you likely need to follow what other linters do for CSS or even JS - list of available ids, list of requested ids in source code, delta.
Yes, that's what we thought and implemented, but we wanted to see if someone has come up with a better approach than what we do now.
> Yeah, I'm currently working on icu4x locale negotiation, which is the last piece before I can move fluent-rs to the icu4x stack for locale negotiation.
Sounds interesting. The main conceptual issue we have been discussing, and have not found a good solution for, is how we can simultaneously allow users to specify arbitrary language precedences while still having fallback between language variants. E.g., most users who specify es-MX as their first choice would probably be fine with messages from es if no Mexican variant is available, so automatic fallback from es-MX to es would be good for these cases. However, if this fallback happens automatically, users who want messages from es-MX but not from es have no way of indicating this preference, unless there is some user-configurable option which explicitly disables fallback between language variants.