coreutils
Starting with localization
TL;DR: I want to add a new util for locale generation and provide locale-aware functionality in uucore
uutils is currently following the C locale for most of its operations and the locale settings of the system are mostly ignored. This has led to issues and PRs like these:
- https://github.com/uutils/coreutils/issues/3584
- https://github.com/uutils/coreutils/issues/3123
- https://github.com/uutils/coreutils/issues/2149
- https://github.com/uutils/coreutils/issues/3132
- https://github.com/uutils/coreutils/issues/1872 (much locale-related missing functionality)
We've mostly been putting this off due to missing libraries in Rust, but this has recently changed with the release of icu4x. It covers many of the things we need, like locale-aware datetime formatting and locale-aware collation.
However, it requires data to operate on, which is different from the usual data generated by locale-gen and friends (if I understand correctly). There are essentially 2 viable ways to include data with icu4x[^1]:
- Store a blob on the filesystem to read at runtime (`BlobDataProvider`).
- Encode the data as Rust code included in the binary (`BakedDataProvider`).
Since we don't know up front what locales we might need, I think we need to use the BlobDataProvider and allow the user to generate their own locale data on command. So, I propose we do the following:
- Add a new util, called `locale-gen` or something similar.
  - This util downloads and stores the locale data in a global directory (I'm not sure where, could also be controlled by an environment variable).
  - This util would be a wrapper around the `icu_datagen` crate[^2].
  - It could also read from system config files and install any necessary locales based on the system config automatically.
  - Since this util needs access to the internet, we will run into issues similar to the ones we had with `uudoc` back when it automatically downloaded examples, so it needs to be optional.[^3]
- Create locale-aware functionality in `uucore` as much as possible, so that the utils themselves don't have to bother with checking the right environment variables, loading the icu data, etc.
  - For example, to check the collation locale, the `LC_COLLATE`, `LC_ALL` and `LANG` env vars need to be checked.
  - For the utils, we then just expose a `sort`/`collate` function that checks (and caches) the locale and performs the correct collation.
- Change the utils to use the locale-aware functions provided by `uucore`.
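
To make the `uucore` part a bit more concrete, here is a minimal sketch of the env var lookup and caching, using only the standard library. The names `resolve_category` and `collate_locale` are made up for illustration and are not existing `uucore` APIs. It assumes the usual POSIX precedence (`LC_ALL`, then the category variable, then `LANG`), treats empty values as unset, and falls back to the `C` locale.

```rust
use std::env;
use std::sync::OnceLock;

/// Resolve the locale name for one category, e.g. "LC_COLLATE".
/// `lookup` stands in for the environment so the logic is easy to test.
/// (Hypothetical helper, not part of uucore.)
pub fn resolve_category<F>(category: &str, lookup: F) -> String
where
    F: Fn(&str) -> Option<String>,
{
    // POSIX precedence: LC_ALL overrides the category, which overrides LANG.
    ["LC_ALL", category, "LANG"]
        .iter()
        .filter_map(|name| lookup(*name))
        // An empty value counts as unset.
        .find(|value| !value.is_empty())
        // Nothing set at all: fall back to the "C" locale.
        .unwrap_or_else(|| "C".to_string())
}

/// Collation locale of the current process. The real environment is read
/// on the first call only; later calls return the cached value.
/// Variables that are not valid Unicode are treated as unset.
pub fn collate_locale() -> &'static str {
    static CACHE: OnceLock<String> = OnceLock::new();
    CACHE.get_or_init(|| resolve_category("LC_COLLATE", |name| env::var(name).ok()))
}
```

A util would then only call something like `collate_locale()` and hand the result to the collation code. Mapping the returned name to an icu4x locale and loading the data are separate steps that this sketch leaves out.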
Do you see any problems with this approach? Are there alternatives we should explore first?
[^1]: They also have FsDataProvider which is meant for development only.
[^2]: This crate also has a CLI, but we need to tailor it for use with coreutils, by setting nicer defaults for our purpose.
[^3]: icu_datagen uses reqwest, which will lead to similar problems as in https://github.com/uutils/coreutils/pull/3184
There is also rust_icu, which is a wrapper around ICU4C, which works without additional datagen, but it's a big C dependency. So I guess we have to choose between C code or custom datagen?
I'm no longer sure rust_icu works without datagen. icu4c also has a different data format from POSIX. I think the only future-proof way forward is to embrace icu4x's data format. I wonder if the Unicode folks are willing to spec out some standard location for this data and provide some tools for managing it. It'd be nice if all applications built using icu4x that want to store the data in the filesystem could share their data.
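
Just to illustrate what a shared lookup order could look like, here is a rough sketch. Everything specific in it is an assumption: the `ICU4X_DATA_DIR` variable and the `icu4x` subdirectory are placeholders I made up, not an agreed-upon convention, and the fallbacks simply follow the XDG Base Directory layout on Unix-like systems.

```rust
use std::path::PathBuf;

/// Candidate directories for shared locale data, most specific first.
/// `lookup` stands in for the environment so the order can be tested.
/// `ICU4X_DATA_DIR` and the `icu4x` directory name are hypothetical.
pub fn data_dir_candidates<F>(lookup: F) -> Vec<PathBuf>
where
    F: Fn(&str) -> Option<String>,
{
    // Treat empty variables as unset.
    let get = |name: &str| lookup(name).filter(|value| !value.is_empty());
    let mut dirs = Vec::new();

    // 1. Explicit override through an environment variable.
    if let Some(dir) = get("ICU4X_DATA_DIR") {
        dirs.push(PathBuf::from(dir));
    }

    // 2. Per-user data directory: $XDG_DATA_HOME, else $HOME/.local/share.
    if let Some(base) = get("XDG_DATA_HOME") {
        dirs.push(PathBuf::from(base).join("icu4x"));
    } else if let Some(home) = get("HOME") {
        dirs.push(PathBuf::from(home).join(".local/share/icu4x"));
    }

    // 3. System-wide location (Unix-specific).
    dirs.push(PathBuf::from("/usr/share/icu4x"));
    dirs
}
```

An application would try the candidates in order and load the first blob it finds. Other platforms would need their own defaults.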
I was running into essentially the same problem for my own command line tools.
- Did you figure out a standard location to store the data?
- What about translations, icu4x seems to handle everything except for LC_MESSAGES? Or am I missing something?
- Could you consider putting the logic for locale env parsing, etc in a separate crate rather than uucore, so other projects outside of uutils can reuse it (without copy pasting code)? It would be good to be able to solve this for all sorts of POSIX command line tools rather than reinvent the wheel every time. Especially with proper support for mixed locales (as you are considering it seems, and I use, but few others care about it).
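
As a sketch of what the parsing half of such a crate might contain: POSIX-style locale names have the shape `language[_territory][.codeset][@modifier]`, so they can be split with the standard library alone. The type and function names below are hypothetical, and converting the result to an identifier that icu4x understands is not covered.

```rust
/// Parts of a POSIX-style locale name such as "sv_SE.UTF-8" or "sr_RS@latin".
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
pub struct PosixLocale {
    /// Language code, or a special name such as "C" or "POSIX".
    pub language: String,
    pub territory: Option<String>,
    pub codeset: Option<String>,
    pub modifier: Option<String>,
}

/// Split `s` at the first `sep`, returning the head and the optional tail.
fn split_suffix(s: &str, sep: char) -> (&str, Option<String>) {
    match s.split_once(sep) {
        Some((head, tail)) => (head, Some(tail.to_string())),
        None => (s, None),
    }
}

/// Parse `language[_territory][.codeset][@modifier]`.
/// Returns `None` if the language part is empty.
pub fn parse_posix_locale(name: &str) -> Option<PosixLocale> {
    // Peel the optional parts off from the right: modifier, codeset, territory.
    let (rest, modifier) = split_suffix(name, '@');
    let (rest, codeset) = split_suffix(rest, '.');
    let (language, territory) = split_suffix(rest, '_');

    if language.is_empty() {
        return None;
    }
    Some(PosixLocale {
        language: language.to_string(),
        territory,
        codeset,
        modifier,
    })
}
```

For example, `parse_posix_locale("sv_SE.UTF-8")` yields language `sv`, territory `SE` and codeset `UTF-8`, while `C.UTF-8` has only a language part and a codeset.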
> Did you figure out a standard location to store the data?
Not yet. We should start talking to some people about that :)
> What about translations, icu4x seems to handle everything except for LC_MESSAGES? Or am I missing something?
Translations are out of scope for us for a while, I think, but if you want them, Project Fluent is the gold standard there.
> Could you consider putting the logic for locale env parsing, etc in a separate crate rather than uucore, so other projects outside of uutils can reuse it (without copy pasting code)?
If there is a significant amount of code, it should definitely go in a separate crate.
> Especially with proper support for mixed locales (as you are considering it seems, and I use, but few others care about it).
Yeah I think we should support mixed locales. At least, if by mixed locale you mean that for example collation is done in one locale and number formatting in another or something like that. icu4x can do all of that I believe.
> Yeah I think we should support mixed locales. At least, if by mixed locale you mean that for example collation is done in one locale and number formatting in another or something like that. icu4x can do all of that I believe.
Exactly. I use LC_MESSAGES in English (for searchability and because translations tend to be poor), but I use sv_SE.UTF-8 for everything else, except for collation, where I prefer C.UTF-8 for case-sensitive sorting.
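
To show how a mixed setup like this resolves per category, here is a small sketch that applies the POSIX precedence (`LC_ALL`, then the category variable, then `LANG`) to each category in turn. The function name `effective_locales` is hypothetical, and `en_US.UTF-8` in the usage note is only an assumed stand-in for "English".

```rust
use std::collections::{BTreeMap, HashMap};

/// The locale categories defined by POSIX.
const CATEGORIES: [&str; 6] = [
    "LC_COLLATE",
    "LC_CTYPE",
    "LC_MESSAGES",
    "LC_MONETARY",
    "LC_NUMERIC",
    "LC_TIME",
];

/// Compute the effective locale name for every category from a snapshot
/// of the environment. (Hypothetical helper, for illustration only.)
pub fn effective_locales(env: &HashMap<&str, &str>) -> BTreeMap<&'static str, String> {
    // Look up a variable, treating empty values as unset.
    let get = |name: &str| env.get(name).copied().filter(|v| !v.is_empty());

    CATEGORIES
        .iter()
        .map(|&category| {
            let value = get("LC_ALL")
                .or_else(|| get(category))
                .or_else(|| get("LANG"))
                // Nothing set at all: fall back to the "C" locale.
                .unwrap_or("C");
            (category, value.to_string())
        })
        .collect()
}
```

With `LANG=sv_SE.UTF-8`, `LC_MESSAGES=en_US.UTF-8` and `LC_COLLATE=C.UTF-8`, this gives `C.UTF-8` for collation, `en_US.UTF-8` for messages and `sv_SE.UTF-8` for the remaining categories, so each locale-aware function can ask for its own category independently.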