icu4x
icu4x copied to clipboard
Rethink Datagen config file
Currently it is JSON which is not super great as a config file format because it doesn't support comments.
ICU4C uses Hjson. There is a Rust Hjson crate that needs a new maintainer. I have contributed to the Hjson project before and would be willing to be that maintainer if ICU4X wants to use it.
I really don't like YAML but JSON is valid YAML so it's a forwards-compatible change to use YAML.
Or we could use TOML since Rust uses a lot of TOML. However, we'd need to make that change before 1.3 since JSON is not valid TOML.
CC @robertbastian
I tried TOML but I don't think it can do enums or something
JSON:
{
"keys": {
"explicit": [
"core/helloworld@1",
"fallback/likelysubtags@1",
"fallback/parents@1",
"fallback/supplement/co@1"
]
},
"fallback": "runtimeManual",
"locales": "all",
"segmenterModels": ["burmesedict"],
"additionalCollations": ["big5han"],
"cldr": "latest",
"icuExport": "73.1",
"segmenterLstm": "none",
"export": {
"blob": {
"path": "blob.postcard"
}
},
"overwrite": true
}
YAML:
keys:
explicit:
- core/helloworld@1
- fallback/likelysubtags@1
- fallback/parents@1
- fallback/supplement/co@1
fallback: runtimeManual
locales: all
segmenterModels:
- burmesedict
additionalCollations:
- big5han
cldr: latest
icuExport: '73.1'
segmenterLstm: none
export:
blob:
path: blob.postcard
overwrite: true
TOML:
keys.explicit = [
"core/helloworld@1",
"fallback/likelysubtags@1",
"fallback/parents@1",
"fallback/supplement/co@1"
]
fallback = "runtimeManual"
locales = "all"
segmenterModels = [ "burmesedict" ]
additionalCollations = [ "big5han" ]
cldr = "latest"
icuExport = "73.1"
segmenterLstm = "none"
export.blob.path = "blob.postcard"
overwrite = true
I'm not sure how to model the TOML. https://stackoverflow.com/a/57560842. Without value enums users will be able to do things like
keys.explicit = ["core/helloworld@1"]
keys.all = true
keys.none = true
I think YAML is the better solution for human-friendly JSON. HSJON is niche, so users won't have linting, formatting etc. in their IDEs.
I think we should spend a little more time trying to get TOML to work. These are mostly flat options, not deeply nested hierarchies.
Locales, keys, and exporters are all nested, and those are the three things all users will have to set. Everyone will understand a nested JSON or YAML structure when they see it, why jump through hoops to make it work in TOML? I don't think "it's the Rusty config format" is a strong enough argument.
We already have a user interface for datagen options in the form of the datagen CLI. How about making the TOML file be the same schema as the datagen CLI?
# Explicit:
keys = ["core/helloworld@1"]
locales = ["en", "ru"]
# Sets:
keys = "recommended"
locales = "full"
There's also the argfile crate which integrates directly with Clap. Each token goes on its own line.
--keys
core/helloworld@1
--locales
en
ru
The point of the config file is to store things in a more hierarchical format, not to put the flat list of CLI arguments on disk.
I think there's a philosophical question to be answered here. We have two datagen APIs, one with nested Rust structures and one on CLI. Now we're building a third, the config file. There is no universally agreed upon mapping between Rust structures and JSON, though there are some conventions that the serde_json crate has established. CLI options are themselves already a serialization format for these options. So, an argument could be made that the config file format should model the CLI instead of the Rust API.
I also think there's a case to be made that the CLI options are more understandable by average i18n users. Datagen is a tool that clients will need to use even if they aren't Rust programmers. It's easier to understand the CLI than it is to understand either the Rust API or the somewhat-esoteric JSON that gets generated from it.
Putting datagen CLI options flat in an args file is not very elegant from an API design point of view, but it is extremely easy for people to use and figure out.
The CLI options are bad, the API is too complex to model as a flat list of options. There are many options that are only valid in combination with/absence of other options, and in JSON we could model this. We also have baggage like FS exporter being called dir, baked exporter being called mod, --syntax being a suboption to FS exporter, etc.
I still think we can get decent TOML behavior by making pub struct Config use flat options. Maybe even just re-use args::Cli to reduce maintenance cost.
There are many options that are only valid in combination with/absence of other options, and in JSON we could model this
Not completely. It's still valid to write the following in JSON; it will fail in deserialization, but we can make the TOML do that too, or just fail when processing the options:
{
"export": {
"blob": {
"path": "blob.postcard"
},
"baked": {
"path": "baked"
}
},
}
Where I'm not sold is that although serde_json establishes a convention for handling dataful enums, the syntax itself does not know about it. We can get behavior that is the same or better by just tweaking the Config struct to suit whichever syntax we land on.
Discussion:
- @zbraniecki I'd prefer TOML or HJSON
- @sffc It's annoying that we basically have three APIs
- @robertbastian config is tricky to document
- @sffc can use the clap argfile thing
- @robertbastian why not just save the CLI invocation to a bash script?
- @sffc windows has a command limit of ~4000 characters
- @robertbastian can we solve this when people actually run into this problem
- @zbraniecki there's clap_serde
- @sffc that solves a different problem, it defines the clap app itself
Conclusion: Make the current config a hidden non-positional argument in 1.3, and remove it from it documentation
Consensus: @robertbastian @sffc (@echeran)
Some additional notes from discussion with @epage, @Manishearth:
- No really sensible solution exists for generating a config file from a CLI or vice-versa, although there are some ecosystem crates that attempt it (don't remember the names). Best to just design your own.
- Argfile is a choice. It doesn't appear that it supports comments (neither the microsoft nor the python argfile syntax).
- However, argfile has a fairly straightforward plugin interface so we could write our own argfile parser that skips lines prefixed with
#for example. (This is fine: the stock argfile parser is 4 lines of code, literallycontent.lines().map(move |l| crate::Argument::parse_ref(l, prefix)).collect())
- However, argfile has a fairly straightforward plugin interface so we could write our own argfile parser that skips lines prefixed with
- When you have a config file, you need to be careful about who wins conflicts between the CLI and the config file, a fairly nontrivial problem. A common use case here might be the
--verboseflag.- I think Argfile makes this a little more straightforward because you can list them out as
@args.txtCLI arguments that Clap automatically picks up.
- I think Argfile makes this a little more straightforward because you can list them out as
Hmm, another option here could be that our argfile plugin could itself be a TOML parser where we parse freeform keys and values, like this:
# args.toml
a = "b"
x = ["y", "z"]
$ icu4x-datagen --foo bar @args.toml -v
# after processing:
$ icu4x-datagen --foo bar --a b --x y z -v
Putting this on backlog; awaiting additional user feedback on how we can make this better.