icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Rethink Datagen config file

Open sffc opened this issue 2 years ago • 16 comments

Currently it is JSON which is not super great as a config file format because it doesn't support comments.

ICU4C uses Hjson. There is a Rust Hjson crate that needs a new maintainer. I have contributed to the Hjson project before and would be willing to be that maintainer if ICU4X wants to use it.

I really don't like YAML but JSON is valid YAML so it's a forwards-compatible change to use YAML.

Or we could use TOML since Rust uses a lot of TOML. However, we'd need to make that change before 1.3 since JSON is not valid TOML.

CC @robertbastian

sffc avatar Aug 29 '23 17:08 sffc

I tried TOML but I don't think it can do enums or something

robertbastian avatar Aug 29 '23 19:08 robertbastian

JSON:

{
  "keys": {
    "explicit": [
      "core/helloworld@1",
      "fallback/likelysubtags@1",
      "fallback/parents@1",
      "fallback/supplement/co@1"
    ]
  },
  "fallback": "runtimeManual",
  "locales": "all",
  "segmenterModels": ["burmesedict"],
  "additionalCollations": ["big5han"],


  "cldr": "latest",
  "icuExport": "73.1",
  "segmenterLstm": "none",

  "export": {
    "blob": {
      "path": "blob.postcard"
    }
  },
  "overwrite": true
}

YAML:

keys:
  explicit:
  - core/helloworld@1
  - fallback/likelysubtags@1
  - fallback/parents@1
  - fallback/supplement/co@1
fallback: runtimeManual
locales: all
segmenterModels:
- burmesedict
additionalCollations:
- big5han
cldr: latest
icuExport: '73.1'
segmenterLstm: none
export:
  blob:
    path: blob.postcard
overwrite: true

TOML:

keys.explicit = [
  "core/helloworld@1",
  "fallback/likelysubtags@1",
  "fallback/parents@1",
  "fallback/supplement/co@1"
]
fallback = "runtimeManual"
locales = "all"
segmenterModels = [ "burmesedict" ]
additionalCollations = [ "big5han" ]
cldr = "latest"
icuExport = "73.1"
segmenterLstm = "none"
export.blob.path = "blob.postcard"
overwrite = true

sffc avatar Aug 29 '23 19:08 sffc

I'm not sure how to model the TOML. https://stackoverflow.com/a/57560842. Without value enums users will be able to do things like

keys.explicit = ["core/helloworld@1"]
keys.all = true
keys.none = true

robertbastian avatar Aug 30 '23 05:08 robertbastian

I think YAML is the better solution for human-friendly JSON. HSJON is niche, so users won't have linting, formatting etc. in their IDEs.

robertbastian avatar Aug 30 '23 07:08 robertbastian

I think we should spend a little more time trying to get TOML to work. These are mostly flat options, not deeply nested hierarchies.

sffc avatar Aug 31 '23 18:08 sffc

Locales, keys, and exporters are all nested, and those are the three things all users will have to set. Everyone will understand a nested JSON or YAML structure when they see it, why jump through hoops to make it work in TOML? I don't think "it's the Rusty config format" is a strong enough argument.

robertbastian avatar Aug 31 '23 19:08 robertbastian

We already have a user interface for datagen options in the form of the datagen CLI. How about making the TOML file be the same schema as the datagen CLI?

# Explicit:
keys = ["core/helloworld@1"]
locales = ["en", "ru"]

# Sets:
keys = "recommended"
locales = "full"

There's also the argfile crate which integrates directly with Clap. Each token goes on its own line.

--keys
core/helloworld@1
--locales
en
ru

sffc avatar Aug 31 '23 19:08 sffc

The point of the config file is to store things in a more hierarchical format, not to put the flat list of CLI arguments on disk.

robertbastian avatar Aug 31 '23 19:08 robertbastian

I think there's a philosophical question to be answered here. We have two datagen APIs, one with nested Rust structures and one on CLI. Now we're building a third, the config file. There is no universally agreed upon mapping between Rust structures and JSON, though there are some conventions that the serde_json crate has established. CLI options are themselves already a serialization format for these options. So, an argument could be made that the config file format should model the CLI instead of the Rust API.

sffc avatar Aug 31 '23 19:08 sffc

I also think there's a case to be made that the CLI options are more understandable by average i18n users. Datagen is a tool that clients will need to use even if they aren't Rust programmers. It's easier to understand the CLI than it is to understand either the Rust API or the somewhat-esoteric JSON that gets generated from it.

Putting datagen CLI options flat in an args file is not very elegant from an API design point of view, but it is extremely easy for people to use and figure out.

sffc avatar Aug 31 '23 20:08 sffc

The CLI options are bad, the API is too complex to model as a flat list of options. There are many options that are only valid in combination with/absence of other options, and in JSON we could model this. We also have baggage like FS exporter being called dir, baked exporter being called mod, --syntax being a suboption to FS exporter, etc.

robertbastian avatar Sep 01 '23 10:09 robertbastian

I still think we can get decent TOML behavior by making pub struct Config use flat options. Maybe even just re-use args::Cli to reduce maintenance cost.

There are many options that are only valid in combination with/absence of other options, and in JSON we could model this

Not completely. It's still valid to write the following in JSON; it will fail in deserialization, but we can make the TOML do that too, or just fail when processing the options:

{
  "export": {
    "blob": {
      "path": "blob.postcard"
    },
    "baked": {
      "path": "baked"
    }
  },
}

Where I'm not sold is that although serde_json establishes a convention for handling dataful enums, the syntax itself does not know about it. We can get behavior that is the same or better by just tweaking the Config struct to suit whichever syntax we land on.

sffc avatar Sep 04 '23 21:09 sffc

Discussion:

  • @zbraniecki I'd prefer TOML or HJSON
  • @sffc It's annoying that we basically have three APIs
  • @robertbastian config is tricky to document
  • @sffc can use the clap argfile thing
  • @robertbastian why not just save the CLI invocation to a bash script?
  • @sffc windows has a command limit of ~4000 characters
  • @robertbastian can we solve this when people actually run into this problem
  • @zbraniecki there's clap_serde
  • @sffc that solves a different problem, it defines the clap app itself

Conclusion: Make the current config a hidden non-positional argument in 1.3, and remove it from it documentation

Consensus: @robertbastian @sffc (@echeran)

robertbastian avatar Sep 14 '23 18:09 robertbastian

Some additional notes from discussion with @epage, @Manishearth:

  • No really sensible solution exists for generating a config file from a CLI or vice-versa, although there are some ecosystem crates that attempt it (don't remember the names). Best to just design your own.
  • Argfile is a choice. It doesn't appear that it supports comments (neither the microsoft nor the python argfile syntax).
    • However, argfile has a fairly straightforward plugin interface so we could write our own argfile parser that skips lines prefixed with # for example. (This is fine: the stock argfile parser is 4 lines of code, literally content.lines().map(move |l| crate::Argument::parse_ref(l, prefix)).collect())
  • When you have a config file, you need to be careful about who wins conflicts between the CLI and the config file, a fairly nontrivial problem. A common use case here might be the --verbose flag.
    • I think Argfile makes this a little more straightforward because you can list them out as @args.txt CLI arguments that Clap automatically picks up.

sffc avatar Sep 14 '23 19:09 sffc

Hmm, another option here could be that our argfile plugin could itself be a TOML parser where we parse freeform keys and values, like this:

# args.toml
a = "b"
x = ["y", "z"]
$ icu4x-datagen --foo bar @args.toml -v

# after processing:
$ icu4x-datagen --foo bar --a b --x y z -v

sffc avatar Sep 14 '23 20:09 sffc

Putting this on backlog; awaiting additional user feedback on how we can make this better.

sffc avatar Nov 02 '23 17:11 sffc