Add Starlark `toml` module

This module is analogous to `json`, but for TOML. It supports
`toml.decode` and `toml.encode`. This is useful for the increasing
number of places where TOML is used instead of JSON or YAML, such as
`Cargo.toml` or `uv.lock`. Unlike Bazel's `json` support, this module
depends on a separate open source project for actually parsing and
encoding TOML. Most of this change is marshaling objects between
Starlark and Java representations, which isn't done in the same way
anywhere else I could find.
Ha, I needed this for rules_rs and debated between adding bazel support, using toml2json wasm, and prebuilts. Ended up going with prebuilts because of wasm perf issues, but having built-in support would be neat!
Many, many times have I wished for something like this in rules_rust. It would help immensely.
FYI, all Google java projects that I've seen use TomlJ for their toml parsing/formatting. So we'd prefer to use that in Bazel too unless there is a strong reason for picking jackson's implementation.
> FYI, all Google java projects that I've seen use TomlJ for their toml parsing/formatting. So we'd prefer to use that in Bazel too unless there is a strong reason for picking jackson's implementation.
TomlJ doesn't support serialization:
- https://github.com/tomlj/tomlj/issues/59

and it appears to be in maintenance mode:
- https://github.com/tomlj/tomlj/issues/44
edit:
Some context: I only chimed in to provide information. I definitely want some serialization support in a potential toml module for Bazel, but I don't care which library is used 😄
yea, i tried that one first actually and those were the reasons ^. are you on that version or a fork?
@UebelAndre - thanks, noted. The fact that it's unmaintained is certainly a strike against.
That said: we need a library for toml parsing, but I'm not sure whether we need a library for toml serialization. Bazel already does custom serialization for textproto (https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/packages/Proto.java) and for json. While parsing toml is a pain due to all the different string variations and ways of specifying keys, I don't think it's hard to lossily serialize (and serialization would have to be lossy anyway, since starlark doesn't support some toml data types, such as dates/times).
i did consider manually serializing before i ended up switching to this library, so if that is a path we have to go down i agree it would probably be pretty fine. obviously the hardest part is the various escape sequences.
I think we need to determine what java toml libraries exist, how active / maintained they are, and how much they'd add to the bazel binary size. (And hopefully we find something that doesn't pull in java.desktop.)
yea, besides binary size i generally did that search already. these are the libraries i could find, including all the ones listed on https://github.com/toml-lang/toml/wiki:
- unmaintained: https://github.com/mwanji/toml4j
- primarily 1 person, low traffic: https://github.com/TheElectronWill/Night-Config
- discussed above: https://github.com/tomlj/tomlj
- pretty new, seems like low usage so far: https://github.com/WasabiThumb/jtoml
- the one i landed on, as it seems to be very widely used: https://github.com/FasterXML/jackson-dataformats-text
@keith Could you please take a look at the failing checks?
@tetromino ptal. I moved this back to tomlj and manually encoded things
> yea besides binary size i generally did that search already, these are the libraries i could find, including all the ones listed on https://github.com/toml-lang/toml/wiki
This is an accurate analysis. jackson-dataformats-toml is a great minimal TOML implementation with adequate test coverage and wide adoption, constituting a great upgrade from tomlj should you get it to work. In terms of binary size, jtoml (my library) is about 3x larger. This is because it exposes a rich type and key system, handles comments, etc. Think Gson for TOML. It's probably overkill for your use case, but there's the relevant distinction.
The date issue is inherent to the pursuit of TOML-JSON exchange, as JSON simply lacks a date/time type. If data integrity is not paramount, date/times should probably be stringified. Otherwise, I would point out the PARSE_JAVA_TIME option which already exists in jackson-dataformats-toml. Perhaps this is something you can leverage?
@keith, sorry for the delay, I have been getting a lot of pushback internally for adding a jackson-dataformats-text dep to Bazel (in part because my coworkers would be on the hook to maintain it in the google monorepo forever). So the decision to use tomlj certainly makes this PR more palatable :)
I am wondering if you have a concrete use case in mind for toml decoding. If it's just nice-to-have but not necessary for what you are doing, I'd leave it out so as to avoid any new dependencies. (For example, Starlark in Bazel has an encoder/decoder for json, but encoder-only for textproto.)
I'm not Keith but fwiw we need toml decoding for both rules_rs (rust) and rules_py to parse lockfiles coming from language-specific package managers. We are currently shipping prebuilts for limited platforms but having native capability would be very useful to us.
I think the rust + python examples are good. In general there are just a lot of cases where you might want some sort of config file to feed into a rule / repository rule, and the format may not be up to you, as in the case of Cargo.toml etc. So you potentially have to shell out to another language for something that would otherwise be trivial.
> I think the rust + python examples are good, I think in general there are just a lot of cases where you might want some sort of config file to feed into a rule / repository rule, and the format may not be up to you, like in the case of Cargo.toml etc. So you potentially have to shell out to another language for something that would otherwise be trivial.
@dzbarsky - am I understanding correctly that you need either support in Starlark or a prebuilt binary because you're doing the decoding in a repo rule?
If so, that's a very good argument for having a decoder.
> @dzbarsky - am I understanding correctly that you need either support in Starlark or a prebuilt binary because you're doing the decoding in a repo rule?
> If so, that's a very good argument for having a decoder.
Yes, exactly, this work happens in module extensions and repo rules, so we either need a native solution, fast wasm support (I tried and it wasn't fast), or prebuilt binaries.
I don't want to deter progress on this pull request, but I wanted to note that I'm doing initial investigations for a pure starlark toml encoder/decoder based on python's tomli. A starlark-ified version without too many limitations could be possible.
Let me see what I can publish by end of the week/next. Hopefully the limitations are not too many/critical, and can satisfy the use cases mentioned here.
> I don't want to deter progress on this pull request, but I wanted to note that I'm doing initial investigations for a pure starlark toml encoder/decoder based on python's tomli. A starlark-ified version without too many limitations could be possible.
> Let me see what I can publish by end of the week/next. Hopefully the limitations are not too many/critical, and can satisfy the use cases mentioned here.
One of the drivers here is performance. As you're developing, feel free to check out https://github.com/dzbarsky/wasm-repro and run the benchmarks there. The rust version was fast enough and the python one was not. (Wasm was the same speed as python.) I wouldn't be surprised if starlark ended up close to python.
> One of the drivers here is performance.... As you're developing feel free to check out...
@dzbarsky See the proof of concept here https://github.com/willstranton/wasm-repro/tree/starlark_toml_poc
When I ran python against starlark, I got the following numbers:
```
$ time bazel build @tomli-starlark//...
...
real	0m0.820s
user	0m0.015s
sys	0m0.015s

$ time python3 convert.py
...
real	0m0.176s
user	0m0.157s
sys	0m0.019s
```
> The rust version was fast enough and the python one was not.... I wouldn't be surprised if starlark ended up close to python.
Are those results too slow since you said even python wasn't fast enough? The starlark is about 7-8x python's time, but at least it's better than what I'm getting with wasm:
```
$ time bazel build @wasm-repro//...
...
real	0m4.976s
user	0m0.014s
sys	0m0.018s
```
Give it a try yourself - I'm not exactly using very beefy hardware at the moment. So I'm curious if you'd see better numbers.
@dzbarsky @keith @tetromino Do you think this is worth pursuing further? Or is the current starlark toml module (this pull requests we're commenting on) preferred?
If you think the pure starlark implementation would be useful/used, I can put it into a proper repo and clean it up.
Timing wasm isn't just timing the build, it's running the wazero binary.
From my POV if rules_rs extension evaluation goes from 100ms to 1 second that's pretty painful.
I personally don't see a good reason to invest in a starlark module (except as a temporary polyfill?) if we have agreement on native support.
> Timing wasm isn't just timing the build, it's running the wazero binary.
Oops, I forgot you also had it run twice, so you're right - the numbers given aren't being compared fairly. I still use the same comparative numbers below, so be aware future readers!
I checked out this pull request to time it as well (ran on a mac, so time output is different...). Built with `bazel build --stamp --embed_label=9.0.0 //src:bazel` (in case some optimization flag is missing).
In order of increasing time, and run 3x each are the python, tomlj (this pull request), tomli-starlark (pure starlark implementation), and wasm times:
```
/usr/bin/time python convert.py
        0.19 real         0.17 user         0.01 sys
        0.20 real         0.18 user         0.01 sys
        0.18 real         0.17 user         0.01 sys

/usr/bin/time bazel-toml build @tomlj//...
        0.47 real         0.22 user         0.01 sys
        0.48 real         0.22 user         0.01 sys
        0.46 real         0.22 user         0.01 sys

/usr/bin/time bazel-toml build @tomli-starlark//...
        0.75 real         0.22 user         0.01 sys
        0.75 real         0.22 user         0.01 sys
        0.76 real         0.23 user         0.01 sys

# Changed to run the execution once.
/usr/bin/time bazel-toml build @wasm-repro//...
        2.47 real         0.24 user         0.05 sys
        2.37 real         0.25 user         0.02 sys
        2.38 real         0.24 user         0.02 sys
```
So the pure starlark is ~1.5x slower than the native bazel tomlj.
Feel free to run yourself with https://github.com/willstranton/wasm-repro/commit/d397ac674de1f1cb2d8b9a076bb13dbe57100c01 and by building bazel from keith's repo: https://github.com/keith/bazel/commit/50d27b36e571e4a613ef0cce00b515eb0210a387
> From my POV if rules_rs extension evaluation goes from 100ms to 1 second that's pretty painful.
Ok - now it looks like it would go from 100ms to either ~460 milliseconds (bazel native), or ~750 milliseconds (starlark native)
> I personally don't see a good reason to invest in a starlark module (except as a temporary polyfill?) if we have agreement on native support
Fair enough. I'm not too familiar with the use cases and stakeholders here but my impressions for favoring a pure starlark module are:
- Keep bazel binary minimal (tomlj jar on maven is 175kb)
- Decouple logic from the Bazel release (similar to how various language rules are being moved out of Bazel?), making it faster to hack on (though I guess this doesn't matter much, as toml features shouldn't change as often as language rules do)
If performance is high on the list of priorities, then I agree that making changes in the Bazel codebase seems to be the only way to achieve that reliably.
> Ok - now it looks like it would go from 100ms to either ~460 milliseconds (bazel native), or ~750 milliseconds (starlark native)
This is the actual timing for the native solution I shipped in rules_rs in the end:

```
➜ wasm-repro git:(main) time ./toml2json/target/release/toml2json Cargo.lock > /dev/null
./toml2json/target/release/toml2json Cargo.lock > /dev/null  0.02s user 0.01s system 89% cpu 0.026 total
```
Anyway, let's stop spamming this PR :) I think benchmarks can be added and performance can be improved in the future on the Java side; the starlark will be a lot tougher to optimize.