bazel icon indicating copy to clipboard operation
bazel copied to clipboard

Add starlark toml module

Open keith opened this issue 1 month ago • 19 comments

This module is analogous to json but for TOML. It supports toml.decode and toml.encode. This is useful for the increasing number of places where toml is used over json / yaml, such as Cargo.toml or uv.lock. Unlike bazel's json support this module depends on a separate open source project for actually parsing and encoding toml support. Most of this change is marshaling these objects between starlark and java representations, which isn't done in the same way anywhere else I could find.

keith avatar Nov 12 '25 20:11 keith

Ha, I needed this for rules_rs and debated between adding bazel support, using toml2json wasm, and prebuilts. Ended up going with prebuilts because of wasm perf issues, but having built-in support would be neat!

dzbarsky avatar Nov 12 '25 22:11 dzbarsky

Many many times have I wished for something like this in rules_rust. It would help immensely

UebelAndre avatar Nov 13 '25 02:11 UebelAndre

FYI, all Google java projects that I've seen use TomlJ for their toml parsing/formatting. So we'd prefer to use that in Bazel too unless there is a strong reason for picking jackson's implementation.

tetromino avatar Nov 13 '25 22:11 tetromino

FYI, all Google java projects that I've seen use TomlJ for their toml parsing/formatting. So we'd prefer to use that in Bazel too unless there is a strong reason for picking jackson's implementation.

TomlJ Doesn't support serialization:

  • https://github.com/tomlj/tomlj/issues/59

And appears to be in a maintenance mode:

  • https://github.com/tomlj/tomlj/issues/44

edit: Some context, I only chimed in to provide information. I definitely want some serialization support in a potential toml module for Bazel, but I don't care what library is used 😄

UebelAndre avatar Nov 13 '25 22:11 UebelAndre

yea i tried that one first actually and those where the reasons ^ are you on that version or a fork?

keith avatar Nov 13 '25 22:11 keith

@UebelAndre - thanks, noted. The fact that it's unmaintained is certainly a strike against.

That said: we need a library for toml parsing, but I'm not sure whether we need a library for toml serialization. Bazel already does custom serialization for textproto (https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/packages/Proto.java) and for json. While parsing toml is a pain due to all the different string variations and ways of specifying keys, I don't think it's hard to lossily serialize (and serialization would have to be lossy anyway, since starlark doesn't support some toml data types, such as dates/times).

tetromino avatar Nov 13 '25 23:11 tetromino

i did consider manually serializing before i ended up switching to this library, so if that is a path we have to go down i agree it would probably be pretty fine. obviously the hardest part is various escaping sequences

keith avatar Nov 13 '25 23:11 keith

I think we need to determine what java toml libraries exist, how active / maintained they are, and how much they'd add to the bazel binary size. (And hopefully we find something that doesn't pull in java.desktop.)

tetromino avatar Nov 13 '25 23:11 tetromino

yea besides binary size i generally did that search already, these are the libraries i could find, including all the ones listed on https://github.com/toml-lang/toml/wiki

  • unmaintained: https://github.com/mwanji/toml4j
  • primarily 1 person, low traffic: https://github.com/TheElectronWill/Night-Config
  • discussed above: https://github.com/tomlj/tomlj
  • pretty new, seems like low usage so far: https://github.com/WasabiThumb/jtoml
  • the one i landed on as it seems to be very widely used https://github.com/FasterXML/jackson-dataformats-text

keith avatar Nov 13 '25 23:11 keith

@keith Could you please take a look at the failing checks?

bharadwaj08-one avatar Nov 19 '25 09:11 bharadwaj08-one

@tetromino ptal. I moved this back to tomlj and manually encoded things

keith avatar Dec 02 '25 18:12 keith

yea besides binary size i generally did that search already, these are the libraries i could find, including all the ones listed on https://github.com/toml-lang/toml/wiki

This is an accurate analysis. jackson-dataformats-toml is a great minimal TOML implementation with adequate test coverage and wide adoption, constituting a great upgrade from tomlj should you get it to work. In terms of binary size, jtoml (my library) is about 3x larger. This is because it exposes a rich type and key system, handles comments, etc. Think Gson for TOML. It's probably overkill for your use case, but there's the relevant distinction.

The date issue is inherent to the pursuit of TOML-JSON exchange, as JSON simply lacks a date/time type. If data integrity is not paramount, date/times should probably be stringified. Otherwise, I would point out the PARSE_JAVA_TIME option which already exists in jackson-dataformats-toml. Perhaps this is something you can leverage?

WasabiThumb avatar Dec 08 '25 11:12 WasabiThumb

@keith, sorry for delay, I have been getting a lot of pushback internally for a adding jackson-dataformats-text dep to Bazel (in part because my coworkers would be on the hook to maintain it in the google monorepo forever). So the decision to use tomlj certainly makes this PR more palatable :)

I am wondering if you have a concrete use case in mind for toml decoding. If it's just nice-to-have but not necessary for what you are doing, I'd leave it out so as to avoid any new dependencies. (For example, Starlark in Bazel has an encoder/decoder for json, but encoder-only for textproto.)

tetromino avatar Dec 09 '25 20:12 tetromino

I'm not Keith but fwiw we need toml decoding for both rules_rs (rust) and rules_py to parse lockfiles coming from language-specific package managers. We are currently shipping prebuilts for limited platforms but having native capability would be very useful to us.

dzbarsky avatar Dec 09 '25 20:12 dzbarsky

I think the rust + python examples are good, I think in general there are just a lot of cases where you might want some sort of config file to feed into a rule / repository rule, and the format may not be up to you, like in the case of Cargo.toml etc. So you potentially have to shell out to another language for something that would otherwise be trivial.

keith avatar Dec 09 '25 20:12 keith

I think the rust + python examples are good, I think in general there are just a lot of cases where you might want some sort of config file to feed into a rule / repository rule, and the format may not be up to you, like in the case of Cargo.toml etc. So you potentially have to shell out to another language for something that would otherwise be trivial.

@dzbarsky - am I understanding correctly that you need either support in Starlark or a prebuilt binary because you're doing the decoding in a repo rule?

If so, that's a very good argument for having a decoder.

tetromino avatar Dec 09 '25 20:12 tetromino

I think the rust + python examples are good, I think in general there are just a lot of cases where you might want some sort of config file to feed into a rule / repository rule, and the format may not be up to you, like in the case of Cargo.toml etc. So you potentially have to shell out to another language for something that would otherwise be trivial.

@dzbarsky - am I understanding correctly that you need either support in Starlark or a prebuilt binary because you're doing the decoding in a repo rule?

If so, that's a very good argument for having a decoder.

Yes, exactly, this work happens in module extensions and repo rules, so we either need a native solution, fast wasm support (I tried and it wasn't fast), or prebuilt binaries.

dzbarsky avatar Dec 09 '25 21:12 dzbarsky

I don't want to deter progress on this pull request, but I wanted to note that I'm doing initial investigations for a pure starlark toml encoder/decoder based on python's tomli. A starlark-ified version without too many limitations could be possible.

Let me see what I can publish by end of the week/next. Hopefully the limitations are not too many/critical, and can satisfy the use cases mentioned here.

willstranton avatar Dec 09 '25 21:12 willstranton

I don't want to deter progress on this pull request, but I wanted to note that I'm doing initial investigations for a pure starlark toml encoder/decoder based on python's tomli. A starlark-ified version without too many limitations could be possible.

Let me see what I can publish by end of the week/next. Hopefully the limitations are not too many/critical, and can satisfy the use cases mentioned here.

One of the drivers here is performance. As you're developing feel free to check out https://github.com/dzbarsky/wasm-repro and run the benchmarks there. The rust version was fast enough and the python one was not. (Wasm was same speed as python). I wouldn't be surprised if starlark ended up close to python.

dzbarsky avatar Dec 09 '25 21:12 dzbarsky

One of the drivers here is performance.... As you're developing feel free to check out...

@dzbarsky See the proof of concept here https://github.com/willstranton/wasm-repro/tree/starlark_toml_poc

When I ran python against starlark, I got the following numbers:

$ time bazel build @tomli-starlark//...
...
real    0m0.820s
user    0m0.015s
sys     0m0.015s

$ time python3 convert.py
...
real    0m0.176s
user    0m0.157s
sys     0m0.019s

The rust version was fast enough and the python one was not.... I wouldn't be surprised if starlark ended up close to python.

Are those results too slow since you said even python wasn't fast enough? The starlark is about 7-8x python's time, but at least it's better than what I'm getting with wasm:

$ time bazel build @wasm-repro//...
...
real    0m4.976s
user    0m0.014s
sys     0m0.018s

Give it a try yourself - I'm not exactly using very beefy hardware at the moment. So I'm curious if you'd see better numbers.


@dzbarsky @keith @tetromino Do you think this is worth pursuing further? Or is the current starlark toml module (this pull requests we're commenting on) preferred?

If you think the pure starlark implementation would be useful/used, I can put it into a proper repo and clean it up.

willstranton avatar Dec 12 '25 11:12 willstranton

Timing wasm isnt just timing the build, it's running the wazero binary.

From my POV if rules_rs extension evaluation goes from 100ms to 1 second that's pretty painful.

I personally son't see a good reason to invest in a starlark module (except as a temporary polyfill?) if we have agreement on native support

dzbarsky avatar Dec 12 '25 13:12 dzbarsky

Timing wasm isnt just timing the build, it's running the wazero binary.

Oops, I forgot you also had it run twice, so you're right - the numbers given aren't being compared fairly. I still use the same comparative numbers below, so be aware future readers!


I checked out this pull request to time it as well (ran on a mac, so time output is different...). Built with bazel build --stamp --embed_label=9.0.0 //src:bazel (in case some optimization flag is missing).

In order of increasing time, and run 3x each are the python, tomlj (this pull request), tomli-starlark (pure starlark implementation), and wasm times:

/usr/bin/time python convert.py
        0.19 real         0.17 user         0.01 sys
        0.20 real         0.18 user         0.01 sys
        0.18 real         0.17 user         0.01 sys
/usr/bin/time bazel-toml build @tomlj//...
        0.47 real         0.22 user         0.01 sys
        0.48 real         0.22 user         0.01 sys
        0.46 real         0.22 user         0.01 sys
/usr/bin/time bazel-toml build @tomli-starlark//...
        0.75 real         0.22 user         0.01 sys
        0.75 real         0.22 user         0.01 sys
        0.76 real         0.23 user         0.01 sys
# Changed to run the execution once.
/usr/bin/time bazel-toml build @wasm-repro//...
        2.47 real         0.24 user         0.05 sys
        2.37 real         0.25 user         0.02 sys
        2.38 real         0.24 user         0.02 sys

So the pure starlark is ~1.5x slower than the native bazel tomlj.

Feel free to run yourself with https://github.com/willstranton/wasm-repro/commit/d397ac674de1f1cb2d8b9a076bb13dbe57100c01 and by building bazel from keith's repo: https://github.com/keith/bazel/commit/50d27b36e571e4a613ef0cce00b515eb0210a387

From my POV if rules_rs extension evaluation goes from 100ms to 1 second that's pretty painful.

Ok - now it looks like it would go from 100ms to either ~460 milliseconds (bazel native), or ~750 milliseconds (starlark native)

I personally [don't] see a good reason to invest in a starlark module (except as a temporary polyfill?) if we have agreement on native support

Fair enough. I'm not too familiar with the use cases and stakeholders here but my impressions for favoring a pure starlark module are:

  • Keep bazel binary minimal (tomlj jar on maven is 175kb)
  • Decouple logic from Bazel release (is this similar to how various languagerules are being moved out of Bazel?) - so faster? to hack on it (but I guess this doesn't matter, as toml features shouldn't be changing as much as language rules do)

If performance is high on the list of priorities, then I agree that making changes in the Bazel codebase seems to be the only way to achieve that reliably.

willstranton avatar Dec 12 '25 19:12 willstranton

Ok - now it looks like it would go from 100ms to either ~460 milliseconds (bazel native), or ~750 milliseconds (starlark native)

This is the actual timing for the native solution I shipped in rules_rs in the end

➜ wasm-repro git:(main) time ./toml2json/target/release/toml2json Cargo.lock >/dev/null ./toml2json/target/release/toml2json Cargo.lock > /dev/null 0.02s user 0.01s system 89% cpu 0.026 total

Anyway, let's stop spamming this PR :) I think benchmarks can be added and performance can be improved in future on the Java side, the starlark will be a lot tougher to optimize

dzbarsky avatar Dec 12 '25 20:12 dzbarsky