coreutils
coreutils copied to clipboard
Improving `basenc` / `base32` / `base64`
I've identified a few issues with the encoding/decoding tools that I'm interested in solving:
-
They are not iterative / streaming. The entirety of the input is read first before it is all decoded. One the one hand, this allows you to detect errors upfront before you've output anything. But it means that these tools can't be used for iterative computations (e.g. over
whileloops) and they may consume large amounts of memory. The GNU implementations of these tools are streaming as well, so I'd classify this as an incompatibility bug. -
Significant optimization potential is available. I have a fair bit of experience with SIMD and I've designed AVX2 and AVX512 algorithms for Base32 / Base64 encoding - it's difficult but I would like to see if there are performance benefits to it, particularly for large inputs. Optimizations can be added iteratively, and can be feature-gated or even based on runtime CPU feature detection.
-
Compiled binaries for
base32/base64, on my machine, currently take up 23 MiB in debug mode and 3 MiB in release mode. This is surprisingly large and I believe the majority of that work lies in the argument parsing rather than the actual encoding. Sincebasenc/base32/base64have essentially the same arguments to deal with, they can be combined.
To this end, here's my proposal:
-
Rewrite
src/uucore/src/lib/features/encoding.rsto expose a streaming rather than one-shot interface. Along the way, I can add a sub-module for each encoding format so that per-format optimization routines can be defined. Runtime CPU detection, if implemented, would be format-generic. -
Setup
basencas a multicall binary so thatbase32/base64are symlinks to it. The argument parsing forbasencwill have to be updated to only expose format-selecting arguments ifargv[0]isbasenc. -
Implement format-specific optimization routines over time, supported by benchmarking. For example, if the user only provides less than 1 KiB of data, a less optimized routine can be used (advanced AVX512 instructions have some interesting effects on CPU clock frequencies, so we should avoid them unless a large amount of input needs to be processed).
It may be preferable that I write a new, independent crate for optimized data encoding and create a dependency on that (rather than the current data-encoding crate); the alternative is to eliminate the external dependency completely.
I'd like to ask for prior approval from the uutils community before I start working on this. Note that task 1 and 2 are effectively independent - while I'm more interested in the first, the second seems more important. Maintainers, what do you think?
If the encoding is in a separate crate then I thjnk they should already share a a lot of code and then then the multicall might not be necessary. The only difference would be the construction of the clap command. I think I'd rather keep them as separate crates but with a shared crate underneath.
If we set up that structure first, then we can make that shared crate streaming, because that is indeed a bug. Performance optimizations are welcome, but in my opinion low priority and should not increase the complexity too much. The unoptimized baseline implementation is the most important. So, I'd propose to make it streaming without SIMD etc. first.
If the encoding is in a separate crate then I thjnk they should already share a a lot of code and then then the multicall might not be necessary.
They already share a lot in terms of code, I agree - but due to static linking, the same code is duplicated in every executable. A multicall binary would eliminate that overhead.
Performance optimizations are welcome, but in my opinion low priority and should not increase the complexity too much.
My first priority is the streaming interface, I agree. But I'd like to provide some more complex implementations over time - while providing simple fallbacks, of course. If you don't want that complexity in this repository, the shared crate solution seems best.
Yes a multicall binary would help, but that's what we have the general multicall binary for already. I think multiple levels of multicall binaries is a bit too much. We have this for the checksum utils and I've been wanting to remove that for a while 😄
Alright, then I'll focus on the streaming interface for now. Hopefully better argument parsing systems can bring the per-binary overhead down.
Status update: I've got basic implementations of all the formats together in my crate glassy (documentation). I've got a pretty thorough automated testing setup, but I need to test some failure cases better. Once that's done, I'll polish up the documentation a bit and begin working on a PR integrating it here.
@bal-e i am sorry but I am usually quite reluctant to add a new dependency on brand new crate :( I would prefer you integrate this code directly into the coreutils
@sylvestre There are a few reasons why I think this should be a separate dependency:
-
The code being developed forms a good utility library for any Rust project that needs to perform Base32 encoding/decoding. I already have a fairly stable API for
glassy, and it will not have any dependencies of its own, so I don't believe that usingglassywill incur any significant maintenance burden onuutils. -
Over time, I will be developing
glassyfurther to make it much more performant (this is possible because it only offers the exact formats needed forbasenc- it is not overly flexible).uutilsshould be able to benefit from these performance improvements without having to maintain the code itself. -
With this change, I will be consolidating two existing dependencies of
uutils(data-encodingandz85), so this will actually reduceuutils' total number of dependencies. -
Should
uutilsopt to implement new conversion formats, they can be easily integrated intoglassy.glassyoperates throughEncode/Decodetraits, which anybody can implement, and it exposes all of the utilities its own implementations use for convenience, so new formats can be implemented easily both inside and outsideglassy. I am happy to introduce support for new formats intoglassyor to make the existing formats more flexible.
As it stands, I don't believe that glassy's position as a "brand new crate" should affect how reliable it is. As a standalone Base32/etc. converter, the standards it implements are concrete and unchanging. Even if I were to stop maintaining it, the actual code within will continue working for the foreseeable future. At that stage, it may make sense to fork it and maintain a copy within uutils, but I do plan on maintaining it thoroughly.
At the moment, even the relatively simple (but complete) implementations of the formats needed by basenc add up to two thousand lines of code (three thousand with documentation), and this will increase as I provide more performant implementations. Do you believe that the burden of maintaining this code belongs in uutils?
Yeah, we could move it under the uutils umbrella. Otherwise, I am quite reluctant to add a dependency like this despite all its qualities. I am here to maintain rust-coreutils for the years to come and I am careful when adding new dependencies.
I understand that moving glassy under the uutils organization will make it easier for you to maintain it in the future. But I would appreciate it if you could use the crate as published from my repository. If I ever stop maintaining it, feel free to fork it here.
P.S. The z85 crate that uutils currently depends on was last updated on April 12, 2022. Using the crate I have published may not be perfect from a maintenance perspective, but it should certainly be a better choice than relying on 2-year old code.