prost
Investigate potential x86 varint optimization
https://www.reddit.com/r/rust/comments/f36j05/comment/fhhwqp9
See also https://github.com/gnzlbg/bitintr for safe, cross-platform wrappers over the intrinsics
https://news.ycombinator.com/item?id=25183811
https://www.reddit.com/r/rust/comments/klck6a/i_published_my_first_crate_varintsimd/
So I did some quick-and-dirty prototyping with varint-simd v0.3.0, and here's what I found:
- Microbenchmark varint performance is only improved for encoding and decoding larger numbers
- Encoding performance generally fares better than decoding performance, depending on where I place the branch that uses varint-simd
- Macrobenchmark performance is mostly a wash (tested on Coffee Lake), with some larger wins and some smaller losses
This is probably because the only encode/decode function operates on single u64s, which is currently a weak point for varint-simd (it's not that much faster than other implementations when decoding/encoding tiny u64s).
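For context, the branch I experimented with sits in a scalar single-u64 encode loop that looks roughly like this (a simplified sketch, not prost's actual code); the commented-out fast path marks where a varint-simd call could go, with `simd_encode_u64` standing in as a hypothetical wrapper:

```rust
/// Simplified sketch of a scalar LEB128 varint encoder (not prost's actual
/// implementation). The commented-out branch marks where a SIMD fast path
/// for larger values could be placed; `simd_encode_u64` is a hypothetical
/// wrapper around whatever single-value entry point varint-simd provides.
fn encode_varint(mut value: u64, buf: &mut Vec<u8>) {
    // Hypothetical fast-path branch: 1-2 byte varints are cheap to encode
    // in the scalar loop, so only larger values would be handed to SIMD.
    // if value >= (1 << 14) {
    //     return simd_encode_u64(value, buf);
    // }
    while value >= 0x80 {
        // Emit the low 7 bits with the continuation bit set.
        buf.push((value as u8 & 0x7f) | 0x80);
        value >>= 7;
    }
    // Final byte has the continuation bit clear.
    buf.push(value as u8);
}
```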
I suspect there will need to be some larger-scale refactoring to take full advantage of varint-simd. For example, protobuf tags are up to 32 bits long, so a lot of cycles can be saved when encoding/decoding those.
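To make the 32-bit point concrete: field numbers are capped at 2^29 - 1 and the wire type takes 3 bits, so every tag fits in a u32. A quick sketch of the packing:

```rust
/// Sketch of protobuf tag packing: the field number (at most 29 bits) is
/// shifted left by 3 and OR'd with the 3-bit wire type, so every tag fits
/// in a u32 and a 32-bit varint fast path is enough to cover it.
fn make_tag(field_number: u32, wire_type: u32) -> u32 {
    debug_assert!(field_number >= 1 && field_number < (1 << 29));
    debug_assert!(wire_type < 8);
    (field_number << 3) | wire_type
}
```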
My library also just added support for quickly decoding two, four, and eight adjacent varints in parallel (subject to size limitations), with some really good throughput figures. Most of the time, a protobuf message is a 32-bit tag followed by a 32-bit number or length, and the decode request can be shrunk based on how large the data field is declared in the .proto file. So there are likely a lot more gains to be had.
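For illustration, here is the tag-then-value pattern in scalar terms (a sketch, not code from prost or varint-simd); a paired SIMD decode can replace the two back-to-back scalar calls when both varints are known to fit in 32 bits:

```rust
/// Scalar sketch of the common "tag followed by value or length" pattern.
/// A paired SIMD decode can replace the two back-to-back scalar calls when
/// both varints fit in 32 bits (the casts below assume they do).
fn decode_tag_then_value(buf: &[u8]) -> Option<(u32, u32, usize)> {
    let (tag, tag_len) = decode_varint(buf)?;
    let (value, value_len) = decode_varint(&buf[tag_len..])?;
    Some((tag as u32, value as u32, tag_len + value_len))
}

/// Plain scalar LEB128 decoder used by the sketch above.
fn decode_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    // Ran out of bytes or the varint exceeded 10 bytes.
    None
}
```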