prost icon indicating copy to clipboard operation
prost copied to clipboard

Investigate potential x86 varint optimization

Open danburkert opened this issue 5 years ago • 4 comments

https://www.reddit.com/r/rust/comments/f36j05/comment/fhhwqp9

danburkert avatar Feb 14 '20 02:02 danburkert

see also https://github.com/gnzlbg/bitintr for safe and cross platform wrappers over the intrinsics

danburkert avatar Feb 14 '20 17:02 danburkert

https://news.ycombinator.com/item?id=25183811

danburkert avatar Nov 23 '20 20:11 danburkert

https://www.reddit.com/r/rust/comments/klck6a/i_published_my_first_crate_varintsimd/

danburkert avatar Dec 27 '20 23:12 danburkert

So I did some quick and dirty prototyping with varint-simd v0.3.0, and here's what I found:

  • Microbenchmark varint performance is only improved for encoding and decoding larger numbers
  • Encoding performance generally fares better than decoding performance, depending on where I place the branch that uses varint-simd
  • Macrobenchmark performance is mostly a wash (tested on Coffee Lake), with some larger wins and some smaller losses

This is probably because the only encode/decode function is for single u64's, which is currently a weak point for varint-simd (it's not that much faster than other implementations when decoding/encoding tiny u64's).

I suspect there will need to be some larger-scale refactoring to take full advantage of varint-simd. For example, protobuf tags are up to 32 bits long, so a lot of cycles can be saved when encoding/decoding those. 

My library also just added support for quickly decoding two, four, and eight adjacent varints in parallel (subject to size limitations), with some really good throughput figures - most of the time, protobufs will be a 32 bit tag followed by a 32 bit number or length, and decode requests can be shrunk based on how large the data field is in the .proto file. So there's likely a lot more gains to be had.

as-com avatar Jan 02 '21 04:01 as-com