proposal for performance improvements

Open sasa1977 opened this issue 2 years ago • 9 comments

Recently I analyzed the encoding performance of this library for my clients, who need to encode somewhat larger messages fairly frequently. After some analysis and experimentation, I was able to reduce the average encoding time from 5.20ms to 1.07ms. Here is a quick bench comparison of the optimized encoder against enif_protobuf and Elixir protobuf:

Name                    ips        average  deviation         median         99th %
enif_protobuf       1303.73        0.77 ms     ±4.71%        0.76 ms        0.81 ms
gpb                  938.15        1.07 ms     ±2.19%        1.07 ms        1.12 ms
protobuf             668.39        1.50 ms    ±10.15%        1.46 ms        2.08 ms

Comparison: 
enif_protobuf       1303.73
gpb                  938.15 - 1.39x slower +0.30 ms
protobuf             668.39 - 1.95x slower +0.73 ms

Originally, gpb was the slowest of the bunch, about 7x slower than enif_protobuf.

I've made two changes, in a hacky way, and I'd like to discuss the possibility of contributing them upstream properly. The changes are:

  1. Switching the MsgDefs representation from a list to a map.

    The case I was benching encodes a deeply nested structure, while MsgDefs contains about 1000 elements. In this case encoding requires frequent sequential scans of the list. Converting the representation to a map reduced the encoding time by about 2.5x (see the sketch after this list).

  2. Using an iolist while building the encoded binary.

    The encoder performs a lot of binary concatenations internally, which requires frequent binary reallocations. Switching to iolists further reduced the encoding time by about 2x. This can also simplify the implementation a bit, because some recursions can be changed from tail-recursive to body-recursive, and some can even be replaced with lists:map or a comprehension.
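
For illustration, here is a minimal sketch of the first change. It assumes each definition is a 2-tuple such as {{msg, Name}, Fields}; the helper names are mine, not gpb's actual internals:

    %% Before: each lookup scans the defs list, O(n) per message type.
    find_msg_in_list(MsgName, Defs) ->
        {_, Fields} = lists:keyfind({msg, MsgName}, 1, Defs),
        Fields.

    %% After: convert the list to a map once, then each lookup is ~O(1).
    defs_to_map(Defs) ->
        maps:from_list(Defs).

    find_msg_in_map(MsgName, DefsMap) ->
        maps:get({msg, MsgName}, DefsMap).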

Would you be open to accepting these changes upstream? As mentioned, these are currently done in a hacky way, just as a proof of concept. I'd need to redo them properly, and for the 1st change also adapt the decoder and the generated code.

sasa1977 avatar Jul 10 '22 11:07 sasa1977

Hi, interesting stuff!

About switching MsgDefs to a map: I'm getting the impression you are using the gpb module directly, as opposed to generating an encoder/decoder module and using that? If so, you would probably get some performance improvement more or less for free by switching over to a generated module. Can you, or are you limited to gpb for some reason? Maybe you or your clients are using the Elixir protobuf library, in case that implies some restrictions? I'm not familiar with how it works, and I'd be interested to learn more about your use case. The generated encoder/decoder does not traverse the list of definitions; instead, the definitions are embedded in the structure of the generated code.

I've kept the data-driven encoder/decoder in the gpb module mostly to be able to cross-check encoding/decoding results against generated modules, but I have generally not spent any effort on performance or other bells and whistles there; that has gone into the code generator instead.

Regarding maps, I just opened an issue over in hexpm/hex_core#134 asking what the oldest OTP version to support is. (If it is still 17, that would impose some limitations on how the maps expressions can be phrased.)

Regarding the api of the (presumably intended) gpb module: I think an approach could be to work on maps internally, but the api would still need to accept defs as a list as well (for backwards compatibility), and in that case convert the defs to a map as a first step. Since the documented definitions format is a list, I guess it would make sense to also expose the list-to-map conversion as an api function. (An alternative is of course to define a new version of the definitions format, but I think that would be more work, since the code generator would then need to be adapted as well.)
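
Something along these lines, perhaps (names just for illustration):

    %% Accept the documented list format, but work on maps internally.
    encode_msg(Msg, Defs) when is_list(Defs) ->
        encode_msg(Msg, msg_defs_to_map(Defs));
    encode_msg(Msg, DefsMap) when is_map(DefsMap) ->
        do_encode(Msg, DefsMap).        % do_encode/2: the map-based internals

    %% Exposed as an api function, so callers can convert once and reuse.
    msg_defs_to_map(Defs) ->
        maps:from_list(Defs).           % assuming each definition is a 2-tuple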


Regarding io-lists, do you have any performance figures for only this part of the proposed change? Currently, the code relies on the Erlang optimization that binaries are initially write-appendable under the hood. This is at least true for the generated code, and maybe also for the gpb module (I don't remember). But again, are you referring to the data-driven encoder/decoder in the gpb module? I tried iolists earlier, but didn't find they made much of a speedup, if I remember correctly. Unfortunately, I don't think I have any results to share anymore; it was quite some time ago.
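
The pattern in question is the append-optimized accumulator described in the efficiency guide, roughly (illustrative only, not the actual generated code):

    %% The runtime over-allocates Acc, so most of these appends write in
    %% place rather than copying; encode_field/1 stands in for a real encoder.
    encode_fields([F | Rest], Acc) ->
        encode_fields(Rest, <<Acc/binary, (encode_field(F))/binary>>);
    encode_fields([], Acc) ->
        Acc.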

tomas-abrahamsson avatar Jul 10 '22 13:07 tomas-abrahamsson

I'm getting the impression you are using the gpb module directly as opposed to generating an encoder/decoder module and using it?

Yeah, I was benching and optimizing gpb:encode_msg. I didn't look at the generated encoder, but I naively assumed that the code in those modules would internally use gpb. Is that not the case, i.e. you're saying that gpb is basically a parallel implementation of the encoder/decoder?

Regarding io-lists, do you have any performance figures for only this part of the proposed change?

As mentioned in the first comment, switching to iolists brought an extra 2x improvement (on the particular data structure I was benching).

Currently, the code relies on the Erlang optimization that binaries are initially write-appendable under the hood.

Yes, the binary is write-appendable, but it still needs to be reallocated if it doesn't have enough space (source). Given a large enough input data structure, I'd expect frequent reallocations. With iolists, these reallocations can be avoided: only at the end do we invoke the efficient iolist_to_binary to produce the encoded bytes.
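
In sketch form (again with a stand-in encode_field/1), the iolist variant defers all copying to a single final pass:

    %% Build a shallow list of already-encoded fragments; nothing is
    %% copied while the structure is being traversed.
    encode_fields(Fields) ->
        [encode_field(F) || F <- Fields].       % returns an iolist

    %% One efficient flatten at the very end produces the encoded bytes.
    encode_msg(Fields) ->
        iolist_to_binary(encode_fields(Fields)).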

are you limited to gpb due to something?

I'll double check and report back.

sasa1977 avatar Jul 10 '22 15:07 sasa1977

As mentioned in the first comment, switching to iolists brought an extra 2x improvement (on the particular data structure I was benching).

Sorry, this statement is misleading. The speedup was 2x after the maps optimization. But disregarding that, and speaking in absolute numbers: with the test input I'm using, encoding with the master version takes 5ms on average, and if I switch to iolists, it takes 4ms on average.

sasa1977 avatar Jul 10 '22 16:07 sasa1977

Is that not the case, i.e. you're saying that gpb is basically a parallel implementation of the encoder/decoder?

Yes, that's correct: it is a parallel implementation, and there is no run-time dependency from the generated code on gpb.

but it still needs to be reallocated if it doesn't have enough space [...] with the test input I'm using the master version encoding takes 5ms on average. If I switch to iolists, it takes 4ms on average.

Good point about reallocations, I didn't think about that. And a 20% improvement on encoding (for this particular input) is indeed something :) So this seems like it would be a worthwhile improvement.

There could probably be a break-even point somewhere: if a binary in an iolist is small, one could use integers instead, in case they are below 256. I'm thinking about memory usage for binaries vs integers, as described in the efficiency guide. For example, let's say we have a field of type bytes, with field number 1 and a 17-byte value. The key for a length-delimited field with number 1 is (1 bsl 3) + 2 = 10, which is below 256. The length 17 is below 128, so the varint encoding of the length will also fit in a single byte. Then it would be slightly more memory-efficient to store it in the iolist as [10, 17, <<17 bytes>>] than the more general [<<10>>, <<17>>, <<17 bytes>>] or [<<10, 17>>, <<17 bytes>>]. But this could well be a premature optimization that just results in overly complex code.
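
In sketch form, that special case could look like this (encode_varint/1 included for completeness; not actual gpb code):

    %% Use plain integers in the iolist when both the key byte and the
    %% varint-encoded length fit in a single byte.
    encode_bytes_field(FNum, Bytes)
      when FNum < 16, byte_size(Bytes) < 128 ->
        [(FNum bsl 3) bor 2, byte_size(Bytes), Bytes];  % e.g. [10, 17, <<17 bytes>>]
    encode_bytes_field(FNum, Bytes) ->
        [encode_varint((FNum bsl 3) bor 2),
         encode_varint(byte_size(Bytes)),
         Bytes].

    %% Standard base-128 varint encoding.
    encode_varint(N) when N < 128 -> <<N>>;
    encode_varint(N) -> <<1:1, (N band 127):7, (encode_varint(N bsr 7))/binary>>.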

tomas-abrahamsson avatar Jul 10 '22 19:07 tomas-abrahamsson

Yes, that's correct, it is a parallel implementation, there is no run-time dependency from the generated code to gpb.

How stable would you say the gpb implementation is? I'm asking because my clients are currently using that one, and switching to the generated code seems like it might be a pretty large undertaking at this point.

sasa1977 avatar Jul 10 '22 20:07 sasa1977

Stable in what sense? I'd say both implementations are fairly well tested. Most work has gone into the generated code, both because it is a somewhat more complex problem and because there are more options for it.

I forgot to mention that for the generated code, there is also a nif option to generate code that uses Google's C++ protobuf library via NIFs to encode and decode, and a bit more performance can be squeezed out with the bypass_wrappers option. But there are some caveats if you plan to use or switch between overlapping sets of proto definitions (see this section of the README.nif-cc), and the build process of course becomes a bit more complex.
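
For example, roughly:

    %% NIF-based codec that delegates to Google's C++ protobuf library:
    gpb_compile:file("my_proto.proto", [nif]),

    %% bypass_wrappers additionally skips the top-level wrapper functions:
    gpb_compile:file("my_proto.proto", [nif, bypass_wrappers]).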

tomas-abrahamsson avatar Jul 10 '22 21:07 tomas-abrahamsson

OK thanks! Let me know if you're interested in accepting these two perf improvements for the gpb module.

sasa1977 avatar Jul 11 '22 06:07 sasa1977

Yes, definitely, I think they'd be nice improvements.

tomas-abrahamsson avatar Jul 11 '22 06:07 tomas-abrahamsson

Thanks for both PRs. I will take a look.

tomas-abrahamsson avatar Jul 11 '22 20:07 tomas-abrahamsson