
Performance improvement for encoding protos with unchanged data

Open tsloughter opened this issue 1 year ago • 6 comments

I feel like I brought this up before but I forget where and didn't see an existing issue opened by me.

Basically I'd like to be able to re-use the unchanged parts of an encoded proto in later calls in cases like: https://github.com/open-telemetry/opentelemetry-erlang/blob/main/apps/opentelemetry_exporter/src/opentelemetry_exporter.erl#L500

These protos, which get encoded on every export of traces, contain parts like the Resource and each tracer's Instrumentation Scope that do not change. So it would be useful to be able to do something like:

PartialResourceSpans =
    opentelemetry_exporter_trace_service_pb:encode_partial(
        #{resource => #{attributes => to_attributes(otel_attributes:map(Attributes)),
                        dropped_attributes_count => otel_attributes:dropped(Attributes)}},
        export_trace_service_request),
opentelemetry_exporter_trace_service_pb:encode_msg(
    PartialResourceSpans,
    #{scope_spans => InstrumentationScopeSpans},
    export_trace_service_request),

Then PartialResourceSpans could be kept in the state, or in an ETS table for concurrent lookup, and each export would only have to encode InstrumentationScopeSpans (where those would also use the partial encoding function for their unchanging data).

Does that make sense? Do you think it would work with how encoding is done and be acceptable if I were to put together a PR?

tsloughter avatar Aug 03 '22 20:08 tsloughter

Interesting idea. How much performance would you gain? Would it be possible to modify the generated pb module to try out something like that?

From the example it looks like the resource field in each element in the repeated scope_spans is always the same and could be encoded just once, and then copied into every element on encoding.

How would one express that a pre-encoded field should get copied to every element of some field in a message? One might imagine specifying it by path somehow, but maintaining a path at encode time would also have a cost.

On the protobuf wire-level, a message is a stream of fields. A sub-message is a stream of fields wrapped in a length indicator, prepended with a field number and type. On decoding, if a field occurs multiple times, they are to be merged, which for sub-messages means to merge them recursively. So on some level, when encoding each element in scope_spans, just before wrapping with length and prepending field number and type, there is an opportunity for just concatenating a pre-encoded extra binary with more fields. But how to express that this should happen in an efficient way? Hmm... And how much would you gain?
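A minimal sketch (in Python, purely for illustration; gpb itself is Erlang) of the wire-level point above: a sub-message is just length-delimited bytes, so a cached, pre-encoded chunk of fields can be concatenated with freshly encoded fields just before the whole thing is wrapped with its length and tag. The field numbers and payloads here are made up.

```python
def varint(n):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

def tag(field_no, wire_type):
    """A field's key: field number shifted left 3 bits, OR'ed with wire type."""
    return varint((field_no << 3) | wire_type)

def len_delimited(field_no, payload):
    """Wire type 2: tag, then length, then payload (sub-messages and bytes)."""
    return tag(field_no, 2) + varint(len(payload)) + payload

# Pre-encode an unchanging field once (hypothetical field 1 of a sub-message)...
cached = len_delimited(1, b"resource-attrs")
# ...then, per export, encode only the changing field and concatenate the two
# streams of fields before wrapping them as sub-message field 7 of the outer
# message. Decoders merge repeated fields, so the order of fields is flexible.
fresh = len_delimited(2, b"scope-spans")
submsg = len_delimited(7, cached + fresh)
```

The concatenation step is the "opportunity" mentioned above: no re-encoding of the cached part, just a binary append before the length prefix is computed.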

Toying with ideas: There's already a concept of translating values before encoding or after decoding (the idea was to make it possible to use a more natural type, such as a set or an IP address in tuple format, if the protobuf type is bulky). Maybe it could be possible to extend this concept with a kind of translation that is applied to the binary after encoding.

tomas-abrahamsson avatar Aug 05 '22 14:08 tomas-abrahamsson

Initial quick response: I wasn't expecting automatic "path-based" caching or anything like that (not that it couldn't be done with low enough overhead to be a gain); I was just suggesting a way to pass the already-encoded piece to the gpb module along with the remaining part of the message as a map/record.

In the example it is only on exports after the first that these partial messages could be re-used. There is one resource and many instrumentation_scopes. They are all sent on every export even if they don't change. They each have an arbitrary number of attributes (key/value pairs). So the idea is to allow the exporter to just keep an encoded form of resource and of each instrumentation_scope (the exporter would be responsible for knowing which one to use by looking up the encoded version based on name/version) and pass those to the protobuf encoder, instead of the unencoded resource/scope, each time.
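The name/version lookup described above could be sketched roughly like this (Python for illustration; the cache, function, and key names are all hypothetical — in the exporter this would be Erlang state or an ETS table):

```python
# Hypothetical cache of pre-encoded instrumentation scopes, keyed by
# (name, version). A scope is encoded only the first time it is seen;
# later exports reuse the cached wire bytes.
scope_cache = {}

def encoded_scope(name, version, encode):
    """encode is any callable producing the wire bytes for (name, version)."""
    key = (name, version)
    if key not in scope_cache:
        scope_cache[key] = encode(name, version)
    return scope_cache[key]
```

The encoder then receives the cached bytes instead of the unencoded scope map, which is exactly the hand-off the proposed `encode_partial`-style API would need to support.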

tsloughter avatar Aug 05 '22 17:08 tsloughter

Oh, and modifying the generated pb module to try it out is a good idea. I'll have to toy around in there to see where I can insert a call like this.

I still need to get time to look at the translation idea.

tsloughter avatar Aug 05 '22 20:08 tsloughter

I still need to read this blog post closely but I think it details exactly the same pattern -- both how to do it and the example of OpenTelemetry's protos -- https://blog.najaryan.net/posts/partial-protobuf-encoding/

tsloughter avatar Aug 08 '22 00:08 tsloughter

it details exactly the same pattern -- both how to do it and the example of OpenTelemetry's protos -- https://blog.najaryan.net/posts/partial-protobuf-encoding/

Indeed it does. The workaround in that post, pre-encoding the sub-message and changing its declared type to bytes, works with gpb as well. In that context it was 2.2 times faster. Food for thought.
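A small sketch (Python, hand-rolled wire format for illustration) of why the bytes workaround is wire-compatible: an embedded message field and a bytes field share wire type 2, so wrapping a pre-encoded sub-message as a bytes value produces byte-identical output. The field numbers and payloads are invented.

```python
def varint(n):
    """Protobuf base-128 varint encoding of a non-negative integer."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

def field(field_no, payload):
    # Wire type 2 covers both `bytes` and embedded messages.
    return varint((field_no << 3) | 2) + varint(len(payload)) + payload

# A "resource" encoded normally, as an embedded message in field 1:
resource_payload = field(1, b"service.name") + field(2, b"my-service")
as_submsg = field(1, resource_payload)

# The same payload pre-encoded once and supplied as a `bytes` field 1:
cached = resource_payload
as_bytes_field = field(1, cached)

# Both forms put the same bytes on the wire, so decoders can't tell the
# difference; that is what makes the schema-level type swap safe.
assert as_submsg == as_bytes_field
```

This is also why gpb can accept a pre-encoded binary where a sub-message map/record is expected without any decoder-side changes.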

tomas-abrahamsson avatar Aug 10 '22 13:08 tomas-abrahamsson

I still haven't gotten to play with this, but as I'm working on the metrics exporter I realized another area where I've wanted something like this to be possible. Not with duplicate data, but a similar sort of incremental encoding.

Basically I often have to iterate over some data to collect it into the maps that can then be serialized as protos. As in, the internal data representation is not 1:1 with the proto representation.

Mainly just leaving a note so I investigate this as well when I get started :)

tsloughter avatar Sep 06 '22 23:09 tsloughter

I implemented an option, allow_preencoded_submsgs, to support this use-case. It has actually been implemented for quite a while, but I forgot to merge it. It is now included in 4.20.0, so I'm closing this.

tomas-abrahamsson avatar Oct 01 '23 20:10 tomas-abrahamsson