opentelemetry-java
Low allocation OTLP marshaler
This PR aims to explore reducing memory allocations from OTLP marshaling. It takes a slightly different route than https://github.com/open-telemetry/opentelemetry-java/pull/6171 and https://github.com/open-telemetry/opentelemetry-java/pull/6273: instead of pooling the Marshaler objects, here we introduce a MarshalerContext object that keeps the state needed for marshaling. This context can be reset and reused for subsequent marshaling attempts. Currently low allocation marshaling is implemented only for traces and metrics. My laptop doesn't produce reliable performance numbers; I'd expect the low allocation version to perform slightly worse than the original code.
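As a minimal sketch of that reuse pattern (the names and signatures here are illustrative, not necessarily this PR's actual API), the context is reset between batches while the marshaling itself happens in two passes over the data:

```java
// Illustrative only: reuse a single context across exports instead of
// allocating fresh marshaler objects for every batch.
MarshalerContext context = new MarshalerContext();

for (List<SpanData> batch : batches) {
  context.reset();                                              // drop last batch's state
  int size = marshaler.getBinarySerializedSize(context, batch); // pass 1: record sizes
  marshaler.writeBinaryTo(output, context, batch);              // pass 2: consume them
}
```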
cc @asafm
```
Benchmark (numSpans) Mode Cnt Score Error Units
RequestMarshalBenchmarks.createCustomMarshal 16 avgt 10 5.944 ± 0.230 us/op
RequestMarshalBenchmarks.createCustomMarshal:gc.alloc.rate 16 avgt 10 2797.889 ± 105.068 MB/sec
RequestMarshalBenchmarks.createCustomMarshal:gc.alloc.rate.norm 16 avgt 10 17432.003 ± 0.001 B/op
RequestMarshalBenchmarks.createCustomMarshal:gc.count 16 avgt 10 76.000 counts
RequestMarshalBenchmarks.createCustomMarshal:gc.time 16 avgt 10 83.000 ms
RequestMarshalBenchmarks.createCustomMarshal 512 avgt 10 198.188 ± 9.472 us/op
RequestMarshalBenchmarks.createCustomMarshal:gc.alloc.rate 512 avgt 10 2520.903 ± 117.198 MB/sec
RequestMarshalBenchmarks.createCustomMarshal:gc.alloc.rate.norm 512 avgt 10 523518.259 ± 15.589 B/op
RequestMarshalBenchmarks.createCustomMarshal:gc.count 512 avgt 10 69.000 counts
RequestMarshalBenchmarks.createCustomMarshal:gc.time 512 avgt 10 76.000 ms
RequestMarshalBenchmarks.createCustomMarshalLowAllocation 16 avgt 10 4.335 ± 0.238 us/op
RequestMarshalBenchmarks.createCustomMarshalLowAllocation:gc.alloc.rate 16 avgt 10 0.001 ± 0.001 MB/sec
RequestMarshalBenchmarks.createCustomMarshalLowAllocation:gc.alloc.rate.norm 16 avgt 10 0.002 ± 0.001 B/op
RequestMarshalBenchmarks.createCustomMarshalLowAllocation:gc.count 16 avgt 10 ≈ 0 counts
RequestMarshalBenchmarks.createCustomMarshalLowAllocation 512 avgt 10 144.746 ± 2.872 us/op
RequestMarshalBenchmarks.createCustomMarshalLowAllocation:gc.alloc.rate 512 avgt 10 0.001 ± 0.001 MB/sec
RequestMarshalBenchmarks.createCustomMarshalLowAllocation:gc.alloc.rate.norm 512 avgt 10 0.077 ± 0.010 B/op
RequestMarshalBenchmarks.createCustomMarshalLowAllocation:gc.count 512 avgt 10 ≈ 0 counts
RequestMarshalBenchmarks.marshalCustom 16 avgt 10 13.881 ± 0.408 us/op
RequestMarshalBenchmarks.marshalCustom:gc.alloc.rate 16 avgt 10 1210.876 ± 35.524 MB/sec
RequestMarshalBenchmarks.marshalCustom:gc.alloc.rate.norm 16 avgt 10 17624.007 ± 0.001 B/op
RequestMarshalBenchmarks.marshalCustom:gc.count 16 avgt 10 40.000 counts
RequestMarshalBenchmarks.marshalCustom:gc.time 16 avgt 10 42.000 ms
RequestMarshalBenchmarks.marshalCustom 512 avgt 10 504.722 ± 13.999 us/op
RequestMarshalBenchmarks.marshalCustom:gc.alloc.rate 512 avgt 10 989.631 ± 27.198 MB/sec
RequestMarshalBenchmarks.marshalCustom:gc.alloc.rate.norm 512 avgt 10 523704.262 ± 0.012 B/op
RequestMarshalBenchmarks.marshalCustom:gc.count 512 avgt 10 33.000 counts
RequestMarshalBenchmarks.marshalCustom:gc.time 512 avgt 10 37.000 ms
RequestMarshalBenchmarks.marshalCustomLowAllocation 16 avgt 10 14.969 ± 1.311 us/op
RequestMarshalBenchmarks.marshalCustomLowAllocation:gc.alloc.rate 16 avgt 10 7.155 ± 0.603 MB/sec
RequestMarshalBenchmarks.marshalCustomLowAllocation:gc.alloc.rate.norm 16 avgt 10 112.008 ± 0.001 B/op
RequestMarshalBenchmarks.marshalCustomLowAllocation:gc.count 16 avgt 10 1.000 counts
RequestMarshalBenchmarks.marshalCustomLowAllocation:gc.time 16 avgt 10 2.000 ms
RequestMarshalBenchmarks.marshalCustomLowAllocation 512 avgt 10 521.996 ± 20.715 us/op
RequestMarshalBenchmarks.marshalCustomLowAllocation:gc.alloc.rate 512 avgt 10 0.205 ± 0.008 MB/sec
RequestMarshalBenchmarks.marshalCustomLowAllocation:gc.alloc.rate.norm 512 avgt 10 112.273 ± 0.022 B/op
RequestMarshalBenchmarks.marshalCustomLowAllocation:gc.count 512 avgt 10 ≈ 0 counts
RequestMarshalBenchmarks.marshalJson 16 avgt 10 53.794 ± 1.443 us/op
RequestMarshalBenchmarks.marshalJson:gc.alloc.rate 16 avgt 10 561.794 ± 14.984 MB/sec
RequestMarshalBenchmarks.marshalJson:gc.alloc.rate.norm 16 avgt 10 31688.029 ± 0.002 B/op
RequestMarshalBenchmarks.marshalJson:gc.count 16 avgt 10 19.000 counts
RequestMarshalBenchmarks.marshalJson:gc.time 16 avgt 10 21.000 ms
RequestMarshalBenchmarks.marshalJson 512 avgt 10 1680.151 ± 52.414 us/op
RequestMarshalBenchmarks.marshalJson:gc.alloc.rate 512 avgt 10 546.572 ± 17.201 MB/sec
RequestMarshalBenchmarks.marshalJson:gc.alloc.rate.norm 512 avgt 10 962732.547 ± 17.499 B/op
RequestMarshalBenchmarks.marshalJson:gc.count 512 avgt 10 18.000 counts
RequestMarshalBenchmarks.marshalJson:gc.time 512 avgt 10 21.000 ms
RequestMarshalBenchmarks.marshalJsonLowAllocation 16 avgt 10 53.824 ± 4.461 us/op
RequestMarshalBenchmarks.marshalJsonLowAllocation:gc.alloc.rate 16 avgt 10 165.810 ± 13.209 MB/sec
RequestMarshalBenchmarks.marshalJsonLowAllocation:gc.alloc.rate.norm 16 avgt 10 9336.029 ± 0.002 B/op
RequestMarshalBenchmarks.marshalJsonLowAllocation:gc.count 16 avgt 10 5.000 counts
RequestMarshalBenchmarks.marshalJsonLowAllocation:gc.time 16 avgt 10 6.000 ms
RequestMarshalBenchmarks.marshalJsonLowAllocation 512 avgt 10 1647.591 ± 29.917 us/op
RequestMarshalBenchmarks.marshalJsonLowAllocation:gc.alloc.rate 512 avgt 10 156.981 ± 2.836 MB/sec
RequestMarshalBenchmarks.marshalJsonLowAllocation:gc.alloc.rate.norm 512 avgt 10 271228.434 ± 17.073 B/op
RequestMarshalBenchmarks.marshalJsonLowAllocation:gc.count 512 avgt 10 5.000 counts
RequestMarshalBenchmarks.marshalJsonLowAllocation:gc.time 512 avgt 10 5.000 ms
```
```
Benchmark Mode Cnt Score Error Units
MetricsRequestMarshalerBenchmark.marshaler avgt 10 2.291 ± 0.063 us/op
MetricsRequestMarshalerBenchmark.marshaler:gc.alloc.rate avgt 10 1372.064 ± 38.083 MB/sec
MetricsRequestMarshalerBenchmark.marshaler:gc.alloc.rate.norm avgt 10 3296.001 ± 0.001 B/op
MetricsRequestMarshalerBenchmark.marshaler:gc.count avgt 10 46.000 counts
MetricsRequestMarshalerBenchmark.marshaler:gc.time avgt 10 47.000 ms
MetricsRequestMarshalerBenchmark.marshalerJson avgt 10 10.821 ± 0.287 us/op
MetricsRequestMarshalerBenchmark.marshalerJson:gc.alloc.rate avgt 10 709.355 ± 18.802 MB/sec
MetricsRequestMarshalerBenchmark.marshalerJson:gc.alloc.rate.norm avgt 10 8048.006 ± 0.001 B/op
MetricsRequestMarshalerBenchmark.marshalerJson:gc.count avgt 10 23.000 counts
MetricsRequestMarshalerBenchmark.marshalerJson:gc.time avgt 10 23.000 ms
MetricsRequestMarshalerBenchmark.marshalerJsonLowAllocation avgt 10 10.737 ± 0.363 us/op
MetricsRequestMarshalerBenchmark.marshalerJsonLowAllocation:gc.alloc.rate avgt 10 341.109 ± 11.322 MB/sec
MetricsRequestMarshalerBenchmark.marshalerJsonLowAllocation:gc.alloc.rate.norm avgt 10 3840.006 ± 0.001 B/op
MetricsRequestMarshalerBenchmark.marshalerJsonLowAllocation:gc.count avgt 10 11.000 counts
MetricsRequestMarshalerBenchmark.marshalerJsonLowAllocation:gc.time avgt 10 12.000 ms
MetricsRequestMarshalerBenchmark.marshalerLowAllocation avgt 10 3.247 ± 0.244 us/op
MetricsRequestMarshalerBenchmark.marshalerLowAllocation:gc.alloc.rate avgt 10 14.127 ± 1.052 MB/sec
MetricsRequestMarshalerBenchmark.marshalerLowAllocation:gc.alloc.rate.norm avgt 10 48.002 ± 0.001 B/op
MetricsRequestMarshalerBenchmark.marshalerLowAllocation:gc.count avgt 10 1.000 counts
MetricsRequestMarshalerBenchmark.marshalerLowAllocation:gc.time avgt 10 1.000 ms
```
Codecov Report
Attention: Patch coverage is 91.22658%, with 103 lines in your changes missing coverage. Please review.
Project coverage is 91.15%. Comparing base (da4cd93) to head (8edb3d9).
Additional details and impacted files
```diff
@@ Coverage Diff @@
## main #6328 +/- ##
============================================
+ Coverage 91.12% 91.15% +0.03%
- Complexity 5856 6127 +271
============================================
Files 636 668 +32
Lines 17062 18212 +1150
Branches 1733 1804 +71
============================================
+ Hits 15548 16602 +1054
- Misses 1019 1068 +49
- Partials 495 542 +47
```
@laurit Thanks for writing this PR. I'll start by explaining the main idea behind this PR compared to my second draft PR: https://github.com/open-telemetry/opentelemetry-java/pull/6273
The design I had in mind when writing it was that each message type, which today has a matching Marshaler class for marshalling, would have 2 static methods:
messageSize(): returns a MessageSize object. The idea was to first traverse the tree of objects to serialize and create a tree of sizes, so a MessageSize is in effect a node in that tree. It holds a DynamicList (in effect, an array) pointing to the MessageSize of each child node of that tree. For example, for
```protobuf
message ResourceMetrics {
  reserved 1000;

  // The resource for the metrics in this message.
  // If this field is not set then no resource info is known.
  opentelemetry.proto.resource.v1.Resource resource = 1;

  // A list of metrics that originate from a resource.
  repeated ScopeMetrics scope_metrics = 2;

  // The Schema URL, if known. This is the identifier of the Schema that the resource data
  // is recorded in. To learn more about Schema URL see
  // https://opentelemetry.io/docs/specs/otel/schemas/#schema-url
  // This schema_url applies to the data in the "resource" field. It does not apply
  // to the data in the "scope_metrics" field which have their own schema_url field.
  string schema_url = 3;
}
```
the MessageSize dynamic list's first element would be the MessageSize of resource, and the elements from the second onward would be the sizes of each ScopeMetrics message.
encode(messageSize, ...): also a static method on each Marshaler class. It traverses the data again, this time using the supplied messageSize, which contains the size needed for each child message.
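To make the tree idea concrete, here is a hedged sketch for the message above; MessageSize, DynamicList, and the marshaler names are assumptions drawn from this description, not code from my draft PR:

```java
// Hypothetical size pass for ResourceMetrics: the returned MessageSize node
// mirrors the proto structure (resource first, then each ScopeMetrics).
static MessageSize messageSize(ResourceMetricsData data) {
  DynamicList<MessageSize> children = DynamicList.of(1 + data.getScopeMetrics().size());
  children.add(ResourceMarshaler.messageSize(data.getResource()));
  for (ScopeMetricsData scope : data.getScopeMetrics()) {
    children.add(ScopeMetricsMarshaler.messageSize(scope));
  }
  return MessageSize.of(children); // the node's own size is derived from its children
}
```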
You're basically saying: building a tree of sizes means having a pool of MessageSize objects and a pool of DynamicList objects of various sizes.
Let's remove all those objects altogether.
How? You suggest representing the size tree as a flat list. You traverse the data in the messageSize() static method of each Marshaler class, but instead of preparing a MessageSize you add the size directly to the list. In the encode static method, you traverse the size list in exactly the same order. Meaning, you only need a queue of sizes, built and consumed the way a queue is: first in, first out.
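A toy sketch of that flattening, with hypothetical names throughout: because the encode pass writes a parent's length prefix before its children, the size pass can reserve the parent's slot first and fill it in once the children are summed, so both passes walk the queue in the same first-in, first-out order.

```java
// Hypothetical size pass: reserves the parent slot up front so the encode pass,
// which consumes sizes front to back, sees the parent's size before the children's.
static int calculateSize(ResourceMetricsData data, MarshalerContext context) {
  int parentSlot = context.reserveSize();          // placeholder, dequeued first
  int size = ResourceMarshaler.calculateSize(data.getResource(), context);
  for (ScopeMetricsData scope : data.getScopeMetrics()) {
    size += ScopeMetricsMarshaler.calculateSize(scope, context);
  }
  context.setSize(parentSlot, size);               // fill in the reserved slot
  return size;
}
```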
I agree that it will have better performance in terms of allocation. On first read of your code, I found it very confusing. After thinking some more, I realized that the confusion is mainly caused by the way the code is written. I'm pretty sure that with good naming / abstractions this can be made very clear.
To summarize the first big change suggested in this PR: I agree it is much more performant, and can be made readable. I need to try it to see how readable I can make it.
The second big change is to how you serialize repeated elements, and mostly how the Serializer is now aware of the MarshalerContext, which is in fact the sizes tree. I don't like it; I think the serializer should be agnostic of the business logic it needs to serialize. Avoiding that is not hard, and it's not a performance loss.
@jack-berg I can try creating a 3rd draft based on this idea, but it's an effort so I'd like your opinion on the design before implementing it.
I wanted to get familiar with these ideas so I pulled down this branch, and made a little E2E POC of how to use this concept to serialize a simple, contrived proto data structure:
- Branch is here, commit is here
- Built a simple proto data structure with some of the things we need to account for in OTLP (strings, nested messages, repeated fields)
- Built corresponding Java in-memory representations for the data structure akin to MetricData, SpanData, etc., along with marshalers here
- I agree with @asafm that it's a bit hard to understand the code as is, but I also agree that we can improve it with the right abstractions / naming. I introduced an interface called StatelessMarshaler which mirrors Marshaler but with adjusted APIs to include MarshalerContext for passing state around (see the sketch after this list). Implementations of StatelessMarshaler are singletons. Added some helpers to Serializer and Marshaler with adjusted APIs for StatelessMarshaler.
- Here's a test demonstrating end to end behavior. It encodes in JSON format just for demonstration purposes, and spits out `{"p1":"value1","p2":{"c1":[{"g1":["1","2"]},{"g1":["3","4"]}],"c2":1},"p3":"value2"}` just as expected. Nice.
- The key for writing these marshalers is to ensure that everything that is added to the MarshalerContext stack when computing size is popped off in exactly the same order. To reduce the chance of errors, we'll need good abstractions / helpers, and good unit test coverage.
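A hedged sketch of the StatelessMarshaler shape described in the list above; the exact method names and signatures in the POC branch may differ:

```java
// Illustrative interface: like Marshaler, but implemented by stateless
// singletons that thread all per-call state through a MarshalerContext.
interface StatelessMarshaler<T> {
  // Size pass: computes the encoded size and records it in the context.
  int getBinarySerializedSize(MarshalerContext context, T value);

  // Encode pass: writes the value, consuming the sizes recorded earlier.
  void writeTo(Serializer output, MarshalerContext context, T value) throws IOException;
}
```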
All this is to say that I have a better understanding of what's happening now, and I'm on board with the approach, provided we do some things to improve readability.
@jack-berg I have modified the PR as you suggested
Sorry for the late reply - I was giving a lecture at a conference.
First, I really love the technique you used here: take a very big problem and build a small model of it, so we can iterate on the solution/design concepts. Very nice.
Second, I like the idea of making the marshalers implement a new interface and making them instances. This makes it easier to read the code, since you can pass the object around and avoid the Consumer<...> pattern, which was complicated to read. Very nice methodology.
Third, I saw that in your implementation you are only saving the traversed message sizes, and not the data itself - which is the approach I prefer as well.
Regarding MarshalerUtil.getUtf8Size - it doesn't take into account the field tag size.
Perhaps we need something like:
```java
/** Returns the size of a string field, encoded using UTF-8 encoding. */
public static int sizeStringUtf8(ProtoFieldInfo field, String message) {
  if (message.isEmpty()) {
    return 0;
  }
  return field.getTagSize() + CodedOutputStream.computeStringSizeNoTag(message);
}
```
Here we can choose whether to use CodedOutputStream.computeStringSizeNoTag(message) or the implementation that just calculates the length, which exists in your branch. If we use the latter, I believe we need to attribute where we took it from? It looks a tad different compared to https://github.com/google/guava/blob/master/guava/src/com/google/common/base/Utf8.java#L49
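For reference, a sketch of that allocation-free length calculation in the style of the linked Guava Utf8.encodedLength; this is my paraphrase, not the exact code from either branch, and it assumes well-formed surrogate pairs:

```java
// Counts the UTF-8 encoded length of a string without encoding it. Real code
// should also validate unpaired surrogates, which this sketch skips.
static int encodedUtf8Length(String string) {
  int utf16Length = string.length();
  int utf8Length = utf16Length; // every UTF-16 char is at least 1 UTF-8 byte
  int i = 0;
  while (i < utf16Length && string.charAt(i) < 0x80) {
    i++; // fast path over the ASCII prefix
  }
  for (; i < utf16Length; i++) {
    char c = string.charAt(i);
    if (c < 0x800) {
      utf8Length += (0x7f - c) >>> 31; // branch-free +1 for chars >= 0x80
    } else {
      utf8Length += 2; // 3-byte char, or half of a 4-byte surrogate pair
      if (Character.isHighSurrogate(c)) {
        i++; // skip the low surrogate; the pair totals 4 bytes (2 base + 2 here)
      }
    }
  }
  return utf8Length;
}
```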
Regarding ParentMarshaler.getBinarySize:
```java
int o1Size = MarshalerUtil.getUtf8Size(value.getO1());
context.addSize(o1Size);
size += o1Size;
```
Here we forgot to take into account the field tag size.
```java
size += ChildMarshaler.getInstance().getBinarySerializedSize(context, value.getO2());

int o3Size = MarshalerUtil.getUtf8Size(value.getO3());
context.addSize(o3Size);
size += o3Size;
```
Here we forgot to call context.addSize() for the result of ChildMarshaler.getInstance().getBinarySerializedSize(context, value.getO2()).
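Applied to the line above, the fix might look like this; a sketch mirroring the surrounding o1/o3 pattern, not code from the branch:

```java
// Record the child message size so the encode pass can dequeue it in order.
int o2Size = ChildMarshaler.getInstance().getBinarySerializedSize(context, value.getO2());
context.addSize(o2Size); // this call was missing
size += o2Size;
```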
Regarding the MarshalerContext methods addSize() and getSize(): I wouldn't place them directly on the context class. I would add something like
```java
class PrimitiveIntQueue {
  private int[] sizes = new int[16]; // grows on demand, like DynamicPrimitiveLongList
  private int writeIndex;
  private int readIndex;

  void enqueue(int size) {
    if (writeIndex == sizes.length) {
      sizes = java.util.Arrays.copyOf(sizes, sizes.length * 2);
    }
    sizes[writeIndex++] = size;
  }

  int dequeue() {
    return sizes[readIndex++];
  }
}
```
and use it:

```java
class MarshalerContext {
  // ...
  private final PrimitiveIntQueue binarySizesQueue = new PrimitiveIntQueue();

  PrimitiveIntQueue binarySizes() {
    return binarySizesQueue;
  }
}
```
Using it looks like:

```java
output.serializeString(Parent.P1, value.getO1(), context.binarySizes().dequeue());
```

instead of:

```java
output.serializeString(Parent.P1, value.getO1(), context.getSize());
```
What I'm trying to convey is that we have a queue of sizes, and seeing enqueue() and dequeue() in the getBinarySize() and encode() methods of the same Marshaler class will make more sense to the reader. It also moves the queue implementation out of the MarshalerContext class, making it less busy.
General question - once we have an agreement on how to proceed (the design), can I proceed to implementing it in the metrics? Or do you want @jack-berg to do it? How do we not step on each other's toes, also for traces?
I like where this is heading! Here are steps I see needed to proceed:
- Let's land one signal first for ease of review. Spans seem simplest because they avoid the extra layers of hierarchy that metrics have. @laurit if you're up for it, can you open a new PR with the scope reduced to spans? If not, I'm happy to help.
- Let's refactor the existing marshaler tests to be parameterized, such that we can confirm we have like-for-like behavior with the low allocation version of the marshalers. For spans, this would mean TraceRequestMarshalerTest (see the sketch after this list).
- To limit scope, let's not wire things up into the exporters yet. We can do that in a followup.
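A hedged sketch of that parameterized setup; the parameter source, constants, and names here are illustrative rather than the actual TraceRequestMarshalerTest code:

```java
// Runs the same assertion against both marshaler implementations, so the low
// allocation path must match the existing one byte for byte.
@ParameterizedTest
@EnumSource(MarshalerVariant.class) // e.g. DEFAULT and LOW_ALLOCATION
void marshalersProduceIdenticalOutput(MarshalerVariant variant) throws IOException {
  ByteArrayOutputStream bos = new ByteArrayOutputStream();
  variant.createMarshaler(SPAN_DATA).writeBinaryTo(bos);
  assertThat(bos.toByteArray()).isEqualTo(EXPECTED_BYTES);
}
```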
> General question - once we have an agreement on how to proceed (the design), can I proceed to implementing it in the metrics? Or do you want @jack-berg to do it? How do we not step on each other's toes, also for traces?
@asafm I know your ability to work on this has changed, but I think we can work together to land a PR establishing the patterns. After that, we can work on the other signals in parallel. We can also work on wiring this into the collectors and configuration in parallel once the first signal has landed.
@laurit let me know how you want to proceed. Happy to run with any of this or let you adjust scope to land the first PR, depending on your availability.
> if you're up for it, can you open a new PR with the scope reduced to spans? If not, I'm happy to help.
I'll start working on it tomorrow.
FYI @laurit - I'm ready to review this whenever you give me the heads up. I still see that it has draft status and includes a bunch of the metrics changes, which I think we should hold off on to make the review more manageable, so I'm holding off until I hear otherwise.
This has been superseded by other PRs.