opentelemetry-java
Low allocation OTLP marshaler
This PR aims to explore reducing memory allocations from OTLP marshaling. It takes a slightly different route than https://github.com/open-telemetry/opentelemetry-java/pull/6171 and https://github.com/open-telemetry/opentelemetry-java/pull/6273: instead of pooling the Marshaler objects, here we introduce a MarshalerContext object that keeps the state needed for marshaling. This context can be reset and reused for subsequent marshaling attempts. Currently low allocation marshaling is implemented only for traces and metrics. My laptop doesn't produce reliable performance numbers; I'd expect the low allocation version to perform slightly worse than the original code.
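As a minimal sketch of that reuse pattern (the names and signatures here are illustrative, not necessarily this PR's actual API), the context is reset between batches while the marshaling itself happens in two passes over the data:

```java
// Illustrative only: reuse a single context across exports instead of
// allocating fresh marshaler objects for every batch.
MarshalerContext context = new MarshalerContext();

for (List<SpanData> batch : batches) {
  context.reset();                                              // drop last batch's state
  int size = marshaler.getBinarySerializedSize(context, batch); // pass 1: record sizes
  marshaler.writeBinaryTo(output, context, batch);              // pass 2: consume them
}
```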
cc @asafm
```
Benchmark (numSpans) Mode Cnt Score Error Units
RequestMarshalBenchmarks.createCustomMarshal 16 avgt 10 5.944 ± 0.230 us/op
RequestMarshalBenchmarks.createCustomMarshal:gc.alloc.rate 16 avgt 10 2797.889 ± 105.068 MB/sec
RequestMarshalBenchmarks.createCustomMarshal:gc.alloc.rate.norm 16 avgt 10 17432.003 ± 0.001 B/op
RequestMarshalBenchmarks.createCustomMarshal:gc.count 16 avgt 10 76.000 counts
RequestMarshalBenchmarks.createCustomMarshal:gc.time 16 avgt 10 83.000 ms
RequestMarshalBenchmarks.createCustomMarshal 512 avgt 10 198.188 ± 9.472 us/op
RequestMarshalBenchmarks.createCustomMarshal:gc.alloc.rate 512 avgt 10 2520.903 ± 117.198 MB/sec
RequestMarshalBenchmarks.createCustomMarshal:gc.alloc.rate.norm 512 avgt 10 523518.259 ± 15.589 B/op
RequestMarshalBenchmarks.createCustomMarshal:gc.count 512 avgt 10 69.000 counts
RequestMarshalBenchmarks.createCustomMarshal:gc.time 512 avgt 10 76.000 ms
RequestMarshalBenchmarks.createCustomMarshalLowAllocation 16 avgt 10 4.335 ± 0.238 us/op
RequestMarshalBenchmarks.createCustomMarshalLowAllocation:gc.alloc.rate 16 avgt 10 0.001 ± 0.001 MB/sec
RequestMarshalBenchmarks.createCustomMarshalLowAllocation:gc.alloc.rate.norm 16 avgt 10 0.002 ± 0.001 B/op
RequestMarshalBenchmarks.createCustomMarshalLowAllocation:gc.count 16 avgt 10 ≈ 0 counts
RequestMarshalBenchmarks.createCustomMarshalLowAllocation 512 avgt 10 144.746 ± 2.872 us/op
RequestMarshalBenchmarks.createCustomMarshalLowAllocation:gc.alloc.rate 512 avgt 10 0.001 ± 0.001 MB/sec
RequestMarshalBenchmarks.createCustomMarshalLowAllocation:gc.alloc.rate.norm 512 avgt 10 0.077 ± 0.010 B/op
RequestMarshalBenchmarks.createCustomMarshalLowAllocation:gc.count 512 avgt 10 ≈ 0 counts
RequestMarshalBenchmarks.marshalCustom 16 avgt 10 13.881 ± 0.408 us/op
RequestMarshalBenchmarks.marshalCustom:gc.alloc.rate 16 avgt 10 1210.876 ± 35.524 MB/sec
RequestMarshalBenchmarks.marshalCustom:gc.alloc.rate.norm 16 avgt 10 17624.007 ± 0.001 B/op
RequestMarshalBenchmarks.marshalCustom:gc.count 16 avgt 10 40.000 counts
RequestMarshalBenchmarks.marshalCustom:gc.time 16 avgt 10 42.000 ms
RequestMarshalBenchmarks.marshalCustom 512 avgt 10 504.722 ± 13.999 us/op
RequestMarshalBenchmarks.marshalCustom:gc.alloc.rate 512 avgt 10 989.631 ± 27.198 MB/sec
RequestMarshalBenchmarks.marshalCustom:gc.alloc.rate.norm 512 avgt 10 523704.262 ± 0.012 B/op
RequestMarshalBenchmarks.marshalCustom:gc.count 512 avgt 10 33.000 counts
RequestMarshalBenchmarks.marshalCustom:gc.time 512 avgt 10 37.000 ms
RequestMarshalBenchmarks.marshalCustomLowAllocation 16 avgt 10 14.969 ± 1.311 us/op
RequestMarshalBenchmarks.marshalCustomLowAllocation:gc.alloc.rate 16 avgt 10 7.155 ± 0.603 MB/sec
RequestMarshalBenchmarks.marshalCustomLowAllocation:gc.alloc.rate.norm 16 avgt 10 112.008 ± 0.001 B/op
RequestMarshalBenchmarks.marshalCustomLowAllocation:gc.count 16 avgt 10 1.000 counts
RequestMarshalBenchmarks.marshalCustomLowAllocation:gc.time 16 avgt 10 2.000 ms
RequestMarshalBenchmarks.marshalCustomLowAllocation 512 avgt 10 521.996 ± 20.715 us/op
RequestMarshalBenchmarks.marshalCustomLowAllocation:gc.alloc.rate 512 avgt 10 0.205 ± 0.008 MB/sec
RequestMarshalBenchmarks.marshalCustomLowAllocation:gc.alloc.rate.norm 512 avgt 10 112.273 ± 0.022 B/op
RequestMarshalBenchmarks.marshalCustomLowAllocation:gc.count 512 avgt 10 ≈ 0 counts
RequestMarshalBenchmarks.marshalJson 16 avgt 10 53.794 ± 1.443 us/op
RequestMarshalBenchmarks.marshalJson:gc.alloc.rate 16 avgt 10 561.794 ± 14.984 MB/sec
RequestMarshalBenchmarks.marshalJson:gc.alloc.rate.norm 16 avgt 10 31688.029 ± 0.002 B/op
RequestMarshalBenchmarks.marshalJson:gc.count 16 avgt 10 19.000 counts
RequestMarshalBenchmarks.marshalJson:gc.time 16 avgt 10 21.000 ms
RequestMarshalBenchmarks.marshalJson 512 avgt 10 1680.151 ± 52.414 us/op
RequestMarshalBenchmarks.marshalJson:gc.alloc.rate 512 avgt 10 546.572 ± 17.201 MB/sec
RequestMarshalBenchmarks.marshalJson:gc.alloc.rate.norm 512 avgt 10 962732.547 ± 17.499 B/op
RequestMarshalBenchmarks.marshalJson:gc.count 512 avgt 10 18.000 counts
RequestMarshalBenchmarks.marshalJson:gc.time 512 avgt 10 21.000 ms
RequestMarshalBenchmarks.marshalJsonLowAllocation 16 avgt 10 53.824 ± 4.461 us/op
RequestMarshalBenchmarks.marshalJsonLowAllocation:gc.alloc.rate 16 avgt 10 165.810 ± 13.209 MB/sec
RequestMarshalBenchmarks.marshalJsonLowAllocation:gc.alloc.rate.norm 16 avgt 10 9336.029 ± 0.002 B/op
RequestMarshalBenchmarks.marshalJsonLowAllocation:gc.count 16 avgt 10 5.000 counts
RequestMarshalBenchmarks.marshalJsonLowAllocation:gc.time 16 avgt 10 6.000 ms
RequestMarshalBenchmarks.marshalJsonLowAllocation 512 avgt 10 1647.591 ± 29.917 us/op
RequestMarshalBenchmarks.marshalJsonLowAllocation:gc.alloc.rate 512 avgt 10 156.981 ± 2.836 MB/sec
RequestMarshalBenchmarks.marshalJsonLowAllocation:gc.alloc.rate.norm 512 avgt 10 271228.434 ± 17.073 B/op
RequestMarshalBenchmarks.marshalJsonLowAllocation:gc.count 512 avgt 10 5.000 counts
RequestMarshalBenchmarks.marshalJsonLowAllocation:gc.time 512 avgt 10 5.000 ms
```
```
Benchmark Mode Cnt Score Error Units
MetricsRequestMarshalerBenchmark.marshaler avgt 10 2.291 ± 0.063 us/op
MetricsRequestMarshalerBenchmark.marshaler:gc.alloc.rate avgt 10 1372.064 ± 38.083 MB/sec
MetricsRequestMarshalerBenchmark.marshaler:gc.alloc.rate.norm avgt 10 3296.001 ± 0.001 B/op
MetricsRequestMarshalerBenchmark.marshaler:gc.count avgt 10 46.000 counts
MetricsRequestMarshalerBenchmark.marshaler:gc.time avgt 10 47.000 ms
MetricsRequestMarshalerBenchmark.marshalerJson avgt 10 10.821 ± 0.287 us/op
MetricsRequestMarshalerBenchmark.marshalerJson:gc.alloc.rate avgt 10 709.355 ± 18.802 MB/sec
MetricsRequestMarshalerBenchmark.marshalerJson:gc.alloc.rate.norm avgt 10 8048.006 ± 0.001 B/op
MetricsRequestMarshalerBenchmark.marshalerJson:gc.count avgt 10 23.000 counts
MetricsRequestMarshalerBenchmark.marshalerJson:gc.time avgt 10 23.000 ms
MetricsRequestMarshalerBenchmark.marshalerJsonLowAllocation avgt 10 10.737 ± 0.363 us/op
MetricsRequestMarshalerBenchmark.marshalerJsonLowAllocation:gc.alloc.rate avgt 10 341.109 ± 11.322 MB/sec
MetricsRequestMarshalerBenchmark.marshalerJsonLowAllocation:gc.alloc.rate.norm avgt 10 3840.006 ± 0.001 B/op
MetricsRequestMarshalerBenchmark.marshalerJsonLowAllocation:gc.count avgt 10 11.000 counts
MetricsRequestMarshalerBenchmark.marshalerJsonLowAllocation:gc.time avgt 10 12.000 ms
MetricsRequestMarshalerBenchmark.marshalerLowAllocation avgt 10 3.247 ± 0.244 us/op
MetricsRequestMarshalerBenchmark.marshalerLowAllocation:gc.alloc.rate avgt 10 14.127 ± 1.052 MB/sec
MetricsRequestMarshalerBenchmark.marshalerLowAllocation:gc.alloc.rate.norm avgt 10 48.002 ± 0.001 B/op
MetricsRequestMarshalerBenchmark.marshalerLowAllocation:gc.count avgt 10 1.000 counts
MetricsRequestMarshalerBenchmark.marshalerLowAllocation:gc.time avgt 10 1.000 ms
```
Codecov Report
Attention: Patch coverage is 91.22658%, with 103 lines in your changes missing coverage. Please review.
Project coverage is 91.15%. Comparing base (da4cd93) to head (8edb3d9).
Additional details and impacted files
```diff
@@ Coverage Diff @@
## main #6328 +/- ##
============================================
+ Coverage 91.12% 91.15% +0.03%
- Complexity 5856 6127 +271
============================================
Files 636 668 +32
Lines 17062 18212 +1150
Branches 1733 1804 +71
============================================
+ Hits 15548 16602 +1054
- Misses 1019 1068 +49
- Partials 495 542 +47
```
@laurit Thanks for writing this PR. I'll start by explaining the main idea behind this PR compared to my second draft PR: https://github.com/open-telemetry/opentelemetry-java/pull/6273
The design I had in mind when writing it was that each message type, which today has a matching Marshaler class for marshalling, would have 2 static methods:
messageSize(): returns a MessageSize object. The idea was to first traverse the tree of objects to serialize and create a tree of sizes, so a MessageSize is in effect a node in that tree. It holds a DynamicList (in effect, an array) pointing to the MessageSize of each child node of that tree. For example, for
```protobuf
message ResourceMetrics {
  reserved 1000;

  // The resource for the metrics in this message.
  // If this field is not set then no resource info is known.
  opentelemetry.proto.resource.v1.Resource resource = 1;

  // A list of metrics that originate from a resource.
  repeated ScopeMetrics scope_metrics = 2;

  // The Schema URL, if known. This is the identifier of the Schema that the resource data
  // is recorded in. To learn more about Schema URL see
  // https://opentelemetry.io/docs/specs/otel/schemas/#schema-url
  // This schema_url applies to the data in the "resource" field. It does not apply
  // to the data in the "scope_metrics" field which have their own schema_url field.
  string schema_url = 3;
}
```
the MessageSize dynamic list's first element would be the MessageSize of resource, and the elements from the second onward would be the sizes of each ScopeMetrics message.
encode(messageSize, ...): also a static method on each Marshaler class. It traverses the data again, this time using the supplied messageSize, which contains the size needed for each child message.
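To make the tree idea concrete, here is a hedged sketch for the message above; MessageSize, DynamicList, and the marshaler names are assumptions drawn from this description, not code from my draft PR:

```java
// Hypothetical size pass for ResourceMetrics: the returned MessageSize node
// mirrors the proto structure (resource first, then each ScopeMetrics).
static MessageSize messageSize(ResourceMetricsData data) {
  DynamicList<MessageSize> children = DynamicList.of(1 + data.getScopeMetrics().size());
  children.add(ResourceMarshaler.messageSize(data.getResource()));
  for (ScopeMetricsData scope : data.getScopeMetrics()) {
    children.add(ScopeMetricsMarshaler.messageSize(scope));
  }
  return MessageSize.of(children); // the node's own size is derived from its children
}
```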
You're basically saying: building a tree of sizes means having a pool of MessageSize objects and a pool of DynamicList objects of various sizes.
Let's remove all those objects altogether.
How? You suggest representing the size tree as a flat list. You traverse the data in the messageSize() static method of each Marshaler class, but instead of preparing a MessageSize you add the size directly to the list. In the encode static method, you traverse the size list in exactly the same order. Meaning, you only need a queue of sizes, built and consumed the way a queue is: first in, first out.
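A toy sketch of that flattening, with hypothetical names throughout: because the encode pass writes a parent's length prefix before its children, the size pass can reserve the parent's slot first and fill it in once the children are summed, so both passes walk the queue in the same first-in, first-out order.

```java
// Hypothetical size pass: reserves the parent slot up front so the encode pass,
// which consumes sizes front to back, sees the parent's size before the children's.
static int calculateSize(ResourceMetricsData data, MarshalerContext context) {
  int parentSlot = context.reserveSize();          // placeholder, dequeued first
  int size = ResourceMarshaler.calculateSize(data.getResource(), context);
  for (ScopeMetricsData scope : data.getScopeMetrics()) {
    size += ScopeMetricsMarshaler.calculateSize(scope, context);
  }
  context.setSize(parentSlot, size);               // fill in the reserved slot
  return size;
}
```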
I agree that it will have better performance in terms of allocation. On first read of your code, I found it very confusing. After thinking some more, I realized that the confusion is mainly caused by the way the code is written. I'm pretty sure that with good naming / abstractions this can be made very clear.
To summarize the first big change suggested in this PR: I agree it is much more performant, and can be made readable. I need to try it to see how readable I can make it.
The second big change is to how you serialize repeated elements, and mostly how the Serializer is now aware of the MarshalerContext, which is in fact the sizes tree. I don't like it; I think the serializer should be agnostic of the business logic it needs to serialize. Avoiding that is not hard, and it's not a performance loss.
@jack-berg I can try creating a 3rd draft based on this idea, but it's an effort so I'd like your opinion on the design before implementing it.
I wanted to get familiar with these ideas so I pulled down this branch, and made a little E2E POC of how to use this concept to serialize a simple, contrived proto data structure:
- Branch is here, commit is here
- Built a simple proto data structure with some of the things we need to account for in OTLP (strings, nested messages, repeated fields)
- Built corresponding Java in-memory representations for the data structure akin to MetricData, SpanData, etc., along with marshalers here
- I agree with @asafm that it's a bit hard to understand the code as is, but I also agree that we can improve it with the right abstractions / naming. I introduced an interface called StatelessMarshaler which mirrors Marshaler but with adjusted APIs to include MarshalerContext for passing state around (see the sketch after this list). Implementations of StatelessMarshaler are singletons. Added some helpers to Serializer and Marshaler with adjusted APIs for StatelessMarshaler.
- Here's a test demonstrating end to end behavior. It encodes in JSON format just for demonstration purposes, and spits out `{"p1":"value1","p2":{"c1":[{"g1":["1","2"]},{"g1":["3","4"]}],"c2":1},"p3":"value2"}` just as expected. Nice.
- The key for writing these marshalers is to ensure that everything that is added to the MarshalerContext stack when computing size is popped off in exactly the same order. To reduce the chance of errors, we'll need good abstractions / helpers, and good unit test coverage.
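A hedged sketch of the StatelessMarshaler shape described in the list above; the exact method names and signatures in the POC branch may differ:

```java
// Illustrative interface: like Marshaler, but implemented by stateless
// singletons that thread all per-call state through a MarshalerContext.
interface StatelessMarshaler<T> {
  // Size pass: computes the encoded size and records it in the context.
  int getBinarySerializedSize(MarshalerContext context, T value);

  // Encode pass: writes the value, consuming the sizes recorded earlier.
  void writeTo(Serializer output, MarshalerContext context, T value) throws IOException;
}
```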
All this is to say that I have a better understanding of what's happening now, and I'm on board with the approach, provided we do some things to improve readability.
@jack-berg I have modified the PR as you suggested
Sorry for the late reply - I was giving a lecture at a conference.
First, I really love the technique you used here: take a very big problem and build a small model of it, so we can iterate on the solution/design concepts. Very nice.
Second, I like the idea of making the marshalers implement a new interface and making them instances. This makes it easier to read the code, since you can pass the object around and avoid the Consumer<...> pattern, which was complicated to read. Very nice methodology.
Third, I saw that in your implementation you are only saving the traversed message sizes, and not the data itself - which is the approach I prefer as well.
Regarding MarshalerUtil.getUtf8Size - it doesn't take into account the field tag size.
Perhaps we need something like:
```java
/** Returns the size of a string field, encoded using UTF-8 encoding. */
public static int sizeStringUtf8(ProtoFieldInfo field, String message) {
  if (message.isEmpty()) {
    return 0;
  }
  return field.getTagSize() + CodedOutputStream.computeStringSizeNoTag(message);
}
```
Here we can choose whether to use CodedOutputStream.computeStringSizeNoTag(message) or the implementation that just calculates the length, which exists in your branch. If we use the latter, I believe we need to attribute where we took it from? It looks a tad different compared to https://github.com/google/guava/blob/master/guava/src/com/google/common/base/Utf8.java#L49
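For reference, a sketch of that allocation-free length calculation in the style of the linked Guava Utf8.encodedLength; this is my paraphrase, not the exact code from either branch, and it assumes well-formed surrogate pairs:

```java
// Counts the UTF-8 encoded length of a string without encoding it. Real code
// should also validate unpaired surrogates, which this sketch skips.
static int encodedUtf8Length(String string) {
  int utf16Length = string.length();
  int utf8Length = utf16Length; // every UTF-16 char is at least 1 UTF-8 byte
  int i = 0;
  while (i < utf16Length && string.charAt(i) < 0x80) {
    i++; // fast path over the ASCII prefix
  }
  for (; i < utf16Length; i++) {
    char c = string.charAt(i);
    if (c < 0x800) {
      utf8Length += (0x7f - c) >>> 31; // branch-free +1 for chars >= 0x80
    } else {
      utf8Length += 2; // 3-byte char, or half of a 4-byte surrogate pair
      if (Character.isHighSurrogate(c)) {
        i++; // skip the low surrogate; the pair totals 4 bytes (2 base + 2 here)
      }
    }
  }
  return utf8Length;
}
```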
Regarding ParentMarshaler.getBinarySize:
```java
int o1Size = MarshalerUtil.getUtf8Size(value.getO1());
context.addSize(o1Size);
size += o1Size;
```
Here we forgot to take into account the field tag size.
```java
size += ChildMarshaler.getInstance().getBinarySerializedSize(context, value.getO2());

int o3Size = MarshalerUtil.getUtf8Size(value.getO3());
context.addSize(o3Size);
size += o3Size;
```
Here we forgot to call context.addSize() for the result of ChildMarshaler.getInstance().getBinarySerializedSize(context, value.getO2()).
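Applied to the line above, the fix might look like this; a sketch mirroring the surrounding o1/o3 pattern, not code from the branch:

```java
// Record the child message size so the encode pass can dequeue it in order.
int o2Size = ChildMarshaler.getInstance().getBinarySerializedSize(context, value.getO2());
context.addSize(o2Size); // this call was missing
size += o2Size;
```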
Regarding the MarshalerContext methods addSize() and getSize(): I wouldn't place them directly on the context class. I would add something like
```java
class PrimitiveIntQueue {
  private int[] sizes = new int[16]; // grows on demand, like DynamicPrimitiveLongList
  private int writeIndex;
  private int readIndex;

  void enqueue(int size) {
    if (writeIndex == sizes.length) {
      sizes = java.util.Arrays.copyOf(sizes, sizes.length * 2);
    }
    sizes[writeIndex++] = size;
  }

  int dequeue() {
    return sizes[readIndex++];
  }
}
```
and use it:

```java
class MarshalerContext {
  // ...
  private final PrimitiveIntQueue binarySizesQueue = new PrimitiveIntQueue();

  PrimitiveIntQueue binarySizes() {
    return binarySizesQueue;
  }
}
```
Using it looks like:

```java
output.serializeString(Parent.P1, value.getO1(), context.binarySizes().dequeue());
```

instead of:

```java
output.serializeString(Parent.P1, value.getO1(), context.getSize());
```
What I'm trying to convey is that we have a queue of sizes, and seeing enqueue() and dequeue() in the getBinarySize() and encode() methods of the same Marshaler class will make more sense to the reader. It also moves the queue implementation out of the MarshalerContext class, making it less busy.
General question - once we have an agreement on how to proceed (the design), can I proceed to implementing it in the metrics? Or do you want @jack-berg to do it? How do we not step on each other's toes, also for traces?
I like where this is heading! Here are steps I see needed to proceed:
- Let's land one signal first for ease of review. Spans seem simplest because they avoid the extra layers of hierarchy that metrics have. @laurit if you're up for it, can you open a new PR with the scope reduced to spans? If not, I'm happy to help.
- Let's refactor the existing marshaler tests to be parameterized, such that we can confirm we have like-for-like behavior with the low allocation version of the marshalers. For spans, this would mean TraceRequestMarshalerTest (see the sketch after this list).
- To limit scope, let's not wire things up into the exporters yet. We can do that in a followup.
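A hedged sketch of that parameterized setup; the parameter source, constants, and names here are illustrative rather than the actual TraceRequestMarshalerTest code:

```java
// Runs the same assertion against both marshaler implementations, so the low
// allocation path must match the existing one byte for byte.
@ParameterizedTest
@EnumSource(MarshalerVariant.class) // e.g. DEFAULT and LOW_ALLOCATION
void marshalersProduceIdenticalOutput(MarshalerVariant variant) throws IOException {
  ByteArrayOutputStream bos = new ByteArrayOutputStream();
  variant.createMarshaler(SPAN_DATA).writeBinaryTo(bos);
  assertThat(bos.toByteArray()).isEqualTo(EXPECTED_BYTES);
}
```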
> General question - once we have an agreement on how to proceed (the design), can I proceed to implementing it in the metrics? Or do you want @jack-berg to do it? How do we not step on each other's toes, also for traces?
@asafm I know your ability to work on this has changed, but I think we can work together to land a PR establishing the patterns. After that, we can work on the other signals in parallel. We can also work on wiring this into the collectors and configuration in parallel once the first signal has landed.
@laurit let me know how you want to proceed. Happy to run with any of this or let you adjust scope to land the first PR, depending on your availability.
> if you're up for it, can you open a new PR with the scope reduced to spans? If not, I'm happy to help.
I'll start working on it tomorrow.
FYI @laurit - I'm ready to review this whenever you give me the heads up. I still see that it has draft status and includes a bunch of the metrics changes, which I think we should hold off on to make the review more manageable, so I'm holding off until I hear otherwise.
This has been superseded by other PRs.