apm-server icon indicating copy to clipboard operation
apm-server copied to clipboard

Introduce efficient codec for model events

Open axw opened this issue 4 years ago • 3 comments

Introduce an efficient binary codec for model events (at least transactions and spans), for persisting to local storage. This would be used in tail-based sampling, and perhaps in other processing such as https://github.com/elastic/apm-server/issues/5936.

For the most efficient codec, we should use a statically defined encoding. Barring arguments made for alternatives, let's use protobuf.

In order to have a statically defined encoding we'll need our model to be stable. The events we produce are now substantially ECS-compliant (which is stable), with a few exceptions. For example, we emit transaction.duration.us or span.duration.us; ideally we would record only the ECS event.duration field (https://github.com/elastic/apm-server/issues/5999).

I suggest we split this up into two phases:

Phase 1

  • Continue to maintain package model by hand, and continue to use this to emitting JSON documents
  • Introduce protobuf definitions, generate code in a separate package with protoc
  • Generate code to translate between handwritten model types and protoc-generated Go types

Phase 2

  • Use protobuf to generate JSON encoding code (https://github.com/elastic/apm-server/issues/3565)
  • Investigate using protoc-generated Go types everywhere, and removing handwritten model types

For phase 2 we'll need some more changes to remove any remaining logic in the model package, e.g. setting <event>.duration.us. This can be done in an ingest pipeline.

axw avatar Aug 31 '20 08:08 axw

I spent a bit of time looking into this, with https://github.com/elastic/apm-server/compare/master...axw:model-protobuf being the fruits of my labour.

I wrote a generator which parses the model types and generates protobuf definitions, and then used plain old golang/protobuf to generate the code. I was going to use gogo/protobuf but had some probably-PEBKAC issues and ended up abandoned that for this short spike.

For generating protobuf correctly we would probably need to check them in and take the old definitions into account, so we don't accidentally remove fields and break compatibility with previous releases. Another option would be to have hand-written protobuf as the source of truth for our model types and generate the model types instead, using standard protoc tooling. That would be more typical and almost certainly less error prone. It might be painful to migrate though.

While testing I also realised that the model types we have do not have any encoding/json tags, and so we end up encoding all the empty fields. I added some json:",omitempty" tags to a bunch of fields to reduce this.

I compared this with the existing JSON codec, but bear in mind that the benchmarks are far from comprehensive. They currently only encode model.Transactions with the TraceID and ID fields set, and nothing else. With a more comprehensive set of fields, we might observe a different outcome. Nevertheless, here's what I found.

name                                      time/op
ReadEvents/json_codec/0_events-12            739ns ± 5%
ReadEvents/json_codec/1_events-12           7.59µs ± 7%
ReadEvents/json_codec/10_events-12          65.3µs ±21%
ReadEvents/json_codec/100_events-12          415µs ±24%
ReadEvents/json_codec/1000_events-12        7.20ms ±12%
ReadEvents/protobuf_codec/0_events-12       1.00µs ±24%
ReadEvents/protobuf_codec/1_events-12       5.15µs ± 3%
ReadEvents/protobuf_codec/10_events-12      35.0µs ±27%
ReadEvents/protobuf_codec/100_events-12      320µs ±20%
ReadEvents/protobuf_codec/1000_events-12    2.55ms ±32%
ReadEvents/nop_codec/0_events-12             833ns ±11%
ReadEvents/nop_codec/1_events-12            3.91µs ±12%
ReadEvents/nop_codec/10_events-12           24.8µs ±24%
ReadEvents/nop_codec/100_events-12           160µs ± 5%
ReadEvents/nop_codec/1000_events-12         1.52ms ±38%
WriteTransaction/json_codec-12              3.67µs ±10%
WriteTransaction/protobuf_codec-12           900ns ± 5%
WriteTransaction/nop_codec-12                377ns ± 7%

name                                      alloc/op
ReadEvents/json_codec/0_events-12             360B ± 0%
ReadEvents/json_codec/1_events-12           3.17kB ± 0%
ReadEvents/json_codec/10_events-12          24.2kB ± 0%
ReadEvents/json_codec/100_events-12          234kB ± 0%
ReadEvents/json_codec/1000_events-12        1.34MB ± 0%
ReadEvents/protobuf_codec/0_events-12         360B ± 0%
ReadEvents/protobuf_codec/1_events-12       2.88kB ± 0%
ReadEvents/protobuf_codec/10_events-12      21.2kB ± 0%
ReadEvents/protobuf_codec/100_events-12      205kB ± 0%
ReadEvents/protobuf_codec/1000_events-12    1.53MB ± 0%
ReadEvents/nop_codec/0_events-12              360B ± 0%
ReadEvents/nop_codec/1_events-12            2.30kB ± 0%
ReadEvents/nop_codec/10_events-12           15.5kB ± 0%
ReadEvents/nop_codec/100_events-12           147kB ± 0%
ReadEvents/nop_codec/1000_events-12         1.10MB ± 0%
WriteTransaction/json_codec-12                624B ± 0%
WriteTransaction/protobuf_codec-12            640B ± 0%
WriteTransaction/nop_codec-12                 256B ± 0%

name                                      allocs/op
ReadEvents/json_codec/0_events-12             8.00 ± 0%
ReadEvents/json_codec/1_events-12             27.0 ± 0%
ReadEvents/json_codec/10_events-12             136 ± 0%
ReadEvents/json_codec/100_events-12          1.22k ± 0%
ReadEvents/json_codec/1000_events-12         6.55k ± 0%
ReadEvents/protobuf_codec/0_events-12         8.00 ± 0%
ReadEvents/protobuf_codec/1_events-12         25.0 ± 0%
ReadEvents/protobuf_codec/10_events-12         116 ± 0%
ReadEvents/protobuf_codec/100_events-12      1.02k ± 0%
ReadEvents/protobuf_codec/1000_events-12     4.52k ± 0%
ReadEvents/nop_codec/0_events-12              8.00 ± 0%
ReadEvents/nop_codec/1_events-12              20.0 ± 0%
ReadEvents/nop_codec/10_events-12             66.0 ± 0%
ReadEvents/nop_codec/100_events-12             516 ± 0%
ReadEvents/nop_codec/1000_events-12          1.32k ± 0%
WriteTransaction/json_codec-12                6.00 ± 0%
WriteTransaction/protobuf_codec-12            6.00 ± 0%
WriteTransaction/nop_codec-12                 4.00 ± 0%

name                                      speed
ReadEvents/json_codec/0_events-12
ReadEvents/json_codec/1_events-12         44.2MB/s ± 8%
ReadEvents/json_codec/10_events-12        52.3MB/s ±24%
ReadEvents/json_codec/100_events-12       82.7MB/s ±28%
ReadEvents/json_codec/1000_events-12      46.9MB/s ±11%
ReadEvents/protobuf_codec/0_events-12
ReadEvents/protobuf_codec/1_events-12     14.8MB/s ± 3%
ReadEvents/protobuf_codec/10_events-12    22.3MB/s ±23%
ReadEvents/protobuf_codec/100_events-12   24.3MB/s ±22%
ReadEvents/protobuf_codec/1000_events-12  31.0MB/s ±30%
ReadEvents/nop_codec/0_events-12
ReadEvents/nop_codec/1_events-12
ReadEvents/nop_codec/10_events-12
ReadEvents/nop_codec/100_events-12
ReadEvents/nop_codec/1000_events-12
WriteTransaction/json_codec-12            84.9MB/s ±10%
WriteTransaction/protobuf_codec-12        57.8MB/s ± 5%
WriteTransaction/nop_codec-12

Overall the protobuf implementation is faster for both reading and writing, and encodes as fewer bytes (see the MB/s metrics). The protobuf implementation produces fewer allocations, but allocates more memory overall. I imagine this is due to the intermediate protobuf-generated types; we might be able to optimise here, I haven't looked into it.

I think this is worth pursuing, but we should probably do some real-world performance testing before embarking on an epic journey.

axw avatar Oct 29 '20 10:10 axw

@axw can you update the issue with open pre-conditions?

simitt avatar Dec 31 '21 14:12 simitt

@simitt I've added some detail to the description. I think this is unblocked now.

axw avatar Jan 04 '22 05:01 axw

See also https://github.com/elastic/apm-data/issues/36

axw avatar May 12 '23 05:05 axw

I think we covered everything on the list. Closing this.

kruskall avatar Jun 23 '23 17:06 kruskall