Reduce heap and GC overhead
Reduce heap and GC overhead by rendering packets on the caller threads and queuing byte arrays or byte buffers.
The messages are currently rendered to instances of `java.lang.String` and are queued up in `StatsDSender`. Rendering the packets on caller threads and queuing `byte[]` or `ByteBuffer` instances will:
- Yield a 2x reduction in heap volume occupied by potentially not-so-short-lived objects
- Reduce the GC load - smaller message objects are cheaper to copy between generations
- Reduce the amount of work done on the `StatsDSender` thread and increase the throughput
We could also skip the intermediate `StringBuilder` representation and render straight to `ByteBuffer`.
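To make the first suggestion concrete, here is a minimal sketch of caller-side rendering. The class name, queue capacity, and the single counter-metric wire format are my own illustration, not the client's actual API:

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: render the dogstatsd line on the caller thread and
// queue the finished byte[] instead of a java.lang.String.
final class CallerSideRenderer {
    private final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(4096);

    // Renders e.g. "my.metric:42|c|#env:prod" straight to UTF-8 bytes.
    static byte[] render(String aspect, long value, String type, String... tags) {
        StringBuilder sb = new StringBuilder(64);
        sb.append(aspect).append(':').append(value).append('|').append(type);
        for (int i = 0; i < tags.length; i++) {
            sb.append(i == 0 ? "|#" : ",").append(tags[i]);
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    boolean offer(String aspect, long value, String type, String... tags) {
        // Only the compact byte[] stays on the heap until the sender drains it;
        // the aspect and tag Strings become garbage immediately.
        return queue.offer(render(aspect, value, type, tags));
    }
}
```

The same idea taken further (rendering into a pooled `ByteBuffer` instead of going through `StringBuilder`) would also remove the intermediate `String` allocation per message.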
We have merged and released https://github.com/DataDog/java-dogstatsd-client/pull/105 to reduce the number of allocations; we now rely on `StringBuilder`s differently to achieve this.
We can probably do even better, and your suggestion is still very much valid, but for now this was a low-impact alternative for our customers that would work without any code changes.
We have some more work in the pipeline, so hopefully we can reduce the GC load even further. Thank you for your insight @deadok22 🙇
Hi @truthbk! Thanks for following up on this.
I took a glance at #105 and I must say I fail to see how some aspects of that change are a performance improvement. Let me elaborate:
- The messages are now relayed from the caller threads to the `StatsDSender` thread using `ArrayBlockingQueue<Message>` rather than `ArrayBlockingQueue<String>`.
- The `String` and `String[]` objects used to represent tags are now retained until the message is sent. In the previous version only 2 potentially long-lived objects were used to represent a message (the `String` and its `char[]`).
- The `Message` object representation comes with an additional object allocation. It is also retained until the message is sent.
- The messages themselves are now formatted on the `StatsDSender` thread. The old code performed formatting on the caller threads, so the throughput may have been reduced as a result of this change. (See the third suggestion in this issue - the opposite was done in #105.)
It seems the new version may incur considerably more load on the GC than the old one in some scenarios.
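To illustrate the retention difference, here is a side-by-side sketch of what each queued message keeps reachable on the heap. These classes are illustrative stand-ins, not the client's actual types:

```java
// Old model: the caller formats eagerly, so the queue retains only the
// finished String (and its backing char[]) per message.
final class FormattedMessage {
    final String line; // e.g. "my.metric:42|c|#env:prod"
    FormattedMessage(String line) { this.line = line; }
}

// Post-#105 model, as I read it: formatting is deferred, so the aspect
// String, the tag String[] array, and every tag String element all stay
// reachable until a processor thread renders the message.
final class DeferredMessage {
    final String aspect;
    final long value;
    final String[] tags; // the array plus each element is retained
    DeferredMessage(String aspect, long value, String[] tags) {
        this.aspect = aspect; this.value = value; this.tags = tags;
    }
}
```

With N tags, the deferred form keeps roughly N + 3 extra objects alive per queued message compared to the eagerly formatted one.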
I just realized what I referred to as the `StatsDSender` thread is now a different thing (apologies - I have an older version of the code in my head) - a set of threads responsible for rendering the messages into `ByteBuffer`s. So a message's lifecycle looks like this:
- A caller thread creates a `Message` object and puts it to a queue
- A processor thread gets a `Message` object from that queue, renders it into a `ByteBuffer`, and puts it to another queue
- The sender thread gets the `ByteBuffer` and sends it over to the agent
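The lifecycle above could be sketched as follows. This is a simplified, hypothetical model - class and method names, the non-blocking queue calls, and the counter-only wire format are mine, not the client's:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Two-stage pipeline sketch: caller -> processor -> sender.
final class Pipeline {
    static final class Message {
        final String aspect; final long value; final String[] tags;
        Message(String aspect, long value, String[] tags) {
            this.aspect = aspect; this.value = value; this.tags = tags;
        }
    }

    // Stage 1: caller threads hand Message objects to processor threads.
    final BlockingQueue<Message> messages = new ArrayBlockingQueue<>(1024);
    // Stage 2: processor threads hand rendered buffers to the sender thread.
    final BlockingQueue<ByteBuffer> buffers = new ArrayBlockingQueue<>(1024);

    // Caller thread: allocate a Message and enqueue it; formatting is
    // deferred, so the aspect and tag Strings stay reachable until stage 2.
    boolean submit(String aspect, long value, String... tags) {
        return messages.offer(new Message(aspect, value, tags));
    }

    // Processor thread: render one Message into a ByteBuffer and enqueue it.
    void processOne() {
        Message m = messages.poll();
        if (m == null) return;
        StringBuilder sb = new StringBuilder(m.aspect).append(':').append(m.value).append("|c");
        for (int i = 0; i < m.tags.length; i++) {
            sb.append(i == 0 ? "|#" : ",").append(m.tags[i]);
        }
        buffers.offer(ByteBuffer.wrap(sb.toString().getBytes(StandardCharsets.UTF_8)));
    }

    // Sender thread: take the rendered buffer; the real client would write
    // it to the agent's socket here.
    ByteBuffer takeRendered() {
        return buffers.poll();
    }
}
```

Every message crosses two queues in this model, which is where the extra synchronization and allocation I mention below comes from.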
Although more work is now parallelized and some work is offloaded from the caller threads, I'm concerned about the extra synchronization, allocation, and heap overhead the new processing model incurs.
What benchmarks were used for evaluating the new processing model and the changes in #105?