Reduce heap and GC overhead
Reduce heap and GC overhead by rendering packets on the caller threads and queuing byte arrays or byte buffers.
The messages are currently rendered to instances of `java.lang.String` and are queued up in `StatsDSender`. Rendering the packets on caller threads and queuing `byte[]` or `ByteBuffer` instances will:
- Yield a 2x reduction in heap volume occupied by potentially not-so-short-lived objects
- Reduce the GC load - smaller message objects are cheaper to copy between generations
- Reduce the amount of work done on the `StatsDSender` thread and increase the throughput
We could also skip the intermediate `StringBuilder` representation and render straight to `ByteBuffer`.
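To make the first suggestion concrete, here is a minimal sketch of caller-side rendering. The class name, queue capacity, and the single counter-metric wire format are my own illustration, not the client's actual API:

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: render the dogstatsd line on the caller thread and
// queue the finished byte[] instead of a java.lang.String.
final class CallerSideRenderer {
    private final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(4096);

    // Renders e.g. "my.metric:42|c|#env:prod" straight to UTF-8 bytes.
    static byte[] render(String aspect, long value, String type, String... tags) {
        StringBuilder sb = new StringBuilder(64);
        sb.append(aspect).append(':').append(value).append('|').append(type);
        for (int i = 0; i < tags.length; i++) {
            sb.append(i == 0 ? "|#" : ",").append(tags[i]);
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    boolean offer(String aspect, long value, String type, String... tags) {
        // Only the compact byte[] stays on the heap until the sender drains it;
        // the aspect and tag Strings become garbage immediately.
        return queue.offer(render(aspect, value, type, tags));
    }
}
```

The same idea taken further (rendering into a pooled `ByteBuffer` instead of going through `StringBuilder`) would also remove the intermediate `String` allocation per message.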
We have merged and released https://github.com/DataDog/java-dogstatsd-client/pull/105 to reduce the number of allocations; we now rely on `StringBuilder`s differently to achieve this.
We can probably do even better, and your suggestion is still very much valid, but for now this was a low-impact alternative for our customers that would work without any code changes.
We have some more work in the pipeline, so hopefully we can reduce the GC load even further. Thank you for your insight @deadok22 🙇
Hi @truthbk! Thanks for following up on this.
I took a glance at #105 and I must say I fail to see how some aspects of that change are a performance improvement. Let me elaborate:
- The messages are now relayed from the caller threads to the `StatsDSender` thread using `ArrayBlockingQueue<Message>` rather than `ArrayBlockingQueue<String>`.
- The `String` and `String[]` objects used to represent tags are now retained until the message is sent. In the previous version only 2 potentially long-lived objects were used to represent a message (the `String` and its `char[]`).
- The `Message` object representation comes with an additional object allocation. It is also retained until the message is sent.
- The messages themselves are now formatted on the `StatsDSender` thread. The old code performed formatting on the caller threads, so the throughput may have been reduced as a result of this change. (See the third suggestion in this issue - the opposite was done in #105.)
It seems the new version may incur considerably more load on the GC than the old one in some scenarios.
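To illustrate the retention difference, here is a side-by-side sketch of what each queued message keeps reachable on the heap. These classes are illustrative stand-ins, not the client's actual types:

```java
// Old model: the caller formats eagerly, so the queue retains only the
// finished String (and its backing char[]) per message.
final class FormattedMessage {
    final String line; // e.g. "my.metric:42|c|#env:prod"
    FormattedMessage(String line) { this.line = line; }
}

// Post-#105 model, as I read it: formatting is deferred, so the aspect
// String, the tag String[] array, and every tag String element all stay
// reachable until a processor thread renders the message.
final class DeferredMessage {
    final String aspect;
    final long value;
    final String[] tags; // the array plus each element is retained
    DeferredMessage(String aspect, long value, String[] tags) {
        this.aspect = aspect; this.value = value; this.tags = tags;
    }
}
```

With N tags, the deferred form keeps roughly N + 3 extra objects alive per queued message compared to the eagerly formatted one.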
I just realized what I referred to as the `StatsDSender` thread is now a different thing (apologies - I have an older version of the code in my head) - a set of threads responsible for rendering the messages into `ByteBuffer`s. So a message's lifecycle looks like this:
- A caller thread creates a `Message` object and puts it to a queue
- A processor thread gets a `Message` object from that queue, renders it into a `ByteBuffer`, and puts it to another queue
- The sender thread gets the `ByteBuffer` and sends it over to the agent
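The lifecycle above could be sketched as follows. This is a simplified, hypothetical model - class and method names, the non-blocking queue calls, and the counter-only wire format are mine, not the client's:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Two-stage pipeline sketch: caller -> processor -> sender.
final class Pipeline {
    static final class Message {
        final String aspect; final long value; final String[] tags;
        Message(String aspect, long value, String[] tags) {
            this.aspect = aspect; this.value = value; this.tags = tags;
        }
    }

    // Stage 1: caller threads hand Message objects to processor threads.
    final BlockingQueue<Message> messages = new ArrayBlockingQueue<>(1024);
    // Stage 2: processor threads hand rendered buffers to the sender thread.
    final BlockingQueue<ByteBuffer> buffers = new ArrayBlockingQueue<>(1024);

    // Caller thread: allocate a Message and enqueue it; formatting is
    // deferred, so the aspect and tag Strings stay reachable until stage 2.
    boolean submit(String aspect, long value, String... tags) {
        return messages.offer(new Message(aspect, value, tags));
    }

    // Processor thread: render one Message into a ByteBuffer and enqueue it.
    void processOne() {
        Message m = messages.poll();
        if (m == null) return;
        StringBuilder sb = new StringBuilder(m.aspect).append(':').append(m.value).append("|c");
        for (int i = 0; i < m.tags.length; i++) {
            sb.append(i == 0 ? "|#" : ",").append(m.tags[i]);
        }
        buffers.offer(ByteBuffer.wrap(sb.toString().getBytes(StandardCharsets.UTF_8)));
    }

    // Sender thread: take the rendered buffer; the real client would write
    // it to the agent's socket here.
    ByteBuffer takeRendered() {
        return buffers.poll();
    }
}
```

Every message crosses two queues in this model, which is where the extra synchronization and allocation I mention below comes from.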
Although more work is now parallelized and some work is offloaded from the caller threads, I'm concerned about the extra synchronization, allocation, and heap overhead the new processing model incurs.
What benchmarks were used for evaluating the new processing model and the changes in #105?