
Synchronous protocol deserialization blocking event loop

Open darthjoel opened this issue 8 years ago • 7 comments

I am running a system that processes about 3M messages per day. It uses Node 0.10, Kafka 0.9, and high-level consumers via no-kafka 2.5.6. When fetch sizes get large, the consumers start to rebalance endlessly because event-loop latency spikes to high levels (over 1 second, as reported by event-loop-monitor). The culprit appears to be the synchronous call that deserializes a fetch response inside client.js: https://github.com/oleksiyk/kafka/blob/v2.5.6/lib/client.js#L337

I wrote a simple test case that demonstrates this. The call takes 250 ms to 1.3 s on the test fetch responses included in my test case. Given that 1MB is the default fetch size, and that the Node event loop should not be blocked for more than ~250 ms, this looks like a performance bug in how the protocol is deserialized.

Simple test case: https://github.com/darthjoel/no-kafka-test-case
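For readers who want to observe the same symptom, here is a minimal event-loop lag probe — a rough, hypothetical stand-in for the event-loop-monitor module mentioned above, not part of no-kafka. It schedules a repeating timer and reports how late each tick fires; any synchronous decode blocking the loop shows up as lag:

```javascript
// Minimal event-loop lag probe: each interval tick should arrive roughly
// intervalMs after the previous one; any excess means the loop was blocked.
function measureLag(intervalMs, onLag) {
  let last = Date.now();
  const timer = setInterval(() => {
    const now = Date.now();
    onLag(Math.max(0, now - last - intervalMs));
    last = now;
  }, intervalMs);
  return timer; // caller can clearInterval(timer) to stop probing
}
```

Wrapping a suspect synchronous call between two probe readings makes blocking of hundreds of milliseconds easy to spot in production logs.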

darthjoel avatar Jul 18 '16 15:07 darthjoel

The protocol in no-kafka is synchronous in both encode and decode. I don't see it being rewritten in the near future, unfortunately.

oleksiyk avatar Jul 18 '16 15:07 oleksiyk

The workarounds are to decrease the fetch size, increase the number of consumers, and use async compression to reduce the number of individual messages decoded by the synchronous protocol.

oleksiyk avatar Jul 18 '16 15:07 oleksiyk

Thanks for the quick response. Due to message size requirements, I cannot drop the fetch size below 1MB. I updated my test case — take a look. With a 1MB fetch size (again, the library default), reading the buffer takes 250 ms and _mapTopics, which is also synchronous, takes another 2.5 seconds. With a 6MB fetch size that goes up to 32 seconds! If the event loop is blocked for seconds, the consumer heartbeats don't get sent out, which triggers rebalancing. Improving the _mapTopics function alone could have a big effect on performance, I think. I'm going to see what I can do, but any suggestions would be appreciated.

darthjoel avatar Jul 18 '16 18:07 darthjoel

[EDIT] Huge perf improvement in Client._mapTopics from moving from Node 0.10 to Node 4.4. It has nothing to do with the version of lodash, as this comment previously stated. Calls to the _.merge function go from 2.3 s to 100 ms. The speed of reading and decoding a buffer (protocol.read(data).FetchResponse().result) remains unchanged and is still quite slow for a synchronous call (250 ms for 1MB).
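For reference, timings like the ones quoted in this thread can be captured with process.hrtime. The helper below (timeSync is a hypothetical name, not part of no-kafka) wraps any synchronous call and logs its duration:

```javascript
// Time a synchronous call with process.hrtime and log the duration in ms.
function timeSync(label, fn) {
  const start = process.hrtime();
  const result = fn();
  const diff = process.hrtime(start);
  const ms = diff[0] * 1e3 + diff[1] / 1e6;
  console.log(label + ' took ' + ms.toFixed(1) + 'ms');
  return result;
}

// Example usage against the call discussed above (decoder not included here):
// const response = timeSync('reading 1MB buffer',
//   () => protocol.read(data).FetchResponse().result);
```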

darthjoel avatar Jul 18 '16 19:07 darthjoel

Thanks, I'll have a look

oleksiyk avatar Jul 19 '16 13:07 oleksiyk

I did a bunch more testing. With a 1MB fetch response that contains lots of very small messages, here are the timings of the synchronous function calls:

reading 1MB buffer took 317ms
_mapTopics 1MB took 94ms

With a 6MB fetch response also containing very small messages:

reading 6MB buffer took 1745ms
_mapTopics 6MB took 437ms

However, with a 1.2MB fetch response containing messages of 1K in size, the time goes down significantly:

reading 1.2MB buffer took 49ms
_mapTopics 1.2MB took 10ms

Here is a 4MB fetch response containing 1K messages:

reading 4MB buffer took 164ms
_mapTopics 4MB took 34ms

The bottom line: if you have (1) lots of very small messages, or (2) a large-ish fetch size (which may be needed if your message sizes vary widely), then the synchronous calls that decode the fetch response can introduce enough event-loop lag to make high-level consumers rebalance endlessly (as happened to me).
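The timings above suggest the decode cost scales with the message *count*, not just the byte count, since each message carries fixed per-message parsing overhead. A toy illustration (this is not no-kafka's actual wire format, just a simple length-prefixed framing I made up for the example):

```javascript
// Toy length-prefixed framing: [4-byte big-endian length][payload], repeated.
// Decoding walks the buffer once, but does fixed work per message, so a 1MB
// buffer of tiny messages costs far more than 1MB of 1K messages.
function encodeMessages(payloads) {
  const parts = [];
  for (const p of payloads) {
    const len = Buffer.alloc(4);
    len.writeUInt32BE(p.length, 0);
    parts.push(len, p);
  }
  return Buffer.concat(parts);
}

function decodeMessages(buf) {
  const out = [];
  let off = 0;
  while (off < buf.length) {
    const len = buf.readUInt32BE(off);
    out.push(buf.slice(off + 4, off + 4 + len));
    off += 4 + len;
  }
  return out;
}
```

With framing like this, halving the average message size roughly doubles the number of loop iterations (and allocations) per fetch, which matches the pattern in the measurements above.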

However, if you have a 1MB fetch size with 1K messages you should be fine. I am using a 4MB fetch size with 1K+ message sizes, and it seems fine: I see event-loop delays of up to 2 seconds, but not long enough to cause the consumers to rebalance.

Eventually it would be good to make the decoding async, but it is completely workable now as long as you monitor consumer event loop lag.
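One way the decoding could eventually be made async is sketched below as a generic helper — this is my own illustration, not no-kafka code: decode messages in fixed-size chunks and yield to the event loop with setImmediate between chunks, so heartbeats and other timers get a chance to run. decodeOne stands in for a hypothetical per-message synchronous decoder:

```javascript
// Decode an array of raw messages in chunks, yielding to the event loop
// between chunks so long decodes no longer starve timers and heartbeats.
function decodeAsync(messages, decodeOne, chunkSize, done) {
  const out = [];
  let i = 0;
  (function step() {
    const end = Math.min(i + chunkSize, messages.length);
    for (; i < end; i++) {
      out.push(decodeOne(messages[i]));
    }
    if (i < messages.length) {
      setImmediate(step); // give the event loop a turn before continuing
    } else {
      done(null, out);
    }
  })();
}
```

The total CPU cost stays the same; the win is that the blocking is amortized into short slices instead of one multi-second stall.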

darthjoel avatar Aug 01 '16 14:08 darthjoel

I updated my test case on GitHub, so you should be able to reproduce the results if you run it with Node 4.x. Running with Node 0.10 (which I know is really old, but it's what I was using in production) shows much worse performance due to the slowness of _.merge().

darthjoel avatar Aug 01 '16 14:08 darthjoel