ErrNoProgress due to parsing/concurrency issue?
Describe the bug
We're using the library across hundreds of instances reading from a Kafka cluster, and the rate of these ErrNoProgress errors is alarmingly high, at around 10/s across all the components. I believe they're all coming from this bit of logic, and I noticed a change was made to handle this particular error more gracefully, but it would still close the connection and open a new one. Closing and opening tens of connections per second can add significant load to the Kafka cluster and impact performance.
More concretely, I wonder if the `c.concurrency() == 1` check is catching some cases introduced by concurrency/parsing issues or bugs in the library, as opposed to actual data corruption on the wire. There are many scenarios that could cause this symptom, for example (see the sketch after this list):
- A goroutine releases the read lock prematurely, before properly advancing the buffer past the response meant for it. The next goroutine then interprets the first 8 leftover bytes as the size and correlation id, getting gibberish values.
- A goroutine incorrectly parses the bytes in the buffer and either leaves some bytes unprocessed, or advances into the bytes meant for the next response.
- A goroutine writes the request but calls `leave` before processing the response. (Looks like this could be happening already.)
- Any message initiated by the cluster that's not a response message and does not have a correlation id. (Still reading through the Kafka protocol; not sure if this is possible.)
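To make the framing invariant behind these scenarios concrete, here is a minimal, illustrative Go sketch. It is not kafka-go's actual internals; `sharedConn` and `readResponse` are made-up names, assuming a shared connection where each response is framed as a 4-byte size plus a 4-byte correlation id:

```go
package sketch

import (
	"encoding/binary"
	"io"
	"sync"
)

// Illustrative only (not kafka-go's internals): a shared broker connection
// where every response is framed as [4-byte size][4-byte correlation id]
// [payload], and size counts everything after the size field itself.
type sharedConn struct {
	mu sync.Mutex
	r  io.Reader
}

// readResponse is what each reader goroutine would run while holding the
// lock. The invariant: consume exactly `size` bytes before unlocking.
func (c *sharedConn) readResponse(handle func(correlationID int32, body io.Reader) error) error {
	c.mu.Lock()
	defer c.mu.Unlock()

	var size, corrID int32
	if err := binary.Read(c.r, binary.BigEndian, &size); err != nil {
		return err
	}
	if err := binary.Read(c.r, binary.BigEndian, &corrID); err != nil {
		return err
	}

	// Cap the handler at the bytes that belong to this response only.
	body := io.LimitReader(c.r, int64(size)-4) // minus the correlation id already read
	if err := handle(corrID, body); err != nil {
		return err
	}

	// Drain whatever the handler left behind so the next goroutine starts
	// at a frame boundary. Leaving these bytes in place is exactly the
	// corruption described in the list above: the next reader parses
	// payload bytes as a size/correlation-id header and gets gibberish.
	_, err := io.Copy(io.Discard, body)
	return err
}
```

Any of the scenarios above amounts to breaking that "consume exactly size bytes before unlocking" invariant.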
Something else worth mentioning: if I add a 50ms sleep between iterations in the read loop, this error goes away completely, which further suggests it's due to some sort of contention.
In other words, there's probably data corruption, but I don't believe it's introduced by the transport layer (TCP). We should get to the bottom of it and eliminate the root cause instead of simply dropping the connection and starting over.
Kafka Version
- What version(s) of Kafka are you testing against? 2.5.1
- What version of kafka-go are you using? v0.4.42
To Reproduce
Run many consumers and observe the increase in reader errors or, in older versions of the library, the log message "the kafka reader got an unknown error reading...multiple Read calls return no data or error".
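For reference, a rough reproduction sketch using the public reader API; the broker address, topic, and group id are placeholders:

```go
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

// Rough reproduction sketch; broker address, topic, and group id are
// placeholders. Run many copies of this consumer against the same cluster
// and watch the reader error logs/stats for ErrNoProgress-driven reconnects.
func main() {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "errnoprogress-repro",
		Topic:   "some-topic",
	})
	defer r.Close()

	for {
		msg, err := r.ReadMessage(context.Background())
		if err != nil {
			log.Printf("reader error: %v", err)
			continue
		}
		_ = msg // payload not relevant to the reproduction
	}
}
```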
Expected Behavior
Very few to zero ErrNoProgress errors due to corrupt data, even when there are many consumers.
Observed Behavior
Tens of reader errors per second due to ErrNoProgress, which lead to frequent reader reconnections.
I spent a day poking around and found the most promising theory to be the following:
- A goroutine couldn't finish reading all the bytes meant for it; leaving unfinished bytes in the buffer corrupts the data seen by the next reader goroutine, causing the ErrNoProgress error here.
- The first goroutine didn't finish consuming all the bytes in the read buffer because it hit `r.empty = true` here.
- `r.empty` was set to true because `highWaterMark == offset` here.
- When `highWaterMark == offset`, I tried logging `size` and `remaining`, and saw that there are clearly bytes left in the buffer (e.g. `highWaterMark == offset` with size 366 and remaining 279). So the goroutine currently holding the read lock should not just give up.
- I see a previous change was made and reverted around how we handle the case where `highWaterMark == offset`. I tried that change, and all the ErrNoProgress errors went away. Over a couple of hours I saw one burst of UnexpectedEOF errors, which I think is what caused the revert of that change.
- Upon further investigation, these UnexpectedEOF errors happen when `highWaterMark == offset` and `remain == 0`. Setting `r.empty = true` only when both conditions are true feels like the right logic (see the sketch after this list).
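Here is a sketch of the condition described in the last bullet, using simplified stand-in fields rather than the actual reader internals:

```go
package sketch

// Simplified stand-in for the batch-reading state; the field names are
// illustrative, not the actual kafka-go reader internals.
type batchState struct {
	highWaterMark int64
	offset        int64
	remain        int64
	empty         bool
}

// Roughly the current behaviour: the batch is treated as empty as soon as
// highWaterMark == offset, even if `remain` says there are still bytes to
// consume, which leaves them in the shared buffer for the next goroutine.
func markEmptyCurrent(s *batchState) {
	s.empty = s.highWaterMark == s.offset
}

// The condition proposed in the last bullet: only treat the batch as empty
// when both hold, so a batch that still has bytes is drained by the
// goroutine that owns it.
func markEmptyProposed(s *batchState) {
	s.empty = s.highWaterMark == s.offset && s.remain == 0
}
```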
cc'ing @rhansen2. Sorry, I know it's been more than 2 years, but I wanted to keep you in the loop since you're the author of #788. Do you by any chance remember any details regarding your statement in that PR? 🙏
Seems like In newer versions of Kafka the high water mark being equal to the offset does not mean there are no messages in the batch.
Same here. After 24h of debugging I ended up with the same conclusion as you, @zachxu42: the `highWaterMark == offset` check is not correct. HighWatermark is the last offset that is available, not the last+1.
Moreover, there are other cases where a similar thing happens. If an empty batch is received, the library does not clear the buffer correctly, and the next read gets a no-progress error.
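A hypothetical helper sketching what "clearing the buffer" would have to do in that empty-batch case; this is not the library's actual API, just an illustration of the required drain step:

```go
package sketch

import "io"

// Hypothetical helper (not the library's actual API): even when a fetch
// returns an "empty" batch with no usable records, any bytes still counted
// as remaining must be skipped before the connection is handed to the next
// reader; otherwise the next header parse starts mid-batch and surfaces as
// ErrNoProgress.
func discardEmptyBatch(r io.Reader, remain int64) error {
	_, err := io.CopyN(io.Discard, r, remain)
	return err
}
```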
Thanks @aratz-lasa for the comment. Can you please elaborate on this part?
HighWatermark is the last offset that is available, not the last+1.
Also, I never fully understood the reason behind this `highWaterMark == offset` check. My understanding is that highWaterMark is the offset of the last fully replicated message, and offset is the offset of the next message to read. When they are equal, doesn't that mean there's exactly 1 message to read? Then that should be allowed.
The twist is, as I mentioned above, that sometimes you get `highWaterMark == offset` AND `remain == 0`. Does that mean there's only 1 message to read, and that message is empty? In my PR linked above I only skip reading in that case. Actually, why check highWaterMark at all? If we just returned an empty batch when `remain == 0`, wouldn't that be also/more correct?
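To spell out the two readings of highWaterMark being debated here, expressed as the number of messages a consumer positioned at `offset` could still read (illustrative only; which reading matches the broker's behaviour is exactly the open question):

```go
package sketch

// The two readings of highWaterMark debated above, expressed as the number
// of messages a consumer positioned at `offset` could still read.

// Reading 1: highWaterMark is the next offset to be assigned (last + 1).
// Then offset == highWaterMark means 0 messages are available, and
// treating the batch as empty would be correct.
func availableIfHWMIsNext(highWaterMark, offset int64) int64 {
	return highWaterMark - offset
}

// Reading 2: highWaterMark is the last offset that is available.
// Then offset == highWaterMark means exactly 1 message is still readable,
// and declaring the batch empty drops it (or leaves its bytes behind).
func availableIfHWMIsLast(highWaterMark, offset int64) int64 {
	return highWaterMark - offset + 1
}
```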
HighWatermark is the last offset that is available, not the last+1. Also, I never fully understood the reason behind this `highWaterMark == offset` check. My understanding is that highWaterMark is the offset of the last fully replicated message, and offset is the offset of the next message to read. When they are equal, doesn't that mean there's exactly 1 message to read? Then that should be allowed.
I think you both are just saying the same thing.