the kafka reader got an unknown error reading partition X of Y at offset Z: multiple Read calls return no data or error

Open veqryn opened this issue 3 years ago • 1 comments

Describe the bug

We have two topics on our Kafka cluster. Both have multiple partitions and both see millions of messages per day: one approaches half a billion, and the other is in the 1-10 million range. I have two separate Go processes running, one reading from each topic. Every day I see hundreds of messages in my log that look like:

the kafka reader got an unknown error reading partition 46 of mytopicname at offset 929168893: multiple Read calls return no data or error

It has been this way since I started using this library about a year ago, and I have tried to keep the library up to date. I can confirm there is data at that partition and offset:

$ ./bin/kafka-console-consumer.sh --bootstrap-server kafka2-00.xxx.net:9092 --topic mytopicname --partition 46 --offset 929168893 --max-messages 1
{"route".......

Kafka Version

Kafka version: 2.4.0
kafka-go version: 0.4.32

To Reproduce

kafkaReader := kafka.NewReader(kafka.ReaderConfig{
	Brokers: strings.Split(env.KafkaBrokers, ","),
	GroupID: env.KafkaGroup,
	Topic:   env.KafkaTopic,
	ErrorLogger: kafka.LoggerFunc(func(s string, i ...interface{}) {
		log.WithField("error", fmt.Sprintf(s, i...)).Info("Kafka library debug")
	}),
})

msg, err := consumer.kafkaReader.FetchMessage(ctx)
if err != nil {
	panic(err)
}
myCh <- msg

Expected Behavior

The reader should not report this error when the partition and offset have data, which they do. It should return that message instead.

Observed Behavior

the kafka reader got an unknown error reading partition 46 of mytopicname at offset 929168893: multiple Read calls return no data or error

veqryn, Jun 17 '22 00:06
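
The tail of that log line, "multiple Read calls return no data or error", is the message of Go's standard `io.ErrNoProgress`. `bufio.Reader` reports it after many consecutive reads that return neither data nor an error. The sketch below only demonstrates that standard-library behavior. `stalledReader` and `readFromStalled` are hypothetical names, and whether this is the exact path taken inside kafka-go is an assumption that the thread does not confirm.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
)

// stalledReader is a hypothetical reader that never makes progress:
// every Read returns zero bytes and a nil error.
type stalledReader struct{}

func (stalledReader) Read(p []byte) (int, error) { return 0, nil }

// readFromStalled wraps the stalled reader in a bufio.Reader and tries
// to read a single byte, returning whatever error bufio reports.
func readFromStalled() error {
	_, err := bufio.NewReader(stalledReader{}).ReadByte()
	return err
}

func main() {
	err := readFromStalled()
	fmt.Println(errors.Is(err, io.ErrNoProgress)) // true
	fmt.Println(err)                              // multiple Read calls return no data or error
}
```

If this is indeed the source, the error describes a stalled connection rather than a missing record, which would be consistent with the console consumer finding data at the same offset.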

Can you share a bit more about your setup? Before blindly handling this error, it would be really nice if we had a reproducible test case, so we can figure out if there's a deeper issue that needs fixing.

dominicbarnes, Jul 01 '22 17:07

This was fixed in #941, which handles this error by closing the connection. The fix suppresses the error log and metrics for this case, but the normal retry policy still wraps this logic, so a persistent problem should still cause a failure. The change shipped in v0.4.35. Feel free to re-open if this needs to be addressed further.

dominicbarnes, Sep 09 '22 17:09
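
For readers who want a picture of the approach described in that last comment, here is a minimal sketch of "close the connection on a no-progress error, stay quiet, and let a bounded retry surface persistent failures". It is not the code from #941. The `fetcher` interface, the `session` type, `fakeConn`, and the `maxAttempts` parameter are all hypothetical and exist only for illustration. It also assumes the error in question is `io.ErrNoProgress`.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// fetcher is a hypothetical stand-in for a partition connection.
// It is not a kafka-go type.
type fetcher interface {
	Fetch() (string, error)
	Close() error
}

// session keeps one connection and redials lazily after dropping it.
type session struct {
	dial func() (fetcher, error)
	conn fetcher
}

// fetch returns the next message. On io.ErrNoProgress it closes the
// stalled connection without logging and retries on a fresh one.
// Any other error is returned immediately, and a no-progress error
// that persists for maxAttempts is returned wrapped.
func (s *session) fetch(maxAttempts int) (string, error) {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if s.conn == nil {
			c, err := s.dial()
			if err != nil {
				return "", err
			}
			s.conn = c
		}
		msg, err := s.conn.Fetch()
		if err == nil {
			return msg, nil
		}
		if !errors.Is(err, io.ErrNoProgress) {
			return "", err
		}
		s.conn.Close() // drop the stalled connection; the next pass redials
		s.conn = nil
		lastErr = err
	}
	return "", fmt.Errorf("no progress after %d attempts: %w", maxAttempts, lastErr)
}

// fakeConn is a test double: it stalls when stall is true and counts closes.
type fakeConn struct {
	stall  bool
	closed *int
}

func (f fakeConn) Fetch() (string, error) {
	if f.stall {
		return "", io.ErrNoProgress
	}
	return "payload", nil
}

func (f fakeConn) Close() error { *f.closed++; return nil }

func main() {
	dials, closes := 0, 0
	s := &session{dial: func() (fetcher, error) {
		dials++
		// Only the first connection stalls.
		return fakeConn{stall: dials == 1, closed: &closes}, nil
	}}
	msg, err := s.fetch(3)
	fmt.Println(msg, err, dials, closes) // payload <nil> 2 1
}
```

The point of bounding the retries is the one made in the comment above: a transient stall is absorbed silently, while a connection that never makes progress still ends in an error the caller can see.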