kafka-go icon indicating copy to clipboard operation
kafka-go copied to clipboard

Stuck Write Message when there is a issue "multiple Read calls return no data or error"

Open anhtuanluu36 opened this issue 5 years ago • 10 comments

I have a strange issue Reproduce Steps:

  1. I call w.WriteMessages(ctx, msgs...)
  2. I have an error "multiple Read calls return no data or error"
  3. I call w.WriteMessages(ctx, msgs...) and it is stuck here

Here is my code:

 kafka.NewWriter(kafka.WriterConfig{
				Brokers:  brokers,
				Topic:    topic,
				Balancer: &kafka.Hash{},
				Dialer: &kafka.Dialer{
					Timeout:   10 * time.Second,
					DualStack: true,
					TLS:       tlsConfig,
				},
			})

Sometimes I see this error on Kafka:

 [2020-04-09 14:05:40,353] WARN Failed to send SSL Close message  (org.apache.kafka.common.network.SslTransportLayer)
java.io.IOException: Broken pipe
	at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at java.base/sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113)
	at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:79)
	at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:50)
	at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:466)
	at org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:210)
	at org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:173)
	at org.apache.kafka.common.utils.Utils.closeAll(Utils.java:718)
	at org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:61)
	at org.apache.kafka.common.network.Selector.doClose(Selector.java:748)
	at org.apache.kafka.common.network.Selector.close(Selector.java:736)
	at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:534)
	at org.apache.kafka.common.network.Selector.poll(Selector.java:424)
	at kafka.network.Processor.poll(SocketServer.scala:665)
	at kafka.network.Processor.run(SocketServer.scala:582)
	at java.base/java.lang.Thread.run(Thread.java:834)

Do you have any ideas?

anhtuanluu36 avatar Apr 09 '20 05:04 anhtuanluu36

I think it relate to use SSL and send a lot of messages at the same time. It works fine when I only push 5 messages with the method "w.WriteMessages"

anhtuanluu36 avatar Apr 10 '20 06:04 anhtuanluu36

What version are you on?

I ran into a recent problem where any error on writing caused the driver to retry all messages. However the driver also wrote all errors to a dedicated error channel. Since the channel was read by the main routine, the error was immediately returned, however the go routine retrying the write request on the next error was stuck waiting on the channel to be read (which would never happen, because the reading routine had returned). This caused the writing to hang forever.

pandaadb avatar Aug 19 '20 07:08 pandaadb

The same thing just happened to me, I believe. I had a network fault where the leader data for Kafka was stale. Then I saw this error, and my writer thread seemed to hang forever.

2020-08-25 14:47:37.713482500 ERROR: writer.go:745: error writing messages to TopicRedacted (partition 0): [6] Not Leader For Partition: the client attempted to send messages to a replica that is not the leader for some partition, the client's metadata are likely out of date
2020-08-25 14:47:37.814596500 ERROR: writer.go:745: error writing messages to TopicRedacted (partition 0): multiple Read calls return no data or error
2020-08-25 14:47:38.214899500 ERROR: writer.go:745: error writing messages to TopicRedacted (partition 0): multiple Read calls return no data or error

Presumably I've encountered the same error as you have. I haven't been able to dig down into it just yet, but it looks like the same problem.

Edit: It was the same problem. Got fixed by updating.

orn1983 avatar Aug 25 '20 20:08 orn1983

Hello, thank you all for reporting.

Would you be able to mention which version of kafka-go you're experiencing the issue with?

achille-roussel avatar Sep 29 '20 09:09 achille-roussel

I also encountered this problem. The version of kafka-go is 0.4.17, the version of kafka is 2.13-2.8.0. I triggered the problem when writing plenty of data. My kafka doesn't config TLS.

NitroCao avatar Aug 26 '21 09:08 NitroCao

@pandaadb do you have more details to share about this issue? Is the channel you are referring to managed within kafka-go?

I ran into a recent problem where any error on writing caused the driver to retry all messages. However the driver also wrote all errors to a dedicated error channel. Since the channel was read by the main routine, the error was immediately returned, however the go routine retrying the write request on the next error was stuck waiting on the channel to be read (which would never happen, because the reading routine had returned). This caused the writing to hang forever.

achille-roussel avatar Aug 31 '21 22:08 achille-roussel

Hard to say if it is the same issue, but I get multiple Read calls return no data or error from batch.Close when closing a kafka.Batch with unread messages.

vmihailenco avatar Sep 17 '21 09:09 vmihailenco

This is a recurring issue for me, though I am not able to reproduce it manually. It seems to always be the came cadence. Starts out with:

2022-01-29 16:47:30.393884500 ERROR: writer.go:745: error writing messages to TOPIC (partition N): [6] Not Leader For Partition: the client attempted to send messages to a replica that is not the leader for some partition, the client's metadata are likely out of date

Followed by:

2022-01-29 16:47:30.496844500 ERROR: writer.go:745: error writing messages to TOPIC (partition N): multiple Read calls return no data or error
2022-01-29 16:47:30.897083500 ERROR: writer.go:745: error writing messages to TOPIC (partition N): multiple Read calls return no data or error

There seems to have been a network fault and one of the zookeeper nodes lost connectivity to the cluster, to reconnect seconds later. This triggered a re-election and for some reason the client seems to be unaware of it and tries to write to a follower. Upon failing, it just gives up.

When I try to recreate this, the client always seems to just be aware of all of this and just recover. I can't figure out what's different.

Am I missing some obvious thing where I should be performing some manual action?

orn1983 avatar Jan 30 '22 18:01 orn1983

I see I wrote above (August 25th) that it got fixed by updating, but apparently I just wasn't able to recreate it.

orn1983 avatar Jan 30 '22 18:01 orn1983

Network connectivity issues can manifest in complex ways, I think the work in https://github.com/segmentio/kafka-go/pull/846 will help better surface the conditions that triggered those errors.

achille-roussel avatar Feb 15 '22 18:02 achille-roussel

i also got this same issue- [6] Not Leader For Partition: the client attempted to send messages to a replica that is not the leader for some partition, the client's metadata are likely out of date },

github.com/segmentio/kafka-go v0.4.38

There is no load in the kafka consumer OR publisher as such.

NehaSachan avatar Jan 09 '23 05:01 NehaSachan