hono constantly fails to publish messages to the kafka broker without being able to recover

Open JeffreyThijs opened this issue 2 years ago • 4 comments

Hi,

We recently ran into a situation where hono suddenly started failing to publish messages to our kafka broker. The broker was working perfectly fine when we discovered the issue, but we cannot rule out that a brief disturbance of the broker in the meantime caused it. Either way, we would expect hono to be able to recover from this, since dropping messages is really not desirable. After we restarted the pod and it came back up, the adapter worked as before and the issue was gone.

Sadly, we did not save any logs of this issue and have not been able to reproduce it. However, when I looked into the code I found the following, which might be the reason why the system was not able to recover:

The following condition determines whether the cached KafkaProducer will be closed or not:

https://github.com/eclipse-hono/hono/blob/02b0a917a39cad5310c216dd1b344cb1f0b552cf/clients/kafka-common/src/main/java/org/eclipse/hono/client/kafka/producer/CachingKafkaProducerFactory.java#L210

This condition is determined by:

https://github.com/eclipse-hono/hono/blob/master/clients/kafka-common/src/main/java/org/eclipse/hono/client/kafka/producer/CachingKafkaProducerFactory.java#L259

So the cached KafkaProducer is only invalidated (which might be essential for recovering from failed publishes to the kafka broker) if one of those errors is thrown, which raises some questions:

  • why these particular errors?
  • were they empirically determined?
  • are any of the others https://kafka.apache.org/11/javadoc/org/apache/kafka/common/errors/package-summary.html not considered as fatal?
  • why not specify the errors for which it shouldn't reset?
  • isn't this condition dangerous for newly defined errors?
  • isn't there a better metric to determine whether to remove the KafkaProducer from the cache (like the number of consecutive messages that could not be produced, etc.)?
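
The last question could be approached along these lines. This is only a minimal sketch: `ConsecutiveFailureTracker` is a hypothetical helper that does not exist in Hono, and both the wiring into the send callback and the threshold value are assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper (not part of Hono): counts consecutive failed sends
 * and signals when a cached producer should be discarded.
 */
final class ConsecutiveFailureTracker {

    private final int threshold;
    private final AtomicInteger consecutiveFailures = new AtomicInteger();

    ConsecutiveFailureTracker(final int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    /** Records a successful send, which ends the current failure streak. */
    void onSuccess() {
        consecutiveFailures.set(0);
    }

    /**
     * Records a failed send.
     *
     * @return true once the streak has reached the threshold, meaning the
     *         caller should close the producer and remove it from the cache.
     */
    boolean onFailure() {
        return consecutiveFailures.incrementAndGet() >= threshold;
    }
}
```

A caller would invoke `onSuccess()` or `onFailure()` from the result handler of each send and drop the cached producer when `onFailure()` returns true. Such a counter would complement the exception-type check rather than replace it, since some errors are worth reacting to on the first occurrence.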

On the other hand, maybe I have tunnel vision and this was caused by something else?

Any comments are well appreciated!

JeffreyThijs avatar Sep 20 '23 15:09 JeffreyThijs

@calohmn would you mind taking a look? I believe this falls into your area of expertise :-)

sophokles73 avatar Sep 26 '23 06:09 sophokles73

I think the exceptions being checked in the isFatalError method were chosen because they are mentioned as fatal exceptions in the KafkaProducer javadoc. Looking at the different kinds of Kafka exceptions, I don't currently see any other Kafka exception to include in this isFatalError method. Of course there could potentially also be other kinds of exceptions like IllegalStateException, IllegalArgumentException, NullPointerException (caused by a bug in the (Vertx-) Kafka client) or an OutOfMemoryError leading to a defunct state of the producer. In that sense, we could consider also treating any non-KafkaException as a fatal error, as a precautionary measure.
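
The precautionary measure could look roughly like the following. This is a sketch only, not the Hono implementation: `PrecautionaryFatalCheck` is a hypothetical class, and the base type and the list of known fatal types are passed in as parameters so that the example compiles without the Kafka client on the classpath. In Hono the base type would presumably be `org.apache.kafka.common.KafkaException`.

```java
import java.util.List;

/** Hypothetical sketch (not Hono code) of the precautionary check. */
final class PrecautionaryFatalCheck {

    private PrecautionaryFatalCheck() {
    }

    /**
     * @param error           the error reported for a failed send
     * @param clientErrorBase base type of the errors the client library is
     *                        expected to report (assumed to be KafkaException)
     * @param knownFatal      error types documented as fatal
     * @return true if the producer should be considered defunct
     */
    static boolean isFatalError(
            final Throwable error,
            final Class<? extends Throwable> clientErrorBase,
            final List<Class<? extends Throwable>> knownFatal) {

        // Anything outside the expected hierarchy (NullPointerException,
        // IllegalStateException, OutOfMemoryError, ...) hints at a broken
        // producer, so it is treated as fatal.
        if (!clientErrorBase.isInstance(error)) {
            return true;
        }
        // Inside the hierarchy, only the explicitly listed types are fatal.
        return knownFatal.stream().anyMatch(type -> type.isInstance(error));
    }
}
```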

It would be better to have an idea of what went wrong here, though (also because I haven't seen any cases yet where a Hono Kafka producer stopped working and the pod had to be restarted).

@JeffreyThijs You also don't have any tracing data for this case? Have you set any Kafka producer config properties in your protocol adapter config?

calohmn avatar Sep 29 '23 10:09 calohmn

Sorry for the late reply.

Unfortunately, our logging stack was not in place when this problem occurred, so we don't have any tracing data from the incident. It might indeed be a good idea to also treat non-Kafka errors as fatal errors (or to just explicitly list the errors that should not be handled as fatal) in order to avoid continuing to discard messages due to an unrecoverable error.

JeffreyThijs avatar Oct 12 '23 11:10 JeffreyThijs

@JeffreyThijs is this still an issue?

sophokles73 avatar Sep 24 '24 07:09 sophokles73

@sophokles73, we haven't encountered this issue again so feel free to close it.

JeffreyThijs avatar Nov 19 '24 12:11 JeffreyThijs

@sophokles73 @calohmn We re-encountered the same error recently. In this case, it was because the Hono kafka credentials were incorrect for a small period of time, causing a SASL Authentication error. After the credential was valid again, Hono did not retry the connection, which resulted in no messages being published on its topics.

We found out that by adding the following, Hono correctly retries the connection and recovers from the brief downtime:

    public static boolean isFatalError(final Throwable error) {
        return error instanceof ProducerFencedException
                || error instanceof OutOfOrderSequenceException
                || error instanceof AuthorizationException
+               || error instanceof AuthenticationException
                || error instanceof UnsupportedVersionException
                || error instanceof UnsupportedForMessageFormatException;
    }

Maybe it is worth looking into whether any other exceptions are missing, or inverting the statement so that only the exceptions we do not want registered as fatal are excluded?
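
To illustrate the inverted variant, here is a minimal sketch under stated assumptions. `InvertedFatalCheck` is a hypothetical class, not existing Hono code, and which exception types belong in the recoverable set is an open question; Kafka's `RetriableException` hierarchy might be a starting point, but that is an assumption. The set is injected so the example runs without the Kafka client.

```java
import java.util.Set;

/**
 * Hypothetical sketch (not Hono code): every error is fatal unless its
 * type is explicitly listed as recoverable.
 */
final class InvertedFatalCheck {

    private final Set<Class<? extends Throwable>> recoverable;

    InvertedFatalCheck(final Set<Class<? extends Throwable>> recoverable) {
        // Defensive, immutable copy of the caller's set.
        this.recoverable = Set.copyOf(recoverable);
    }

    /**
     * @return true unless the error is an instance of a listed recoverable
     *         type, so newly introduced error types are fatal by default.
     */
    boolean isFatalError(final Throwable error) {
        return recoverable.stream().noneMatch(type -> type.isInstance(error));
    }
}
```

With this shape, an error type that nobody anticipated (such as the AuthenticationException above) invalidates the cached producer by default. The trade-off is that a transient error missing from the recoverable set would cause producers to be recreated more often than necessary.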

More details on how we reproduced our issue:

  • Hono was configured to use a user with SCRAM-SHA-512
  • Our kafka cluster temporarily had the same user but with SCRAM-SHA-256
  • After setting it back to SCRAM-SHA-512 Hono does not recover the kafka connection

WatcherWhale avatar Apr 23 '25 14:04 WatcherWhale

@WatcherWhale would you mind opening a new issue for this? It would also be helpful if you could provide a link to the point in the code where you would like to make the change ...

sophokles73 avatar Apr 27 '25 13:04 sophokles73