hono icon indicating copy to clipboard operation
hono copied to clipboard

Hono does not recover from some Kafka Exceptions

Open WatcherWhale opened this issue 8 months ago • 4 comments

We see that if Hono encountes a Kafka AuthenticationException, it does not try to re-connect to the Kafka brokers and stops sending messages to any topic. This is a similar issue as #3544.

It seems Hono does not consider an AuthenticationException as a fatal error, which causes the cached producer to not be closed: https://github.com/eclipse-hono/hono/blob/02b0a917a39cad5310c216dd1b344cb1f0b552cf/clients/kafka-common/src/main/java/org/eclipse/hono/client/kafka/producer/CachingKafkaProducerFactory.java#L259-L265

If we apply the following patch to Hono we see it does correctly recover from such an exception:

    public static boolean isFatalError(final Throwable error) {
        return error instanceof ProducerFencedException
                || error instanceof OutOfOrderSequenceException
                || error instanceof AuthorizationException
+               || error instanceof AuthenticationException
                || error instanceof UnsupportedVersionException
                || error instanceof UnsupportedForMessageFormatException;
    }
  • Are there any other Exceptions that should close the cached producer?
  • Maybe we should just exclude exceptions that are non-fatal instead of determining what is fatal. If Kafka ever adds another fatal exception, some unexpected dangerous behavior could occur.
  • Like stated #3544, there are maybe other ways to determine if a cached producer needs to be closed. Like consecutive number of message that could not be produced, ...

WatcherWhale avatar Apr 28 '25 08:04 WatcherWhale

@calohmn WDYT?

sophokles73 avatar Apr 28 '25 13:04 sophokles73

Maybe we should just exclude exceptions that are non-fatal instead of determining what is fatal. If Kafka ever adds another fatal exception, some unexpected dangerous behavior could occur.

That might indeed put us on the safe side. However, I have no clue what kind of errors would/should be on the non-fatal list ...

sophokles73 avatar Apr 28 '25 13:04 sophokles73

Hmm, looking at the code, it seems strange to me, why the AuthenticationException hadn't been added as well back then. I can't remember any discussions on this back then.

Maybe we should just exclude exceptions that are non-fatal instead of determining what is fatal.

Maybe. The list of non-fatal exceptions is much longer though. At first glance, such a list doesn't really look easily manageable to me.

If Kafka ever adds another fatal exception, some unexpected dangerous behavior could occur.

Yes. My guess would be that such exceptions don't get added often, though. We could add a safe-guard for noticing such a scenario here maybe: Since all fatal-error exception classes are in the org.apache.kafka.common.errors package, we could add a unit-test that determines all class-names in this package and compares it to the pre-determined list for the current Kafka version. When updating Kafka and when there is an added Exception class, the test would fail and the error could indicate that the added exception should be checked whether to add it to the fatal list.

calohmn avatar May 11 '25 17:05 calohmn

Maybe it is best if we first implement the patch for the auth exception, while we still discuss how we can detect new fatal exceptions. I am willing to contribute that.

@calohmn that is indeed something we could do, hopefully the kafka team does not introduce any new exceptions that regularly, otherwise this could become a hassle. I also do not have a clue how I should implement this.

WatcherWhale avatar Jun 20 '25 07:06 WatcherWhale