terminus Kafka microservice health check does not work

Is there an existing issue for this?

[X] I have searched the existing issues

Current behavior

Configuring a microservice ping check, like:

       this.kafka.pingCheck('KAFKA', {
          name: configService.getKafkaClientName(),
          transport: Transport.KAFKA,
          options: {
            client: {
              brokers: configService.getKafkaHost(),
            },
            consumer: {
              groupId: configService.getKafkaGroupId(),
            },
          },
        });

The check will alway return "timeout of x ms exceeded" whatever how many timeout will be set.

But my service still working well with the same Microservice config.

Minimum reproduction code

       this.kafka.pingCheck('KAFKA', {
          name: configService.getKafkaClientName(),
          transport: Transport.KAFKA,
          options: {
            client: {
              brokers: configService.getKafkaHost(),
            },
            consumer: {
              groupId: configService.getKafkaGroupId(),
            },
          },
        });

Steps to reproduce

No response

Expected behavior

Health check should be return UP status when the kafka is able to connect

Package version

8.0.4

NestJS version

8.2.6

Node.js version

14.17.3

In which operating systems have you tested?

[X] macOS
[ ] Windows
[ ] Linux

Other

No response

Feb 20 '22 03:02 jirawat-tokenx

Hi if there is anything I can provide for this issue please let me know

Feb 26 '22 06:02 jirawat-tokenx

Hi @jirawat-tokenx, I advise you to leave a minimal reproduction of a clonable git repository so that the core team can evaluate the problem you have reported.

Feb 27 '22 22:02 Tony133

The check will always return "timeout of x ms exceeded" whatever how many timeout will be set.

This is also the case for me out of the box, however, I can get a successful response by increasing the timeout directly in the options (I can't see this in the options you shared @jirawat-tokenx)

This is what I'm doing to make it work.

const kafkaOptions = {
    transport: kafkaConfig.transport,
    options: kafkaConfig.options,
    timeout: 10000,
};
    
this.microservice.pingCheck<KafkaOptions>('Kafka', kafkaOptions)

That being said, the response time is unmanageable. A ping check that is taking ~10 seconds to give back a response is absolutely way too much. Is that something that can be looked at @Tony133?

For context, DB and HTTP health indicators that I'm using (from this same package) have a response time of ~1 second.

Mar 25 '22 14:03 fer8a

Hi @fer8a, i think maybe that's something that can be improved, however if you can create a minimal reproduction in a clonable git repository the Nest Core Team can better assess the issue.

Mar 25 '22 22:03 Tony133

Hello @Tony133 sorry for the late reply. It's been some busy days.

Take a look at the following respository It's a simple example exposing the health indicator at the base route /. You can use it for the tests.

Mar 30 '22 21:03 fer8a

Hi all, I tried @fer8a minimum reproduction, I did some tests setting the timeout up to 5140 gives the following error (see screenshot):

Log terminal: Schermata 2022-03-31 alle 19 04 35

while setting the timeout on 5200 and increasing it, the status check works correctly (see screenshot):

Log Terminal: screenshot

This is a brief summary of the test I took. You need to understand if this is a specific problem that affects only the @nestjs/microservices package or if it also affects the @nestjs/terminus package as a whole. Since internally for microservices terminus uses the @nestjs/microservices package.

P.s. For the Kafka service I used a docker container.

Mar 31 '22 17:03 Tony133

Hi everyone. I found that kafka pingCheck works if added producerOnlyMode: true flag. Think, for those who're looking for a solution, it could be helpful.

Example:

this.microservice.pingCheck('kafka', {
          transport: Transport.KAFKA,
          options: {
            ...this.configService.get('kafka.options'),
            producerOnlyMode: true,
      },
}),

Jul 01 '22 07:07 ohudenkoongage

@Tony133 If you have time, please check https://github.com/jlahtinen/nestjskafkaterminus

I think this is about as simple as it can get to reproduce described problem. I have not idea how to fix this.

If producerOnlyMode: true is the fix, it should be documented and make clear are the consequences using it (for me all seems to be working with that on localhost, but on k8 all starts to fail again like without producerOnlyMode).

Oct 26 '22 18:10 jlahtinen

Hi @jlahtinen, I tried your minimal reproduction in a local environment with docker and managed to get it working, but I had to increase the timeout to 9000ms or more, using NestJS v9.x and Terminus v9.x compared to my message I had posted above with NestJS v8.x and Terminus v8.x.

I attach here two screenshots:

I hope I have been of some help

Oct 28 '22 19:10 Tony133

@Tony133 Thank you. Works as you described.

Nov 01 '22 05:11 jlahtinen

A big problem with this is that the check creates a new connection, causing Kafka to rebalance the consumers, then closes it, which triggers another rebalance.

I think, it's an issue with the approach itself: instead of checking if the existing connection is healthy, the indicator checks if Kafka is connectable, so we don't really know if the microservice itself is still connected.

Jan 10 '23 10:01 Ugzuzg

Yeah, I agree that it's not actually terminus issue. It's natural state of things for Kafka. If you want to check both producer and consumer connections you have to take into account that it takes some time to join consumer group and make consumer connection stable. But you can just use producerOnlyMode to significantly speed up things (as suggested above). It should give you a necessary amount of confidence that your microservice is able to reach Kafka cluster.

@Tony133 I'd probably consider introducing default config for Kafka transport consumer config with producerOnlyMode set to true. Or, if we don't want to mutate health check config from inside of the library we could think of adding warning note to the documentation.

Jan 24 '23 19:01 EugeneKorshenko

producerOnlyMode will be set per default to true now. We've also added additional e2e tests to ensure this behavior. If you already have producerOnlyMode set to false it will overwrite the default behavior.

Released with v10.1.0 🎉

Sep 14 '23 21:09 BrunnerLivio