terminus icon indicating copy to clipboard operation
terminus copied to clipboard

Kafka microservice health check does not work

Open jirawat-tokenx opened this issue 3 years ago • 7 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current behavior

Configuring a microservice ping check, like:

       this.kafka.pingCheck('KAFKA', {
          name: configService.getKafkaClientName(),
          transport: Transport.KAFKA,
          options: {
            client: {
              brokers: configService.getKafkaHost(),
            },
            consumer: {
              groupId: configService.getKafkaGroupId(),
            },
          },
        });

The check will alway return "timeout of x ms exceeded" whatever how many timeout will be set.

But my service still working well with the same Microservice config.

Minimum reproduction code

       this.kafka.pingCheck('KAFKA', {
          name: configService.getKafkaClientName(),
          transport: Transport.KAFKA,
          options: {
            client: {
              brokers: configService.getKafkaHost(),
            },
            consumer: {
              groupId: configService.getKafkaGroupId(),
            },
          },
        });

Steps to reproduce

No response

Expected behavior

Health check should be return UP status when the kafka is able to connect

Package version

8.0.4

NestJS version

8.2.6

Node.js version

14.17.3

In which operating systems have you tested?

  • [X] macOS
  • [ ] Windows
  • [ ] Linux

Other

No response

jirawat-tokenx avatar Feb 20 '22 03:02 jirawat-tokenx

Hi if there is anything I can provide for this issue please let me know

jirawat-tokenx avatar Feb 26 '22 06:02 jirawat-tokenx

Hi @jirawat-tokenx, I advise you to leave a minimal reproduction of a clonable git repository so that the core team can evaluate the problem you have reported.

Tony133 avatar Feb 27 '22 22:02 Tony133

The check will always return "timeout of x ms exceeded" whatever how many timeout will be set.

This is also the case for me out of the box, however, I can get a successful response by increasing the timeout directly in the options (I can't see this in the options you shared @jirawat-tokenx)

This is what I'm doing to make it work.

const kafkaOptions = {
    transport: kafkaConfig.transport,
    options: kafkaConfig.options,
    timeout: 10000,
};
    
this.microservice.pingCheck<KafkaOptions>('Kafka', kafkaOptions)

That being said, the response time is unmanageable. A ping check that is taking ~10 seconds to give back a response is absolutely way too much. Is that something that can be looked at @Tony133?

For context, DB and HTTP health indicators that I'm using (from this same package) have a response time of ~1 second.

fer8a avatar Mar 25 '22 14:03 fer8a

Hi @fer8a, i think maybe that's something that can be improved, however if you can create a minimal reproduction in a clonable git repository the Nest Core Team can better assess the issue.

Tony133 avatar Mar 25 '22 22:03 Tony133

Hello @Tony133 sorry for the late reply. It's been some busy days.

Take a look at the following respository It's a simple example exposing the health indicator at the base route /. You can use it for the tests.

fer8a avatar Mar 30 '22 21:03 fer8a

Hi all, I tried @fer8a minimum reproduction, I did some tests setting the timeout up to 5140 gives the following error (see screenshot):

screenshot

Log terminal: Schermata 2022-03-31 alle 19 04 35

while setting the timeout on 5200 and increasing it, the status check works correctly (see screenshot):

screenshot

Log Terminal: screenshot

This is a brief summary of the test I took. You need to understand if this is a specific problem that affects only the @nestjs/microservices package or if it also affects the @nestjs/terminus package as a whole. Since internally for microservices terminus uses the @nestjs/microservices package.

P.s. For the Kafka service I used a docker container.

Tony133 avatar Mar 31 '22 17:03 Tony133

Hi everyone. I found that kafka pingCheck works if added producerOnlyMode: true flag. Think, for those who're looking for a solution, it could be helpful.

Example:

this.microservice.pingCheck('kafka', {
          transport: Transport.KAFKA,
          options: {
            ...this.configService.get('kafka.options'),
            producerOnlyMode: true,
      },
}),

ohudenkoongage avatar Jul 01 '22 07:07 ohudenkoongage

@Tony133 If you have time, please check https://github.com/jlahtinen/nestjskafkaterminus

I think this is about as simple as it can get to reproduce described problem. I have not idea how to fix this.

If producerOnlyMode: true is the fix, it should be documented and make clear are the consequences using it (for me all seems to be working with that on localhost, but on k8 all starts to fail again like without producerOnlyMode).

jlahtinen avatar Oct 26 '22 18:10 jlahtinen

Hi @jlahtinen, I tried your minimal reproduction in a local environment with docker and managed to get it working, but I had to increase the timeout to 9000ms or more, using NestJS v9.x and Terminus v9.x compared to my message I had posted above with NestJS v8.x and Terminus v8.x.

I attach here two screenshots:

Schermata 2022-10-28 alle 18 52 51 Schermata 2022-10-28 alle 18 53 51

I hope I have been of some help

Tony133 avatar Oct 28 '22 19:10 Tony133

@Tony133 Thank you. Works as you described.

jlahtinen avatar Nov 01 '22 05:11 jlahtinen

A big problem with this is that the check creates a new connection, causing Kafka to rebalance the consumers, then closes it, which triggers another rebalance.

I think, it's an issue with the approach itself: instead of checking if the existing connection is healthy, the indicator checks if Kafka is connectable, so we don't really know if the microservice itself is still connected.

Ugzuzg avatar Jan 10 '23 10:01 Ugzuzg

Yeah, I agree that it's not actually terminus issue. It's natural state of things for Kafka. If you want to check both producer and consumer connections you have to take into account that it takes some time to join consumer group and make consumer connection stable. But you can just use producerOnlyMode to significantly speed up things (as suggested above). It should give you a necessary amount of confidence that your microservice is able to reach Kafka cluster.

@Tony133 I'd probably consider introducing default config for Kafka transport consumer config with producerOnlyMode set to true. Or, if we don't want to mutate health check config from inside of the library we could think of adding warning note to the documentation.

EugeneKorshenko avatar Jan 24 '23 19:01 EugeneKorshenko

producerOnlyMode will be set per default to true now. We've also added additional e2e tests to ensure this behavior. If you already have producerOnlyMode set to false it will overwrite the default behavior.

Released with v10.1.0 🎉

BrunnerLivio avatar Sep 14 '23 21:09 BrunnerLivio