Kafka microservice health check does not work
Is there an existing issue for this?
- [X] I have searched the existing issues
Current behavior
Configuring a microservice ping check, like:
this.kafka.pingCheck('KAFKA', {
name: configService.getKafkaClientName(),
transport: Transport.KAFKA,
options: {
client: {
brokers: configService.getKafkaHost(),
},
consumer: {
groupId: configService.getKafkaGroupId(),
},
},
});
The check will alway return "timeout of x ms exceeded" whatever how many timeout will be set.
But my service still working well with the same Microservice config.
Minimum reproduction code
this.kafka.pingCheck('KAFKA', {
name: configService.getKafkaClientName(),
transport: Transport.KAFKA,
options: {
client: {
brokers: configService.getKafkaHost(),
},
consumer: {
groupId: configService.getKafkaGroupId(),
},
},
});
Steps to reproduce
No response
Expected behavior
Health check should be return UP status when the kafka is able to connect
Package version
8.0.4
NestJS version
8.2.6
Node.js version
14.17.3
In which operating systems have you tested?
- [X] macOS
- [ ] Windows
- [ ] Linux
Other
No response
Hi if there is anything I can provide for this issue please let me know
Hi @jirawat-tokenx, I advise you to leave a minimal reproduction of a clonable git repository so that the core team can evaluate the problem you have reported.
The check will always return "timeout of
xms exceeded" whatever how many timeout will be set.
This is also the case for me out of the box, however, I can get a successful response by increasing the timeout directly in the options (I can't see this in the options you shared @jirawat-tokenx)
This is what I'm doing to make it work.
const kafkaOptions = {
transport: kafkaConfig.transport,
options: kafkaConfig.options,
timeout: 10000,
};
this.microservice.pingCheck<KafkaOptions>('Kafka', kafkaOptions)
That being said, the response time is unmanageable. A ping check that is taking ~10 seconds to give back a response is absolutely way too much. Is that something that can be looked at @Tony133?
For context, DB and HTTP health indicators that I'm using (from this same package) have a response time of ~1 second.
Hi @fer8a, i think maybe that's something that can be improved, however if you can create a minimal reproduction in a clonable git repository the Nest Core Team can better assess the issue.
Hello @Tony133 sorry for the late reply. It's been some busy days.
Take a look at the following respository
It's a simple example exposing the health indicator at the base route /.
You can use it for the tests.
Hi all, I tried @fer8a minimum reproduction, I did some tests setting the timeout up to 5140 gives the following error (see screenshot):
Log terminal:

while setting the timeout on 5200 and increasing it, the status check works correctly (see screenshot):
Log Terminal:

This is a brief summary of the test I took.
You need to understand if this is a specific problem that affects only the @nestjs/microservices package or if it also affects the @nestjs/terminus package as a whole. Since internally for microservices terminus uses the @nestjs/microservices package.
P.s. For the Kafka service I used a docker container.
Hi everyone. I found that kafka pingCheck works if added producerOnlyMode: true flag. Think, for those who're looking for a solution, it could be helpful.
Example:
this.microservice.pingCheck('kafka', {
transport: Transport.KAFKA,
options: {
...this.configService.get('kafka.options'),
producerOnlyMode: true,
},
}),
@Tony133 If you have time, please check https://github.com/jlahtinen/nestjskafkaterminus
I think this is about as simple as it can get to reproduce described problem. I have not idea how to fix this.
If producerOnlyMode: true is the fix, it should be documented and make clear are the consequences using it (for me all seems to be working with that on localhost, but on k8 all starts to fail again like without producerOnlyMode).
Hi @jlahtinen, I tried your minimal reproduction in a local environment with docker and managed to get it working, but I had to increase the timeout to 9000ms or more, using NestJS v9.x and Terminus v9.x compared to my message I had posted above with NestJS v8.x and Terminus v8.x.
I attach here two screenshots:
I hope I have been of some help
@Tony133 Thank you. Works as you described.
A big problem with this is that the check creates a new connection, causing Kafka to rebalance the consumers, then closes it, which triggers another rebalance.
I think, it's an issue with the approach itself: instead of checking if the existing connection is healthy, the indicator checks if Kafka is connectable, so we don't really know if the microservice itself is still connected.
Yeah, I agree that it's not actually terminus issue. It's natural state of things for Kafka.
If you want to check both producer and consumer connections you have to take into account that it takes some time to join consumer group and make consumer connection stable.
But you can just use producerOnlyMode to significantly speed up things (as suggested above). It should give you a necessary amount of confidence that your microservice is able to reach Kafka cluster.
@Tony133 I'd probably consider introducing default config for Kafka transport consumer config with producerOnlyMode set to true.
Or, if we don't want to mutate health check config from inside of the library we could think of adding warning note to the documentation.
producerOnlyMode will be set per default to true now. We've also added additional e2e tests to ensure this behavior.
If you already have producerOnlyMode set to false it will overwrite the default behavior.
Released with v10.1.0 🎉