
Microservice Stopped Consuming Silently

Open Blutude opened this issue 2 years ago • 6 comments

Describe the bug We noticed that one of our microservices had silently stopped consuming events for about a day. Manually restarting the service fixed it, but we need a way to have it restarted automatically, or at least to have the service know that it is no longer consuming events.

I looked at all kafkajs error logs for that entire day and did not see anything that would indicate why it stopped consuming events. Usually we look for KafkaJSNonRetriableError errors and automatically force a service restart when we detect them, but none of those appeared in the logs.

I exported all kafkajs error logs from that period into a csv file and am attaching it to this issue. Maybe we can find anything that could indicate the service has stopped consuming events?

It may be worth mentioning that we are not using Kafka itself but Azure Event Hubs, with kafkajs as the connector to it.

To Reproduce It is the first time we saw this happening, so we do not know how to reproduce it.

Environment:

  • KafkaJS version 1.15.0
  • NodeJS version 12.22.12

Additional context Attaching the csv file of our KafkaJs error logs. Event Client Error Logs.csv

Blutude avatar Aug 02 '22 20:08 Blutude

We faced a similar issue and used the consumer heartbeat event to determine whether the consumer was still alive, i.e. whether it had emitted a heartbeat within the configured sessionTimeout.

mguay22 avatar Aug 02 '22 23:08 mguay22

@mguay22 I am familiar with the heartbeat event and how it's used to kill a consumer if it does not respond within the sessionTimeout. This is all built into the package.

We are using the package and we stopped receiving events but the consumer is not leaving / rejoining the consumer group.

Are you saying you added extra custom functionality on top of the heartbeat? Are you keeping your own in-memory variable that tracks the last heartbeat event, and if you stop getting heartbeats you shut down and restart the service?

Blutude avatar Aug 05 '22 16:08 Blutude

Will try out this approach ^ and keep this thread posted.

My strategy is to tie the microservice's healthcheck to the event client's health. I will listen to the kafkajs heartbeats and store the lastHeartbeatTimestamp in memory. On each microservice healthcheck, I check that lastHeartbeatTimestamp value, and if it is older than 30 minutes or so, the microservice gets restarted.
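A minimal sketch of that idea, for illustration only. The class name `HeartbeatMonitor`, its methods, and the injectable clock are hypothetical; the kafkajs wiring is shown in comments and relies on the `consumer.events.HEARTBEAT` instrumentation event. Whether heartbeats actually stop in the failure described above has not been confirmed in this thread.

```typescript
// Tracks the time of the most recent consumer heartbeat so that a health
// check can tell whether the consumer has gone quiet for too long.
class HeartbeatMonitor {
  private lastHeartbeatMs: number;
  private readonly maxSilenceMs: number;
  private readonly now: () => number;

  constructor(maxSilenceMs: number, now: () => number = () => Date.now()) {
    this.maxSilenceMs = maxSilenceMs;
    this.now = now;
    // Start the clock at construction so that a consumer which never
    // emits a heartbeat is eventually reported as unhealthy too.
    this.lastHeartbeatMs = this.now();
  }

  // Call this from the heartbeat event listener.
  recordHeartbeat(): void {
    this.lastHeartbeatMs = this.now();
  }

  // True while the last heartbeat is no older than maxSilenceMs.
  isHealthy(): boolean {
    return this.now() - this.lastHeartbeatMs <= this.maxSilenceMs;
  }
}

// Assumed wiring (not executed here):
//   const monitor = new HeartbeatMonitor(30 * 60 * 1000); // 30 minutes
//   consumer.on(consumer.events.HEARTBEAT, () => monitor.recordHeartbeat());
//   // In the service's healthcheck handler, report failure when
//   // monitor.isHealthy() returns false so the orchestrator restarts it.
```

The clock is passed in so the threshold logic can be exercised without waiting in real time.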

Blutude avatar Aug 10 '22 19:08 Blutude

@Blutude That is correct. However, we recently upgraded to v2.1.0 and are no longer seeing this issue. Can you try reproducing on the latest version and confirm?

mguay22 avatar Aug 11 '22 19:08 mguay22

I believe this is linked, and I can no longer reproduce https://github.com/tulios/kafkajs/issues/1163

I can see the same ETIMEDOUT in your logs as well. I think enforceRequestTimeout was recently changed to be enabled by default, which might be the fix here. Or a new runner https://github.com/tulios/kafkajs/pull/650

See https://github.com/tulios/kafkajs/pull/1337
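For anyone who wants to try the setting explicitly rather than relying on the version's default, a rough sketch of the client options follows. The `clientId` and broker address are placeholders, and the SSL/SASL settings that Event Hubs would need are omitted; `requestTimeout` and `enforceRequestTimeout` are the kafkajs client options being discussed.

```typescript
// Client options to pass to `new Kafka(...)` from kafkajs.
// clientId and brokers are placeholder values, not taken from this issue.
const clientConfig = {
  clientId: "my-service",
  brokers: ["my-namespace.servicebus.windows.net:9093"],
  // Time in milliseconds to wait for a broker response (30 seconds).
  requestTimeout: 30000,
  // Explicitly enable enforcement of requestTimeout instead of relying
  // on whatever the installed kafkajs version uses as its default.
  enforceRequestTimeout: true,
};

// Assumed usage (not executed here):
//   const kafka = new Kafka(clientConfig);
```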

mguay22 avatar Aug 11 '22 19:08 mguay22

I am trying to understand what this change does. I read that enforceRequestTimeout = true times out a request after 30 seconds (by default). But then why am I getting ETIMEDOUT errors if this setting is currently false? (I double-checked that it is not set, so it takes the current version's default value of false.)

Does this change make it so that a timeout forces a restart of the consumer? Isn't that what already happens in the current version? I see in the logs that it retries the request after a timeout and also tries restarting the consumer, so I am not clear on what this change does.

Also, are you saying that this change will no longer require me to implement the microservice's heartbeat to look at the lastHeartbeatTimestamp of the consumer?

Blutude avatar Aug 15 '22 20:08 Blutude