Healthcheck options?

Open scruffles opened this issue 5 years ago • 5 comments

We have had several cases where our app has stopped listening to the queue. We're not sure what the issue is, but the app itself (an Express app running on Fargate) has stayed up while the queue fills. We call start at app startup and never call stop. It doesn't happen often, so I was hoping to just tie into Fargate's health check and let the instance restart if it stops processing the queue. Is there anything available that would let us check that the consumer has a solid connection to SQS, other than sending test messages to itself or something? Or if there's something we should check to prevent the problem, that would be great too.

scruffles avatar Aug 14 '20 16:08 scruffles

What you could do is have the consumer send a ping every time it polls the queue. Log that ping somewhere (a database or something like that) and have another process check for the ping. If the ping has not been received in X seconds or minutes (or what have you), then send a notification (SMS, email, Slack, etc). This is the only way I can think of to tell whether the consumer is still checking the queue.

At a minimum, you might want to send a ping from the message_received and empty events. You could also make the checker smarter by using the other events, such as timeout_error, processing_error, error, and message_processed, and tracking multiple statuses so that you can keep a better eye on each consumer. We plan to implement something like this for our consumers in the future.
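A minimal sketch of the ping idea, kept in-process for brevity. It assumes the consumer behaves like a Node EventEmitter that emits message_received and empty; a plain EventEmitter and a fake clock stand in for the real consumer here. The helper name trackHeartbeat and its options are hypothetical, not part of sqs-consumer.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical helper (not part of sqs-consumer): remembers when the
// emitter last fired any of the given polling-related events.
function trackHeartbeat(emitter, { events = ['message_received', 'empty'], now = Date.now } = {}) {
  let lastSeen = now();
  const touch = () => { lastSeen = now(); };
  for (const name of events) emitter.on(name, touch);
  return {
    // Milliseconds since the last tracked event.
    ageMs: () => now() - lastSeen,
    // True while the last tracked event is no older than maxAgeMs.
    isAlive: (maxAgeMs) => now() - lastSeen <= maxAgeMs,
  };
}

// Demo: a plain EventEmitter stands in for the consumer, with a fake clock.
let clock = 0;
const fakeConsumer = new EventEmitter();
const heartbeat = trackHeartbeat(fakeConsumer, { now: () => clock });

clock = 30000;
fakeConsumer.emit('empty');               // a poll came back with no messages
clock = 45000;
console.log(heartbeat.ageMs());           // 15000
console.log(heartbeat.isAlive(60000));    // true
clock = 120000;
console.log(heartbeat.isAlive(60000));    // false
```

In a real service the touch callback could instead write the timestamp to whatever store the checking process reads. The silence threshold would need to be comfortably longer than the consumer's polling wait time, otherwise an idle but healthy consumer would look dead.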

adamleemiller avatar Apr 10 '21 08:04 adamleemiller

We did something similar to that, but it has its downsides. In addition to the complexity of having that extra bit of code, it also doesn't deal well with the fact that we have multiple instances of the service listening to the queue. One or more of them might be down, and we would never know because at least one is processing the ping message.

It would be nice if we could tell from the client that a particular instance isn't healthy and let it re-initialize itself, but I guess without something built into SQS itself to check, there's going to be some extra complexity involved.
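One way to get a per-instance signal without a shared ping message might be for each process to expose its own health endpoint that reports unhealthy once that process has not seen a poll for a while. The sketch below uses only Node's built-in http module. The /health path, the port, the 60-second threshold, and the markPoll and healthStatus names are all assumptions for illustration.

```javascript
const http = require('node:http');

// Hypothetical threshold: how long this process may go without a poll
// before it reports itself unhealthy.
const MAX_SILENCE_MS = 60000;

// Time of the last poll seen by THIS process. In a real service this would
// be updated from the consumer's events, e.g. on message_received and empty.
let lastPollAt = Date.now();
function markPoll(at = Date.now()) { lastPollAt = at; }

// Pure decision function, so it can be exercised without opening a socket.
function healthStatus(now = Date.now()) {
  const silentForMs = now - lastPollAt;
  const healthy = silentForMs <= MAX_SILENCE_MS;
  return { statusCode: healthy ? 200 : 503, body: JSON.stringify({ healthy, silentForMs }) };
}

// Health endpoint; the /health path is an assumption.
const server = http.createServer((req, res) => {
  if (req.url !== '/health') { res.writeHead(404); res.end(); return; }
  const { statusCode, body } = healthStatus();
  res.writeHead(statusCode, { 'Content-Type': 'application/json' });
  res.end(body);
});
// server.listen(3000); // enable in a real service; the port is an assumption

// Demo with explicit timestamps instead of the wall clock.
markPoll(1000);
console.log(healthStatus(31000).statusCode); // 200
console.log(healthStatus(90000).statusCode); // 503
```

Because the check runs inside each instance, one healthy instance cannot mask a stuck one. A container or load balancer health check pointed at this endpoint could then replace only the instance that stopped polling.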

scruffles avatar Apr 11 '21 00:04 scruffles

Create a database (MySQL, Mongo, Redis, whatever) and in the ping request, send the instance/server name in the URI or the POST body so that you can keep track of each worker. If you have multiple workers on the same instance/server, then give them a name. This will help you track which one(s) are failing and which ones are still in good health.
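As a rough sketch of the per-worker tracking described above, the class below uses an in-memory Map as a stand-in for the database, keyed by worker name. WorkerRegistry, its methods, and the worker names are hypothetical; a real setup would persist the pings in whichever store is chosen.

```javascript
// In-memory stand-in for the database table; keys are worker names.
class WorkerRegistry {
  constructor(maxAgeMs) {
    this.maxAgeMs = maxAgeMs;
    this.lastPing = new Map();
  }

  // Called when a worker reports in, e.g. with the name sent in the ping request.
  recordPing(workerName, at = Date.now()) {
    this.lastPing.set(workerName, at);
  }

  // Sorted names of workers whose last ping is older than maxAgeMs.
  staleWorkers(now = Date.now()) {
    return [...this.lastPing]
      .filter(([, at]) => now - at > this.maxAgeMs)
      .map(([name]) => name)
      .sort();
  }
}

// Demo with explicit timestamps; the worker names are made up.
const registry = new WorkerRegistry(60000);
registry.recordPing('task-a/worker-1', 0);
registry.recordPing('task-b/worker-1', 0);
registry.recordPing('task-a/worker-1', 50000); // only task-a keeps pinging
console.log(registry.staleWorkers(90000));     // [ 'task-b/worker-1' ]
```

One limitation of this shape: a worker that never sent a first ping is not in the registry at all, so it would need to be registered at startup or compared against an expected list.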

adamleemiller avatar Apr 11 '21 00:04 adamleemiller

Noticed this one today as well. The server silently stopped processing events, and when we did end up checking the queue, it turned out that over 1,000 events had not been processed. Restarting the container of course fixed the issue, but this might be a serious problem.

judemanutd avatar Apr 24 '21 05:04 judemanutd

Same problem here!

tgiachi avatar Sep 08 '21 08:09 tgiachi

I don't think health checking is a problem for this library to solve.

nicholasgriffintn avatar Dec 09 '22 12:12 nicholasgriffintn