
Celery crashes in cases where it tries to call SQS ChangeMessageVisibility after expired ReceiptHandle

Open ajakubo1 opened this issue 4 years ago • 4 comments

We have encountered a problem in our system that is probably closely related to issue #1198.

The log message from the Celery crash:

ClientError('An error occurred (InvalidParameterValue) when calling the ChangeMessageVisibility operation: Value <message value> for parameter ReceiptHandle is invalid. Reason: The receipt handle has expired.')

So the error is very similar; it just occurs while running ChangeMessageVisibility rather than DeleteMessage as in the original issue.

From what I noticed, the steps to reproduce the problem are exactly the same as in the previous issue, except that a FIFO queue must be used instead of a regular queue (I tried to reproduce it on a non-FIFO queue and it doesn't happen).

I believe this error occurs because SQS.QoS.reject has an if statement:

if routing_key and message and backoff_tasks and backoff_policy:

which calls apply_backoff_policy, which in turn calls change_message_visibility.
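To illustrate the suspected call chain, here is a minimal, self-contained sketch. It is not kombu's code: `ClientError` is a stand-in for `botocore.exceptions.ClientError`, `FakeSQS` is a hypothetical client, and the function signatures and the dict-shaped message are simplified assumptions. It only shows how an expired receipt handle would surface as an unhandled exception from `reject` when a backoff policy is configured.

```python
class ClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError (simplified assumption)."""


class FakeSQS:
    """Hypothetical client that rejects receipt handles marked as expired."""

    def __init__(self, expired_handles):
        self.expired_handles = set(expired_handles)

    def change_message_visibility(self, receipt_handle, visibility_timeout):
        if receipt_handle in self.expired_handles:
            raise ClientError(
                "InvalidParameterValue: The receipt handle has expired."
            )
        return visibility_timeout


def apply_backoff_policy(client, message, backoff_policy):
    # Look up the delay for this retry count and push it to the queue.
    timeout = backoff_policy.get(message.get("retries", 1), 0)
    return client.change_message_visibility(message["receipt_handle"], timeout)


def reject(client, routing_key, message, backoff_tasks, backoff_policy):
    # Same shape as the condition quoted above; nothing catches ClientError,
    # so an expired handle propagates to the caller.
    if routing_key and message and backoff_tasks and backoff_policy:
        return apply_backoff_policy(client, message, backoff_policy)
    return None
```

With an empty `backoff_policy` the branch is skipped and `change_message_visibility` is never called, which would fit a setup where the error only shows up once a backoff policy is configured.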

ajakubo1 avatar Sep 20 '21 11:09 ajakubo1

Hey @ajakubo1 :wave:, Thank you for opening an issue. We will get back to you as soon as we can. Also, check out our Open Collective and consider backing us - every little helps!

We also offer priority support for our sponsors. If you require immediate assistance please consider sponsoring us.

What about https://github.com/celery/kombu/pull/1199?

auvipy avatar Sep 20 '21 14:09 auvipy

I'm using:

celery==5.1.1
kombu==5.1.0

The changes from that PR are in the code I'm using, and I cannot reproduce the originally reported issue on a non-FIFO queue.

For a FIFO queue, on the other hand, the behavior is similar to the previous issue: an error is raised during ChangeMessageVisibility and the Celery process crashes.

I'm not 100% sure, but I assume this is due to the fix implemented in that PR: super(Channel, self).basic_reject(delivery_tag) is probably forcing SQS.QoS.reject to be called. That fix resolved the issue for a non-FIFO queue but made a FIFO queue fail with ChangeMessageVisibility (instead of DeleteMessage) because backoff_policy is present?

I'm not sure my assumptions are correct, though, as I didn't analyze all of the ways that ChangeMessageVisibility might be called.

ajakubo1 avatar Sep 20 '21 14:09 ajakubo1

The same happens when Celery receives a 500 error from AWS:

UNABLE TO RESTORE 1 MESSAGES: (ClientError('An error occurred (500) when calling the ChangeMessageVisibility operation (reached max retries: 4): Internal Server Error'),)
EMERGENCY DUMP STATE TO FILE -> /tmp/tmp29uvcr72 <-
Cannot pickle state: TypeError("cannot pickle 'Message' object: a class that defines __slots__ without defining __getstate__ cannot be pickled with protocol 0"). Fallback to pformat.
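The "Cannot pickle state" line is a secondary symptom of the emergency dump, not the cause of the crash. A minimal sketch of that TypeError, using a hypothetical `SlottedMessage` class as a stand-in for the real `Message` class (whose definition is not shown in this thread):

```python
import pickle


class SlottedMessage:
    """Hypothetical stand-in: defines __slots__ but no __getstate__."""

    __slots__ = ("body",)

    def __init__(self, body):
        self.body = body


def try_pickle(obj, protocol):
    """Return True if obj pickles with the given protocol, False on TypeError."""
    try:
        pickle.dumps(obj, protocol=protocol)
        return True
    except TypeError:
        return False
```

Calling `try_pickle(SlottedMessage("x"), 0)` returns `False`, because protocols 0 and 1 cannot serialize a `__slots__` class that lacks `__getstate__`. Protocol 2 and later can handle such classes.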

K0Te avatar Jan 10 '22 16:01 K0Te