kombu
Celery crashes when it calls SQS ChangeMessageVisibility with an expired ReceiptHandle
We have encountered a problem in our system that is probably closely related to issue #1198.
The log message for the error that crashed Celery:
ClientError('An error occurred (InvalidParameterValue) when calling the ChangeMessageVisibility operation: Value <message value> for parameter ReceiptHandle is invalid. Reason: The receipt handle has expired.')
So the error is very similar; it just occurs during ChangeMessageVisibility rather than DeleteMessage as in the original issue.
From what I can tell, the scenario for reproducing the problem is exactly the same as in the previous issue, except that a FIFO queue must be used instead of a regular one (I tried to reproduce it on a non-FIFO queue and it doesn't happen).
I believe this error occurs because SQS.QoS.reject contains the check:
if routing_key and message and backoff_tasks and backoff_policy:
When that check passes, reject calls apply_backoff_policy, which in turn calls change_message_visibility.
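To illustrate the failure mode, here is a minimal sketch of a visibility change that tolerates an expired receipt handle instead of letting the error propagate. This is not kombu's code: FakeClientError, ExpiredHandleClient, is_expired_receipt_handle and change_visibility_tolerant are hypothetical names used only for this example. FakeClientError merely imitates the .response["Error"] layout of botocore.exceptions.ClientError so the snippet runs without AWS access; only the change_message_visibility keyword arguments (QueueUrl, ReceiptHandle, VisibilityTimeout) are taken from the real boto3 SQS client.

```python
class FakeClientError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError.

    It only reproduces the .response["Error"] dictionary layout.
    """

    def __init__(self, code, message):
        super().__init__(f"An error occurred ({code}): {message}")
        self.response = {"Error": {"Code": code, "Message": message}}


def is_expired_receipt_handle(exc):
    """Return True if exc looks like the 'receipt handle has expired' error."""
    error = (getattr(exc, "response", None) or {}).get("Error", {})
    return (
        error.get("Code") == "InvalidParameterValue"
        and "receipt handle has expired" in error.get("Message", "").lower()
    )


def change_visibility_tolerant(client, queue_url, receipt_handle, timeout):
    """Change the visibility timeout, ignoring an expired receipt handle.

    Returns True if the call succeeded and False if the handle had already
    expired. Any other error (for example a 500 from AWS) is re-raised.
    """
    try:
        client.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=timeout,
        )
    except Exception as exc:
        if is_expired_receipt_handle(exc):
            return False
        raise
    return True


class ExpiredHandleClient:
    """Hypothetical fake SQS client that always reports an expired handle."""

    def change_message_visibility(self, **kwargs):
        raise FakeClientError(
            "InvalidParameterValue",
            "Value abc for parameter ReceiptHandle is invalid. "
            "Reason: The receipt handle has expired.",
        )


if __name__ == "__main__":
    ok = change_visibility_tolerant(ExpiredHandleClient(), "queue-url", "abc", 30)
    print("visibility changed:", ok)
```

Whether silently skipping the backoff is the right behavior for kombu is a design question for the maintainers; the sketch only shows that the expired-handle case can be told apart from other client errors.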
Hey @ajakubo1 :wave:, thank you for opening an issue. We will get back to you as soon as we can.
What about https://github.com/celery/kombu/pull/1199?
I'm using:
celery==5.1.1
kombu==5.1.0
The changes from that PR are in the code I'm using, and I cannot reproduce the originally reported issue on a non-FIFO queue.
On a FIFO queue, however, the behavior is similar to that previous issue: an error is raised during ChangeMessageVisibility and the Celery process crashes.
I'm not 100% sure, but I assume this is a side effect of the fix implemented in that PR: super(Channel, self).basic_reject(delivery_tag) probably forces SQS.QoS.reject to be called. If so, the fix resolved the issue for non-FIFO queues but made FIFO queues fail on ChangeMessageVisibility (instead of DeleteMessage) because a backoff_policy is present?
I'm not sure my assumptions are correct, though, as I haven't analyzed all of the ways ChangeMessageVisibility might be called.
The same happens when Celery receives a 500 error from AWS:
UNABLE TO RESTORE 1 MESSAGES: (ClientError('An error occurred (500) when calling the ChangeMessageVisibility operation (reached max retries: 4): Internal Server Error'),)
EMERGENCY DUMP STATE TO FILE -> /tmp/tmp29uvcr72 <-
Cannot pickle state: TypeError("cannot pickle 'Message' object: a class that defines __slots__ without defining __getstate__ cannot be pickled with protocol 0"). Fallback to pformat.