broadway_sqs

AWS.SimpleQueueService.BatchEntryIdsNotDistinct

Open bismark opened this issue 4 years ago • 5 comments

We are getting these errors on occasion (once or twice a week):

[error] ** (ExAws.Error) ExAws Request Error! 
 {:error, {:http_error, 400, %{code: "AWS.SimpleQueueService.BatchEntryIdsNotDistinct", detail: "", message: "Id 688a1987-8bce-4093-99ac-8acf3bebd17e repeated.", request_id: "7974caea-316e-53ce-8222-9dc58e5736ba", type: "Sender"}}} 
     (ex_aws 2.1.3) lib/ex_aws.ex:66: ExAws.request!/2 
     (elixir 1.10.2) lib/enum.ex:783: Enum."-each/2-lists^foreach/1-0-"/2 
     (elixir 1.10.2) lib/enum.ex:783: Enum.each/2 
     (elixir 1.10.2) lib/enum.ex:789: anonymous fn/3 in Enum.each/2 
     (stdlib 3.12.1) maps.erl:232: :maps.fold_1/3 
     (elixir 1.10.2) lib/enum.ex:2127: Enum.each/2 
     (broadway 0.6.0) lib/broadway/consumer.ex:64: Broadway.Consumer.handle_events/3 
     (gen_stage 1.0.0) lib/gen_stage.ex:2395: GenStage.consumer_dispatch/6 

My first hunch was a duplicate message caused by the at-least-once delivery behavior of standard queues. However, the crash repeats over and over until I go into the SQS manager and delete that single message...

I'm a bit stumped as to the root cause, but would a PR that runs uniq by id on the receipts before sending the delete batch be welcome?
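For illustration, the uniq-by-id idea could look roughly like the sketch below. It assumes receipts are plain maps with `:id` and `:receipt_handle` keys, as in the acknowledger data; the `ReceiptDedup` module is hypothetical and not part of broadway_sqs.

```elixir
defmodule ReceiptDedup do
  # Hypothetical helper, not part of broadway_sqs.
  # Keeps the first receipt seen for each id, so that no id
  # appears twice in a single delete batch.
  def uniq_by_id(receipts) do
    Enum.uniq_by(receipts, fn receipt -> receipt.id end)
  end
end

receipts = [
  %{id: "a", receipt_handle: "h1"},
  %{id: "a", receipt_handle: "h2"},
  %{id: "b", receipt_handle: "h3"}
]

# The second receipt for id "a" is dropped.
deduped = ReceiptDedup.uniq_by_id(receipts)
IO.inspect(deduped)
```

Note that this only avoids the BatchEntryIdsNotDistinct error; it does not explain why the same message shows up twice in one batch.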

bismark avatar May 08 '20 21:05 bismark

Hi @bismark! I would prefer not to call uniq on it because it means we would most likely hide the root cause.

Btw, this is the first time we have heard a report like this, so I would start by double-checking your producer (do you have any custom client code?) and your handle_batch to make sure there are no duplicated messages.

josevalim avatar May 08 '20 22:05 josevalim

I was finally able to snag a dump of a problematic message batch today:

[
%Broadway.Message{acknowledger: {BroadwaySQS.ExAwsClient, #Reference<0.2449289218.3435134979.72558>, %{receipt: %{id: "22bde59a-a4dd-4e3d-992e-27f8d139ef88", receipt_handle: "AQEBoRN7Fv//r7Ez5RxXr36M99n9hHTz44WsTnosssPc7PNK/j7az9JV80GWZCIqFrm+whczf+RUBCsygPBj8MIjS0endHsPzDq85E2Zrv1fKKsxmEN2t2G/Z8mzuq+grpBSuhLkQHUOwPLeLpSp6QpqEi5WFUCJQ6m9Nla4qL403TxcgmmIrNZSbpWyqrZY0HrDnYNxvHij9+EMedDCvru5RlttISMmbF6gjDKEcJ8Ih6e48JH+8qms2x0GUFmaTTuWOKVgyqA9VQX2V15ON2pYMhG+iZE/zw2X0/QQY1pnG7M6ookkMEzgLButf6nxEJErti8e5NrYscM+qSSWbcvHM7kFV3dv0NKzs3bABAd8bu6IMMm98j3n32MokFZnC4B7HFpTvCsZDvkCXdvzWm2p2XQGXqJPv6IXdIdUUb+CdAs="}}}, batch_key: :default, batch_mode: :bulk, batcher: :default, data: "...", metadata: %{attributes: [], md5_of_body: "1da40121e821d5e42263ceddacad58f7", message_attributes: %{"Title" => %{binary_value: "", data_type: "String", name: "Title", string_value: "zoom", value: "zoom"}}, message_id: "22bde59a-a4dd-4e3d-992e-27f8d139ef88", receipt_handle: "AQEBoRN7Fv//r7Ez5RxXr36M99n9hHTz44WsTnosssPc7PNK/j7az9JV80GWZCIqFrm+whczf+RUBCsygPBj8MIjS0endHsPzDq85E2Zrv1fKKsxmEN2t2G/Z8mzuq+grpBSuhLkQHUOwPLeLpSp6QpqEi5WFUCJQ6m9Nla4qL403TxcgmmIrNZSbpWyqrZY0HrDnYNxvHij9+EMedDCvru5RlttISMmbF6gjDKEcJ8Ih6e48JH+8qms2x0GUFmaTTuWOKVgyqA9VQX2V15ON2pYMhG+iZE/zw2X0/QQY1pnG7M6ookkMEzgLButf6nxEJErti8e5NrYscM+qSSWbcvHM7kFV3dv0NKzs3bABAd8bu6IMMm98j3n32MokFZnC4B7HFpTvCsZDvkCXdvzWm2p2XQGXqJPv6IXdIdUUb+CdAs="}, status: :ok},
%Broadway.Message{acknowledger: {BroadwaySQS.ExAwsClient, #Reference<0.2449289218.3435134979.72558>, %{receipt: %{id: "22bde59a-a4dd-4e3d-992e-27f8d139ef88", receipt_handle: "AQEBnYymgMijksDbjAP/lo6yoyz5o1/A2mkCw0IYbCfqnND4K8/TEX1QoIGrelpeVors50/t2tM/Vntu1Bap9qTC+A4B4xn1A5MCFB/HSy+QnFtCKDjrvY9fpYApIU3R+voFARBijBSGBs3h8+IC7KNMy66MtoZspZEPQOLQTs3PqLKDizi/f6uRxzGdEY+/bpHG5saWpS6iGNKM961JF+oSlSJOeQn/YFJP4MuF0LOmUn5i4/RoF4CsZY6Igvbczk52b8oVhJvDJGpeSH4taLPS6NOB5Uaq81YhLQ1r275b0XhzLg5WWZurooY6kpS6vP7EGkIrZpFmjvVoX4aiLBcmNBbDwWSbl6ZGGe6/TCFG2vjWa/CmLy1k3IYqvw3tq0w9Tr3grg9FnMlyS5N710+W7rdN7xddIzfN352c3xovwxE="}}}, batch_key: :default, batch_mode: :bulk, batcher: :default, data: "...", metadata: %{attributes: [], md5_of_body: "1da40121e821d5e42263ceddacad58f7", message_attributes: %{"Title" => %{binary_value: "", data_type: "String", name: "Title", string_value: "zoom", value: "zoom"}}, message_id: "22bde59a-a4dd-4e3d-992e-27f8d139ef88", receipt_handle: "AQEBnYymgMijksDbjAP/lo6yoyz5o1/A2mkCw0IYbCfqnND4K8/TEX1QoIGrelpeVors50/t2tM/Vntu1Bap9qTC+A4B4xn1A5MCFB/HSy+QnFtCKDjrvY9fpYApIU3R+voFARBijBSGBs3h8+IC7KNMy66MtoZspZEPQOLQTs3PqLKDizi/f6uRxzGdEY+/bpHG5saWpS6iGNKM961JF+oSlSJOeQn/YFJP4MuF0LOmUn5i4/RoF4CsZY6Igvbczk52b8oVhJvDJGpeSH4taLPS6NOB5Uaq81YhLQ1r275b0XhzLg5WWZurooY6kpS6vP7EGkIrZpFmjvVoX4aiLBcmNBbDwWSbl6ZGGe6/TCFG2vjWa/CmLy1k3IYqvw3tq0w9Tr3grg9FnMlyS5N710+W7rdN7xddIzfN352c3xovwxE="}, status: :ok}
]

So same message, different receipt handles...

Producer is {BroadwaySQS.Producer, queue_url: "<url>", message_attribute_names: :all}.

handle_batch:

  def handle_batch(_, messages, _, _), do: messages

bismark avatar May 14 '20 22:05 bismark

I see. This SO answer provides some insights on your options: https://stackoverflow.com/questions/23815080/how-to-handle-sqs-re-sending-the-same-message-to-the-requestor

I believe it would still be important to understand why this is happening. Is it a misconfiguration on AWS side? Low timeouts, etc?

If you believe this will be common, then you will need to store the message IDs (in memory, on disk, in Redis, or in the DB) and skip any message you have already seen. You can use Broadway.Message.ack_immediately to ack a message right away, outside of the batcher flow. More information here: https://github.com/dashbitco/broadway_sqs/blob/master/lib/broadway_sqs/producer.ex#L75
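As a rough sketch of the in-memory variant, the seen-ID store could be an ETS table. Everything here is an assumption made for illustration: the `SeenMessages` module, the table name, and the use of plain maps shaped like the message metadata in the dump above.

```elixir
defmodule SeenMessages do
  # Hypothetical module; the table name and API are illustrative only.
  @table :seen_sqs_message_ids

  # Creates a public, named ETS set owned by the calling process.
  def init do
    :ets.new(@table, [:set, :public, :named_table])
    :ok
  end

  # Returns true the first time an id is recorded and false on every repeat.
  # :ets.insert_new/2 only inserts when the key is not already present.
  def first_time?(message_id) do
    :ets.insert_new(@table, {message_id})
  end

  # Splits messages into {fresh, duplicates} based on metadata.message_id.
  def split(messages) do
    Enum.split_with(messages, fn msg -> first_time?(msg.metadata.message_id) end)
  end
end

SeenMessages.init()

messages = [
  %{metadata: %{message_id: "m1"}},
  %{metadata: %{message_id: "m1"}},
  %{metadata: %{message_id: "m2"}}
]

{fresh, duplicates} = SeenMessages.split(messages)
IO.inspect(length(fresh), label: "fresh")
IO.inspect(length(duplicates), label: "duplicates")
```

In a real pipeline the duplicates are the messages that would be acked immediately rather than passed on to a batcher. The table in this sketch grows without bound, so a production version would need some form of expiry.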

josevalim avatar May 15 '20 07:05 josevalim

We have also run into this when processing messages from SQS in batches. The same message may be redelivered if it was not deleted within the processing window of a previous request.

If you receive a message more than once, each time you receive it, you get a different receipt handle. You must provide the most recently received receipt handle when you request to delete the message (otherwise, the message might not be deleted). https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-queue-message-identifiers.html
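Given that guidance, a dedup step that keeps the most recent receipt per message ID might be sketched as follows. The `LatestReceipt` module is hypothetical, and the sketch assumes the list is ordered oldest-first, which is not something the thread confirms.

```elixir
defmodule LatestReceipt do
  # Hypothetical helper, not part of broadway_sqs.
  # Assumption: receipts are ordered oldest-first, so the last
  # occurrence of an id carries the most recent receipt handle.
  def keep_latest(receipts) do
    receipts
    |> Enum.reverse()
    |> Enum.uniq_by(fn receipt -> receipt.id end)
    |> Enum.reverse()
  end
end

receipts = [
  %{id: "a", receipt_handle: "h1"},
  %{id: "a", receipt_handle: "h2"},
  %{id: "b", receipt_handle: "h3"}
]

# Keeps handle "h2" for id "a" and drops the older "h1".
IO.inspect(LatestReceipt.keep_latest(receipts))
```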

We could add filtering for this in our handle_batch/4 implementations, but it would be nice to have an out-of-the-box solution for everyone. Since there doesn't appear to be an option to receive only unique message identifiers when fetching a batch of messages, I think handling this case directly in this library makes the most sense.

Another factor to consider is that the WaitTimeSeconds used in the call to receive messages counts towards the message visibility timeout. Adjusting these settings should also help prevent this scenario from occurring.

To avoid HTTP errors, ensure that the HTTP response timeout for ReceiveMessage requests is longer than the WaitTimeSeconds parameter. https://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_ReceiveMessage.html
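As a hedged example, the two settings could be passed through the producer options. This assumes the installed broadway_sqs version accepts `:wait_time_seconds` and `:visibility_timeout` on BroadwaySQS.Producer; the values are hypothetical and should be tuned for the workload.

```elixir
# Hypothetical values; tune them for your workload.
wait_time_seconds = 10
visibility_timeout = 120

# Assumption: the installed BroadwaySQS.Producer accepts these two options.
producer =
  {BroadwaySQS.Producer,
   queue_url: "<url>",
   message_attribute_names: :all,
   wait_time_seconds: wait_time_seconds,
   visibility_timeout: visibility_timeout}

IO.inspect(producer)
```

The HTTP client's response timeout is configured separately and, per the note above, should be longer than the wait time.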

rraub avatar Jul 31 '20 19:07 rraub

Do you have any thoughts on #63?

benkeil avatar Sep 29 '21 04:09 benkeil

Closing in favor of #63 to avoid duplicate discussion.

josevalim avatar Feb 10 '23 18:02 josevalim