aws-node-termination-handler

No SQS retry on "read: connection reset by peer"

Open eugenea opened this issue 3 years ago • 4 comments

Describe the bug

NTH does not retry the AWS SDK API request that retrieves SQS queue messages when the request fails with a connection reset.

Steps to reproduce

Block access to the SQS AWS endpoint with a firewall, then let NTH monitor for SQS events.

Expected outcome

The network layer cannot be guaranteed to be reliable, so NTH needs retry logic for this request.

Application Logs

WRN There was a problem monitoring for events error="RequestError: send request failed\ncaused by: Post \"https://sqs.us-west-2.amazonaws.com/\": read tcp 100.100.xx.xx:xxxx->10.xx.xx.xx:443: read: connection reset by peer" event_type=SQS_TERMINATE

Environment

  • NTH App Version: v1.16.3
  • NTH Mode (IMDS/Queue processor): Queue processor
  • OS/Arch: Linux
  • Kubernetes version: v1.21.14-eks
  • Installation method: deployment

The check that denies the retry is here. For v1 of the AWS SDK, the fix should be a custom retryer that re-implements the should-retry check, and that custom retryer should be injected here. However, upgrading to v2 of the AWS SDK should fix this issue automatically, because v2 does not distinguish between different kinds of connection reset and retries them all, which is the desired behavior here.
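As a rough illustration of the behavior described above (treat every connection reset as retryable), here is a minimal standard-library sketch. It is not NTH code and does not use the AWS SDK retryer API; `isConnReset`, `retryOnReset`, and `simulatedReset` are hypothetical names made up for this example, and the string fallback assumes the wrapping error may not expose `Unwrap`.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"strings"
	"syscall"
	"time"
)

// simulatedReset mimics the shape of the error in the log above:
// a wrapped network error whose root cause is ECONNRESET.
var simulatedReset = fmt.Errorf("send request failed: %w", &net.OpError{
	Op:  "read",
	Net: "tcp",
	Err: os.NewSyscallError("read", syscall.ECONNRESET),
})

// isConnReset reports whether err looks like a TCP connection reset.
// It makes no distinction between read and write resets.
func isConnReset(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, syscall.ECONNRESET) {
		return true
	}
	// Assumption: some wrappers may not expose Unwrap, so fall back to the message.
	return strings.Contains(err.Error(), "connection reset by peer")
}

// retryOnReset calls op up to maxAttempts times, doubling the delay
// between attempts, and retries only when op fails with a connection reset.
func retryOnReset(maxAttempts int, baseDelay time.Duration, op func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	delay := baseDelay
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = op()
		if err == nil || !isConnReset(err) {
			return err // success, or an error this sketch does not retry
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Stand-in for the SQS receive call: fails twice with a reset, then succeeds.
	err := retryOnReset(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return simulatedReset
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

In a real fix against v1 of the SDK, the same classification would presumably live inside the custom retryer rather than in a wrapper loop like this one.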

eugenea • Nov 02 '22 23:11

Thank you for the suggestion! Upon first reading, we would favor doing the upgrade to v2 of the AWS SDK if it can handle this logic automatically.

snay2 • Nov 16 '22 19:11

Do you have any timeline/plan for v2 upgrade?

eugenea • Nov 30 '22 21:11

No firm timeline yet, but it's one of our ongoing projects at the moment.

snay2 • Dec 02 '22 17:12

@eugenea, the beta version of the NTH v2 upgrade has recently been released. Have you had a chance to investigate whether the new SDK can handle this use case?

jillmon • Jan 19 '23 18:01