aws-sdk-js-v3

lib-dynamodb did not throw ServiceExceptions during Oct 20 DNS outage

Open justin-masse opened this issue 2 months ago • 4 comments

Checkboxes for prior research

Describe the bug

Hello,

While diagnosing why some of our custom health checks did not fail over our region immediately when the DDB outage began, we realized it was because we were expecting the SDK client to throw a ServiceException with a $fault value on it.

We check $fault to confirm that an error is a server fault before we conclude that DDB is likely in a failure state. However, it looks like the exceptions thrown while DNS was down were passed back to our client as some sort of generic HTTP exception?

What I see in our logs is this:

{"errorType":"Runtime.UnhandledPromiseRejection","errorMessage":"Error: getaddrinfo ENOTFOUND dynamodb.us-east-1.amazonaws.com","reason":{"errorType":"Error","errorMessage":"getaddrinfo ENOTFOUND dynamodb.us-east-1.amazonaws.com","code":"ENOTFOUND","errno":-3007,"syscall":"getaddrinfo","hostname":"dynamodb.us-east-1.amazonaws.com","$metadata":{"attempts":3,"totalRetryDelay":259},"stack":["Error: getaddrinfo ENOTFOUND dynamodb.us-east-1.amazonaws.com"," at GetAddrInfoReqWrap.onlookupall [as oncomplete] (node:dns:122:26)"," at GetAddrInfoReqWrap.callbackTrampoline (node:internal/async_hooks:130:17)"]},"promise":{},"stack":["Runtime.UnhandledPromiseRejection: Error: getaddrinfo ENOTFOUND dynamodb.us-east-1.amazonaws.com"," at process.<anonymous> (file:///var/runtime/index.mjs:1448:17)"," at process.emit (node:events:518:28)"," at emitUnhandledRejection (node:internal/process/promises:252:13)"," at throwUnhandledRejectionsMode (node:internal/process/promises:388:19)"," at processPromiseRejections (node:internal/process/promises:475:17)"," at process.processTicksAndRejections (node:internal/process/task_queues:106:32)"]}

I'm having difficulty determining whether this is the direct SDK output from one of our services or not, but I cannot locate a single log that says $fault: 'server', as I would have expected during a DNS outage such as this.

Regression Issue

  • [ ] Select this option if this issue appears to be a regression.

SDK version number

@aws-sdk/lib-dynamodb @ latest

Which JavaScript Runtime is this issue in?

Node.js

Details of the browser/Node.js/ReactNative version

18, 20, 22 all the same

Reproduction Steps

Not sure how to reproduce a dynamodb regional DNS failure

Observed Behavior

Did not see any ServiceException errors

Expected Behavior

Ideally, these sorts of errors would be formatted as a ServiceException in some way so they can be caught and acted on.

Possible Solution

No response

Additional Information/Context

No response

justin-masse avatar Nov 06 '25 20:11 justin-masse

Error: getaddrinfo ENOTFOUND indicates a connection issue. From the client's perspective the cause cannot be differentiated: it could be a DNS outage caused by the service, or a network configuration problem local to the account or client machine.

This type of outage is not really detectable through the SDK's $fault indicator, since the content of this field is determined by the server response. If there's no response at all, there's also no value given.
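
To make the distinction concrete, here is a minimal sketch of classifying a caught error by whether it carries $fault (a service response was parsed) or only a Node.js system error code (no response arrived). The names classifyError, FaultClass, and NETWORK_ERROR_CODES are hypothetical and not part of the SDK, and the list of codes is an illustrative assumption rather than an exhaustive one.

```typescript
// Hypothetical helper, not part of the AWS SDK.
type FaultClass = "client" | "server" | "network" | "unknown";

// Node.js system error codes that usually mean the request never reached
// the service. Illustrative assumption, not an exhaustive list.
const NETWORK_ERROR_CODES = new Set([
  "ENOTFOUND",
  "EAI_AGAIN",
  "ECONNREFUSED",
  "ECONNRESET",
  "ETIMEDOUT",
]);

function classifyError(error: unknown): FaultClass {
  if (typeof error !== "object" || error === null) return "unknown";
  const { $fault: fault, code } = error as { $fault?: unknown; code?: unknown };
  // A parsed service response carries $fault ("client" or "server").
  if (fault === "client" || fault === "server") return fault;
  // No service response: fall back to the system error code, if any.
  if (typeof code === "string" && NETWORK_ERROR_CODES.has(code)) return "network";
  return "unknown";
}

// Shape taken from the log in this issue: no $fault, but code is "ENOTFOUND".
const dnsFailure = Object.assign(
  new Error("getaddrinfo ENOTFOUND dynamodb.us-east-1.amazonaws.com"),
  { code: "ENOTFOUND", $metadata: { attempts: 3, totalRetryDelay: 259 } },
);

console.log(classifyError(dnsFailure)); // "network"
```

Whether a "network" result should count against the region is an application decision, since the same code can come from a local misconfiguration.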

kuhe avatar Nov 06 '25 20:11 kuhe

Is there a way for the client to override the DNS entry point for DynamoDB in a given region? I was under the impression that if this is something the client cannot actually control or change, it should be labeled as a service exception. If aws-sdk cannot even connect to DDB to determine whether a command can run, it feels like that is a "server" error?

justin-masse avatar Nov 06 '25 20:11 justin-masse

Custom endpoints can be configured in the SDK, but not DNS.

Not everything outside the view of the client is a service exception. The inability to connect, although in this case caused by the service, cannot be assumed in general to be a service exception.

But, you can assume this in your application logic by comparing

const isExternalFault = error.$fault !== "client";

// instead of
error.$fault === "server";
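
As an illustration of why the comparison above also covers errors that have no $fault at all, the sketch below evaluates it against three error shapes. The error objects are hand-built stand-ins (an assumption for the example), not real SDK exceptions.

```typescript
// Returns true for anything that is not explicitly a client fault,
// including errors where $fault is missing entirely.
function isExternalFault(error: unknown): boolean {
  const fault = (error as { $fault?: unknown } | null | undefined)?.$fault;
  return fault !== "client";
}

// Hand-built stand-ins for illustration only.
const dnsError = Object.assign(new Error("getaddrinfo ENOTFOUND"), { code: "ENOTFOUND" });
const serverError = Object.assign(new Error("InternalServerError"), { $fault: "server" });
const clientError = Object.assign(new Error("ValidationException"), { $fault: "client" });

console.log(isExternalFault(dnsError));    // true  ($fault is undefined)
console.log(isExternalFault(serverError)); // true
console.log(isExternalFault(clientError)); // false
```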

kuhe avatar Nov 07 '25 16:11 kuhe

I don't think that works, because from what I can tell the error was not coming back with a $fault at all. It seems to be a generic exception being thrown rather than a formatted ServiceException. It also appears that $fault only has two possible values, client | server, based on the type in the docs.

I know we can't catch 100% of things by checking $fault === "server", which is fine, but I was expecting this to be returned as a ServiceException and labeled as a server error, since no config passed to the SDK should ever result in a 500 or ENOTFOUND?

The way I'm changing our code to work is like:

catch (error) {
    // If the error is a client error, then the region is LIKELY online but there's a misconfiguration.
    return error instanceof DynamoDBServiceException && error.$fault === 'client';
}

So yeah, for ANY error that comes back formatted as a service exception with a client fault, we'll continue to mark the region as ACTIVE. Our previous iteration always expected this kind of failure to come back as a service exception.
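
A self-contained sketch of that health-check logic might look like the following. The function isRegionLikelyActive and its probe parameter are hypothetical names, and the check duck-types on $fault instead of using instanceof DynamoDBServiceException so the example runs without the SDK installed; real code would likely keep the instanceof check.

```typescript
// Hypothetical health-check wrapper; `probe` stands in for any DynamoDB call.
async function isRegionLikelyActive(probe: () => Promise<unknown>): Promise<boolean> {
  try {
    await probe();
    return true; // the service answered successfully
  } catch (error) {
    // A client-fault response still shows that the service endpoint answered.
    // Anything else (server fault, or no $fault at all) marks the region as down.
    const fault = (error as { $fault?: unknown } | null | undefined)?.$fault;
    return fault === "client";
  }
}

// Stand-in probes for illustration only.
const dnsDown = () =>
  Promise.reject(Object.assign(new Error("getaddrinfo ENOTFOUND"), { code: "ENOTFOUND" }));
const misconfigured = () =>
  Promise.reject(Object.assign(new Error("ValidationException"), { $fault: "client" }));

isRegionLikelyActive(dnsDown).then((active) => console.log(active));       // false
isRegionLikelyActive(misconfigured).then((active) => console.log(active)); // true
```

With this shape, a DNS failure like the one in the log above marks the region as inactive, at the cost of also failing over when the network problem is local to the caller.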

justin-masse avatar Nov 07 '25 17:11 justin-masse