aws-sdk-js-v3
TCP Keepalive probes are not sent when using AWS JS SDK
Describe the bug
AWS NAT Gateway and VPC endpoints (VPCE) have a 350-second timeout for idle connections. After 350 seconds the connection is closed ungracefully with a RST packet and becomes unusable. This is not unique to NAT GW/VPCE; the technique is common in other network appliances. TCP provides a keepalive mechanism to prevent idle connections from being closed. Some AWS SDKs and the CLI support enabling TCP Keepalive, e.g. see
- https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/http/apache/ApacheHttpClient.Builder.html (look for tcpKeepAlive)
- https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html (look for tcp_keepalive)
- https://docs.aws.amazon.com/cli/latest/topic/config-vars.html (look for tcp_keepalive)
The AWS JS SDK doesn't seem to support this capability. The only mention of TCP Keepalive I found was at the very bottom of this doc article; however, when I tested it, it didn't work: a request made through the NAT GW that lasted more than 350 seconds never completed and remained hanging. As a result, it is impossible to make a request to a resource that takes more than 350 seconds to respond, e.g. it is impossible to synchronously invoke a Lambda function that takes more than 350 seconds to execute.
(I'm using Lambda as an example, but this would apply to any other connection that has a potential to remain idle for >350seconds).
Your environment
EC2 instance with AL2
SDK version number
@aws-sdk/[email protected]
Is the issue in the browser/Node.js/ReactNative?
Nodejs
Details of the browser/Node.js/ReactNative version
$ node -v
v14.19.3
$ npm -v
6.14.17
Steps to reproduce
- Create a Lambda function1 that sleeps for 340 seconds and then returns 'OK'
- Create a Lambda function2 that sleeps for 360 seconds and then returns 'OK'
- Create an EC2 in private subnet, expose it to internet via NAT GW in a public subnet
- Update OS level TCP Keepalive settings in /etc/sysctl.conf of the EC2 instance to values <2 minutes, e.g.

- Reboot EC2 instance.
- [Test1] Create a simple NodeJS app that uses AWS SDK to invoke function1, run it
- [Test2] Create a simple NodeJS app that uses AWS SDK to invoke function2, run it
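Before running the tests, it may help to confirm that the kernel settings took effect after the reboot. The following is a minimal sketch, not part of the original report: it reads the standard Linux value from /proc/sys/net/ipv4/tcp_keepalive_time and checks that the first probe would be sent before the 350-second idle timeout. The helper names are hypothetical. Keep in mind that these kernel values only apply to sockets that have SO_KEEPALIVE enabled, which is the crux of this issue.

```javascript
const fs = require('fs');

const NAT_IDLE_TIMEOUT_SEC = 350; // idle timeout quoted in this issue

// Hypothetical helper: true when the first keepalive probe is sent
// before the idle timeout expires.
function firstProbeBeatsTimeout(keepaliveTimeSec, idleTimeoutSec = NAT_IDLE_TIMEOUT_SEC) {
  return keepaliveTimeSec < idleTimeoutSec;
}

// Hypothetical helper: reads one integer setting from /proc/sys/net/ipv4
// (Linux only); returns null when the file is not readable.
function readSetting(name) {
  try {
    return parseInt(fs.readFileSync(`/proc/sys/net/ipv4/${name}`, 'utf8').trim(), 10);
  } catch (err) {
    return null;
  }
}

const keepaliveTime = readSetting('tcp_keepalive_time');
if (keepaliveTime === null) {
  console.log('tcp_keepalive_time is not readable on this system');
} else {
  console.log({ keepaliveTime, ok: firstProbeBeatsTimeout(keepaliveTime) });
}
```

If `ok` is false, the kernel would not send its first probe until after the NAT GW has already dropped the idle connection.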
Lambda function code (server side)
// const waitForSeconds = 1;
// const waitForSeconds = 10;
// const waitForSeconds = 30;
// const waitForSeconds = 70; // To test auto-retry after 60 seconds
// const waitForSeconds = 2*60;
// const waitForSeconds = 5*60; // 300
const waitForSeconds = 6*60; // 360
// const waitForSeconds = 7*60;
// const waitForSeconds = 10*60;
// const waitForSeconds = 12*60;
// const waitForSeconds = (15*60)-10;
exports.handler = async (event, context) => {
  await new Promise(async (resolve, reject) => {
    for (let i = 0; i < waitForSeconds; i++) {
      console.log({ i });
      await new Promise((resolve, reject) => setTimeout(resolve, 1000));
    }
    resolve();
  });
  return 'Hello from F1';
};
Application code (client side)
const { LambdaClient, InvokeCommand } = require('@aws-sdk/client-lambda');
const { NodeHttpHandler } = require('@aws-sdk/node-http-handler');
const { Agent } = require('http');

const agent = new Agent({
  keepAlive: true
});

const lambdaClient = new LambdaClient({
  requestHandler: new NodeHttpHandler({
    httpAgent: agent
  })
});

const invokeParams = {
  FunctionName: 'f1',
  Payload: ''
};

const command = new InvokeCommand(invokeParams);

lambdaClient.send(command).then((data) => {
  console.log(data);
}, (error) => {
  console.error(error);
});
Observed behavior
- Test1 - after 340 seconds: Lambda function1 execution completes successfully; AWS SDK request completes successfully.
- Test2 - after 360 seconds: Lambda function2 execution completes successfully; AWS SDK request remains in a waiting/hanging state and never completes with either success or failure (I waited for several minutes; maybe there's a timeout after a longer period of time).
Expected behavior
- Test2 - after 360 seconds invocation completes successfully.
Screenshots
Additional context
I've created an internal write-up on addressing this problem with other clients: AWS Java SDK, Python SDK, CLI. Please Slack me internally for the link (@antonaws).
Does this issue persist when you initialize the Agent and NodeHttpHandler with higher timeouts?
const agent = new Agent({
  keepAlive: true,
  // keepAliveMsecs: 999,
  // maxSockets: 100,
  // maxTotalSockets: 100,
  // maxFreeSockets: 256,
  timeout: 900 * 1000,
  // scheduling: "fifo",
});

const lambdaClient = new LambdaClient({
  region: 'us-east-1',
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 8000,
    socketTimeout: 900 * 1000,
    httpAgent: agent,
  }),
});
Is there any update on this? Thanks.
We ran into this as well as part of upgrading from v2 to v3. v2 code which performed some long (10min) processing tasks in between requests to SQS did not work after upgrade because of this behavior. Finding all of the places this may happen in a sizable codebase seems difficult, so this seems like a barrier to upgrade and something which the client should handle?
Hi @aal80, @robfig, @macsir, sorry to hear you are having some issues. After working on an internal report about this issue, I found that it is caused by a bug in the Node.js https library, which does not apply the keep-alive settings to the socket. I have filed a bug report with Node.js, which is this one, and I have also posted a PR that works around this behavior in the SDK by adding a listener on the request for when a new socket is created and setting the keep-alive options on the socket there.
As soon as I get more updates on this, I will update this thread.
Thanks!
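For readers who want to see what such a listener could look like, here is a minimal sketch of the idea described above. It is not the code from the PR: the helper name `enableKeepAliveOnSocket` and the delay value are assumptions made for illustration. It relies only on the `'socket'` event of `http.ClientRequest` and on `net.Socket#setKeepAlive`, and it uses stand-in objects so that it runs without a network.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: enable TCP keepalive as soon as the request is
// assigned a socket, instead of waiting for the agent to pool the socket
// after the first response.
function enableKeepAliveOnSocket(request, initialDelayMs = 1000) {
  request.on('socket', (socket) => {
    // setKeepAlive(enable, initialDelay) turns on SO_KEEPALIVE for this socket.
    socket.setKeepAlive(true, initialDelayMs);
  });
  return request;
}

// Demonstration with stand-ins for http.ClientRequest and net.Socket.
const calls = [];
const fakeSocket = { setKeepAlive: (enable, delay) => calls.push([enable, delay]) };
const fakeRequest = enableKeepAliveOnSocket(new EventEmitter(), 60 * 1000);
fakeRequest.emit('socket', fakeSocket);
console.log(calls);
```

With a real request, the same helper would wrap the object returned by `https.request(...)` before the request is sent.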
@aal80 @robfig @macsir and all others impacted by this bug -
There's a simple hack you can use to avoid this issue: complete a guaranteed-fast request with the client before attempting the >5 minute request.
E.g. if there's a Lambda invoke that may take >5 minutes, first use the client to make an invoke request with InvocationType set to DryRun, then attempt the >5 minute invoke.
DryRun – Validate parameter values and verify that the user or role has permission to invoke the function. https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Lambda.html
This works because the keepalive parameters are not applied until after the first request is completed. You can see it in this code:
state.keepAliveTimeoutSet = true;
I highly recommend adding this workaround to your code - bnoordhuis mentions that this is intended behavior in node, so there's no guarantee it will ever be fixed:
The keepAliveMsecs option isn't applied until the socket is added to the connection pool, i.e., after the first request. That's by design, as far as I'm aware. (2) is a policy change. Best way forward is to open a pull request and see how it's received. https://github.com/nodejs/node/issues/47137#issuecomment-1477574618
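The warm-up workaround could be sketched roughly as follows. This is an illustration under assumptions, not an official recipe: the helper names `toDryRunParams` and `invokeWithWarmup` are hypothetical, the client and command class are passed in so the snippet stays self-contained, and it assumes the client uses a keep-alive agent (as in the client code earlier in this thread) so that the second request reuses the pooled socket. In the Lambda Invoke API, a dry run is requested with `InvocationType: 'DryRun'`.

```javascript
// Hypothetical helper: derive a fast warm-up request from the real parameters.
function toDryRunParams(params) {
  return { ...params, InvocationType: 'DryRun' };
}

// Hypothetical helper: send a DryRun invoke first so the socket gets pooled,
// then send the long-running invoke on the same client.
async function invokeWithWarmup(client, InvokeCommand, params) {
  await client.send(new InvokeCommand(toDryRunParams(params)));
  return client.send(new InvokeCommand(params));
}

// Intended usage with the real SDK (left commented out to avoid a network call):
// const { LambdaClient, InvokeCommand } = require('@aws-sdk/client-lambda');
// invokeWithWarmup(lambdaClient, InvokeCommand, { FunctionName: 'f1' })
//   .then(console.log, console.error);

console.log(toDryRunParams({ FunctionName: 'f1' }));
```

Whether the second request actually lands on the warmed-up socket depends on the agent's pooling behavior, so this is worth verifying with a packet capture rather than taking on faith.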
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.