fix: tcp_keepalive socket
When using the botocore.config.Config option tcp_keepalive=True, the TCP socket is configured with the keepalive socket option (socket.SO_KEEPALIVE). By default, Linux sets the TCP keepalive time parameter to 7200 seconds, which far exceeds the AWS NAT Gateway default idle timeout of 350 seconds [source].
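On Linux, the kernel default can be checked via procfs; a minimal sketch (Linux-only, assuming procfs is mounted at the usual path):

```python
# Linux-only sketch: read the kernel's default TCP keepalive idle time
# (seconds). 7200 is the stock default on most distributions.
def default_keepalive_time(path="/proc/sys/net/ipv4/tcp_keepalive_time"):
    with open(path) as f:
        return int(f.read().strip())

print(default_keepalive_time())
```

The same value can be tuned system-wide via sysctl, but that requires host access, which is exactly what managed environments like MWAA or Fargate often don't provide.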
This limitation makes it impossible to receive a response from a Lambda function when all of the following conditions hold:
- The Lambda function is invoked in synchronous mode (InvocationType='RequestResponse').
- The invocation occurs within a VPC where a NAT gateway is required to access the internet from a private subnet.
- The execution time of the Lambda function exceeds 350 seconds.
Therefore, by configuring socket.TCP_KEEPIDLE, socket.TCP_KEEPINTVL, and socket.TCP_KEEPCNT when tcp_keepalive is enabled, during the _compute_socket_options function call, we can overcome this limitation.
socket.IPPROTO_TCP is used for cross-platform compatibility.
The submitted code automatically calculates these values based on the read timeout. Another option would be to have them supplied on the scope/client object.
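As a rough sketch of the approach (illustrative only: the function name and the scaling heuristic below are assumptions, not the PR's actual code), the idea is to derive probe settings from the read timeout so that keepalive probes fire well before the NAT gateway's 350-second idle cutoff:

```python
import socket

# Sketch: compute keepalive socket options from the client's read timeout.
# The heuristic (idle = read_timeout / 4, capped below 350s) is illustrative.
def compute_keepalive_options(read_timeout=60):
    keep_idle = min(max(read_timeout // 4, 1), 300)   # start probing early
    keep_intvl = max(keep_idle // 3, 1)               # interval between probes
    keep_cnt = 4                                      # failed probes before drop
    options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
    # The TCP_KEEP* constants are not available on every platform, so guard
    # each one; socket.IPPROTO_TCP is used for cross-platform compatibility.
    for name, value in (
        ("TCP_KEEPIDLE", keep_idle),
        ("TCP_KEEPINTVL", keep_intvl),
        ("TCP_KEEPCNT", keep_cnt),
    ):
        if hasattr(socket, name):
            options.append((socket.IPPROTO_TCP, getattr(socket, name), value))
    return options
```

With these options applied, the kernel sends probes on the otherwise-idle connection while the Lambda function is still running, so the NAT gateway keeps the flow alive until the response arrives.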
Fixes issues: https://github.com/boto/boto3/issues/2424, https://github.com/boto/boto3/issues/2510 and https://github.com/boto/botocore/issues/2916.
Fargate recently had a similar solution implemented to support this use case: https://aws.amazon.com/blogs/containers/announcing-additional-linux-controls-for-amazon-ecs-tasks-on-aws-fargate/.
This is also impacting me. Unfortunately we are invoking Lambda from ECS via AWS Batch, which doesn't support adding these new options in the task definition yet.
This issue is the same for me. In my case, a Lambda connection read timeout occurs when an EC2 instance launched by CodeBuild tries to connect to Lambda. It works when the Lambda sleeps for 300 seconds, but a read timeout occurs when it sleeps for 450 seconds. The EC2 instance launched by CodeBuild is not in any VPC (the CodeBuild default).
experiencing similar issues
File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/botocore/retryhandler.py", line 247, in __call__
return self._check_caught_exception(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/botocore/retryhandler.py", line 416, in _check_caught_exception
raise caught_exception
File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/aiobotocore/endpoint.py", line 181, in _do_get_response
http_response = await self._send(request)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/aiobotocore/endpoint.py", line 294, in _send
return await self.http_session.send(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/aiobotocore/httpsession.py", line 261, in send
raise ReadTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "https://lambda.eu-west-1.amazonaws.com/2015-03-31/functions/redacted/invocations"
Hi @nateprewitt / @jonathan343 / @alexgromero / @SamRemis,
I have an Airflow instance running on AWS and I'm using the Airflow LambdaInvokeFunctionOperator to run AWS Lambda functions. When a Lambda function takes 5 minutes or longer to execute, we encounter a ReadTimeoutError. There is an issue in the Airflow repo with more information: https://github.com/apache/airflow/issues/41498.
I’ve tested the changes of this PR, and it is working as expected, handling Lambda functions that take up to 15 minutes to run without issues. Is there anything else needed for the review and merging process? I would appreciate any feedback and updates on its status. Thank you!
bump. any movement on this PR? my 200s lambda sync invocations are constantly failing with botocore.ReadTimeoutError on Amazon Linux 2023
Bump. Any news? I'm facing the same issue with a Lambda function invoked at Airflow through LambdaInvokeFunctionOperator
just tagging active contributors on the repo to get some attention: @alexgromero , @nateprewitt , @ubaskota
One-year anniversary for this PR; I still need to use this workaround.
also tagging active contributors: @ubaskota @nateprewitt @arandito @SamRemis
Bumping this issue too, as I'm having the same problems with Lambdas that take more than 350 seconds. I'm forced to use requests with specific configuration to circumvent this problem, but at least there is a working workaround.
We are also facing this issue and it's impacting our production pipelines.
While this is definitely worth addressing, these new defaults are significantly different from the old behavior and would apply to every customer who has opted in to TCP keepalive. Merging this could break existing workflows for users who rely on the current default configuration, and it wouldn't give them any way to opt back out into the old behavior.
To preserve backwards compatibility, I'd be more in favor of making this an opt-in, client-level configuration, the other solution proposed in the description of this PR. I will bring this up to the botocore team to get some more thoughts and see where others stand.
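For illustration, an opt-in, client-level shape might look like the following (hypothetical parameter names, not a real botocore API): tuning is only applied when the caller explicitly supplies it, and the default path keeps today's behavior of a bare SO_KEEPALIVE.

```python
import socket

# Hypothetical opt-in sketch (illustrative only; keepalive_idle/interval/count
# are NOT real botocore Config options). Absent values fall back to the
# current behavior: SO_KEEPALIVE with no probe tuning.
def socket_options_for(tcp_keepalive=True, keepalive_idle=None,
                       keepalive_interval=None, keepalive_count=None):
    if not tcp_keepalive:
        return []
    options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
    for name, value in (
        ("TCP_KEEPIDLE", keepalive_idle),
        ("TCP_KEEPINTVL", keepalive_interval),
        ("TCP_KEEPCNT", keepalive_count),
    ):
        # Applied only when the user opted in AND the platform supports it.
        if value is not None and hasattr(socket, name):
            options.append((socket.IPPROTO_TCP, getattr(socket, name), value))
    return options
```

Since existing callers pass no tuning values, they would see exactly the same socket options as before, which addresses the backwards-compatibility concern.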
Also facing the same issue when waiting for a reply from a Lambda function triggered by an MWAA-based DAG. Applying the proposed fix resolved the issue.
Hey @SamRemis, I've refactored the code to preserve backwards compatibility and make it configuration based. ☕