fix: tcp_keepalive socket
When using the botocore.config.Config option tcp_keepalive=True, the TCP socket is configured with the keepalive socket option (socket.SO_KEEPALIVE). By default, Linux sets the TCP keepalive time parameter to 7200 seconds, which far exceeds the AWS NAT Gateway default idle timeout of 350 seconds [source].
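On Linux, the kernel default can be checked via procfs; a minimal sketch (Linux-only, assuming procfs is mounted at the usual path):

```python
# Linux-only sketch: read the kernel's default TCP keepalive idle time
# (seconds). 7200 is the stock default on most distributions.
def default_keepalive_time(path="/proc/sys/net/ipv4/tcp_keepalive_time"):
    with open(path) as f:
        return int(f.read().strip())

print(default_keepalive_time())
```

The same value can be tuned system-wide via sysctl, but that requires host access, which is exactly what managed environments like MWAA or Fargate often don't provide.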
This limitation makes it impossible to receive a response from a Lambda function when all of the following conditions hold:
- The Lambda function is invoked in synchronous mode (InvocationType='RequestResponse').
- The invocation occurs within a VPC where a NAT gateway is required to access the internet from a private subnet.
- The execution time of the Lambda function exceeds 350 seconds.
Therefore, by configuring socket.TCP_KEEPIDLE, socket.TCP_KEEPINTVL, and socket.TCP_KEEPCNT when tcp_keepalive is enabled, during the _compute_socket_options function call, we can overcome this limitation.
socket.IPPROTO_TCP is used for cross-platform compatibility.
The submitted code automatically calculates these values based on the read timeout. Another option would be to have them supplied on the scope/client object.
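As a rough sketch of the approach (illustrative only: the function name and the scaling heuristic below are assumptions, not the PR's actual code), the idea is to derive probe settings from the read timeout so that keepalive probes fire well before the NAT gateway's 350-second idle cutoff:

```python
import socket

# Sketch: compute keepalive socket options from the client's read timeout.
# The heuristic (idle = read_timeout / 4, capped below 350s) is illustrative.
def compute_keepalive_options(read_timeout=60):
    keep_idle = min(max(read_timeout // 4, 1), 300)   # start probing early
    keep_intvl = max(keep_idle // 3, 1)               # interval between probes
    keep_cnt = 4                                      # failed probes before drop
    options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
    # The TCP_KEEP* constants are not available on every platform, so guard
    # each one; socket.IPPROTO_TCP is used for cross-platform compatibility.
    for name, value in (
        ("TCP_KEEPIDLE", keep_idle),
        ("TCP_KEEPINTVL", keep_intvl),
        ("TCP_KEEPCNT", keep_cnt),
    ):
        if hasattr(socket, name):
            options.append((socket.IPPROTO_TCP, getattr(socket, name), value))
    return options
```

With these options applied, the kernel sends probes on the otherwise-idle connection while the Lambda function is still running, so the NAT gateway keeps the flow alive until the response arrives.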
Fixes issues: https://github.com/boto/boto3/issues/2424, https://github.com/boto/boto3/issues/2510 and https://github.com/boto/botocore/issues/2916.
Fargate recently had a similar solution implemented to support this use case: https://aws.amazon.com/blogs/containers/announcing-additional-linux-controls-for-amazon-ecs-tasks-on-aws-fargate/.
This is also impacting me. Unfortunately we are invoking Lambda from ECS via AWS Batch, which doesn't support adding these new options in the task definition yet.
This issue is the same for me. In my case, a Lambda connection read timeout occurs when an EC2 instance launched by CodeBuild tries to connect to Lambda. It works when the Lambda sleeps for 300 seconds, but a read timeout occurs when it sleeps for 450 seconds. The EC2 instance launched by CodeBuild is not in any VPC (the CodeBuild default).
experiencing similar issues
File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/botocore/retryhandler.py", line 247, in __call__
return self._check_caught_exception(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/botocore/retryhandler.py", line 416, in _check_caught_exception
raise caught_exception
File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/aiobotocore/endpoint.py", line 181, in _do_get_response
http_response = await self._send(request)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/aiobotocore/endpoint.py", line 294, in _send
return await self.http_session.send(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/aiobotocore/httpsession.py", line 261, in send
raise ReadTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "https://lambda.eu-west-1.amazonaws.com/2015-03-31/functions/redacted/invocations"
Hi @nateprewitt / @jonathan343 / @alexgromero / @SamRemis,
I have an Airflow instance running on AWS and I'm using the Airflow LambdaInvokeFunctionOperator to run AWS Lambda functions. When a Lambda function takes 5 minutes or longer to execute, we encounter a ReadTimeoutError. There is an issue in the Airflow repo with more information: https://github.com/apache/airflow/issues/41498.
I’ve tested the changes of this PR, and it is working as expected, handling Lambda functions that take up to 15 minutes to run without issues. Is there anything else needed for the review and merging process? I would appreciate any feedback and updates on its status. Thank you!
bump. any movement on this PR? my 200s lambda sync invocations are constantly failing with botocore.ReadTimeoutError on Amazon Linux 2023
Bump. Any news? I'm facing the same issue with a Lambda function invoked at Airflow through LambdaInvokeFunctionOperator
just tagging active contributors on the repo to get some attention: @alexgromero , @nateprewitt , @ubaskota
One-year anniversary for this PR; I still need to use this workaround.
also tagging active contributors: @ubaskota @nateprewitt @arandito @SamRemis
Bumping this issue too, as I'm having the same problems with Lambdas that take more than 350 seconds. I'm forced to use requests with specific configuration to circumvent this problem, but at least there is a working workaround.
We are also facing this issue and it's impacting our production pipelines.
While this is definitely worth addressing, these new defaults are significantly different from the old behavior and would apply to every customer who has opted in to TCP keepalive. Merging this could break existing workflows for users who rely on the current default configuration, and it wouldn't give them any way to opt back out into the old behavior.
To preserve backwards compatibility, I'd be more in favor of making this an opt-in, client-level configuration, the other solution proposed in the description of this PR. I will bring this up to the botocore team to get some more thoughts and see where others stand.
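For illustration, an opt-in, client-level shape might look like the following (hypothetical parameter names, not a real botocore API): tuning is only applied when the caller explicitly supplies it, and the default path keeps today's behavior of a bare SO_KEEPALIVE.

```python
import socket

# Hypothetical opt-in sketch (illustrative only; keepalive_idle/interval/count
# are NOT real botocore Config options). Absent values fall back to the
# current behavior: SO_KEEPALIVE with no probe tuning.
def socket_options_for(tcp_keepalive=True, keepalive_idle=None,
                       keepalive_interval=None, keepalive_count=None):
    if not tcp_keepalive:
        return []
    options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
    for name, value in (
        ("TCP_KEEPIDLE", keepalive_idle),
        ("TCP_KEEPINTVL", keepalive_interval),
        ("TCP_KEEPCNT", keepalive_count),
    ):
        # Applied only when the user opted in AND the platform supports it.
        if value is not None and hasattr(socket, name):
            options.append((socket.IPPROTO_TCP, getattr(socket, name), value))
    return options
```

Since existing callers pass no tuning values, they would see exactly the same socket options as before, which addresses the backwards-compatibility concern.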
Also facing the same issue when waiting for a reply from a Lambda function triggered by an MWAA-based DAG. Applying the proposed fix resolved the issue.
Hey @SamRemis, I've refactored the code to preserve backwards compatibility and make it configuration based. ☕