AWS Lambda failing w/ RuntimeError: can't start new thread on 1.40.3
### How do you use Sentry?

Self-hosted/on-premise

### Version

1.40.3

### Steps to Reproduce
Recently upgraded from 1.39.2 to 1.40.3.
Our AWS Lambda executors started failing with:
```
Traceback (most recent call last):
< ... omitted ... >
  File "/var/task/sentry_sdk/integrations/threading.py", line 56, in sentry_start
    return old_start(self, *a, **kw)
  File "/var/lang/lib/python3.9/threading.py", line 899, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
```
Rolling back to 1.39.2 prevented the `RuntimeError: can't start new thread` error.

### Expected Result

The Lambda executes as expected.

### Actual Result

A runtime exception raised from the `sentry_sdk/integrations/threading.py` module:
```
  File "/var/task/sentry_sdk/integrations/threading.py", line 56, in sentry_start
    return old_start(self, *a, **kw)
  File "/var/lang/lib/python3.9/threading.py", line 899, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
```
So, as said in the original thread, I'm thinking this has to do with the SDK spawning a new thread by default in 1.40+ and possibly hitting some thread limit on AWS. We've already seen something similar in this issue, also AWS related.
If my suspicion is correct, then this should make it go away:

```python
sentry_sdk.init(
    ...  # your usual stuff
    _experiments={
        "enable_metrics": False,
    }
)
```
@thedanfields Could you give this a shot and see if it makes a difference?
Hi @sentrivana
We upgraded to 1.43.0 and are still seeing this issue, even with the `_experiments={"enable_metrics": False}` fix applied.
The issue happens specifically when a lot of Lambdas are writing to S3 or reading from it.
We are on 1.43.0.
Hi @sentrivana, can you please also confirm which previous version of the SDK this was working in?
Hey folks, thanks for following up.
@gksb88 Which SDK version did you upgrade from? Are you creating your own threads in your app?
@kerenkhatiwada According to the OP,
> Rolling back to 1.39.2 prevented the `RuntimeError: can't start new thread` error.
In this case, the most likely culprit among the changes between the two versions (1.39.2 and 1.40.3) was turning metrics on by default and, by extension, starting the background metrics flusher thread. In @gksb88's case, though, this might be a different issue, since turning metrics off doesn't make a difference.
@sentrivana This was the beginning of our implementation, so we started with 1.40.3, I believe. On threading in our code: we don't kick off any threads ourselves, but like I mentioned, the failures happen to correlate with boto3 S3 calls, so I think those might be kicking off threads under the covers. I haven't had a chance to look into that yet. Replicating it outside of production is tough for us because it requires a lot of concurrent Lambdas running.

A couple of questions:
- Do you suggest we go back to 1.39.2 for now?
- Is there a testing method you recommend to catch this earlier?
@gksb88 Since you said you started seeing this after upgrading to 1.43.0, can you confirm that you didn't encounter the issue on 1.40.3? If that's the case, going back to 1.40.3 should be sufficient.
If you have the capacity, bisecting the exact SDK version where this starts happening would help us a lot in localizing the issue. As far as I can tell, the only additional thread we added around that time was the metrics thread, which should not even get spawned with `_experiments={"enable_metrics": False}`, so I'm at a loss why turning metrics off wouldn't work. It could also be that the issue was always there and was just recently exacerbated by some change in traffic/your setup/etc., which is my current working theory.
Based on what I've read, the thread limit per Lambda function is 1024. It'd be very interesting to know what threads are actually running -- is there any way you can see that?
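One way to see that (a minimal stdlib-only sketch; the function name is just for illustration) is to log `threading.enumerate()` from inside the Lambda handler:

```python
import threading

def dump_threads():
    # List every thread the interpreter currently knows about; calling this
    # from the handler shows how close the function is to the thread limit.
    threads = threading.enumerate()
    print(f"{len(threads)} active threads:")
    for t in threads:
        print(f"  name={t.name!r} daemon={t.daemon}")
    return threads
```

The thread names usually make it obvious who spawned them (SDK worker threads, `ThreadPoolExecutor` workers, etc.).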
The SDK itself should normally spawn ~5 threads max, depending on what you have enabled (transport worker, profiler, metrics, backpressure monitor, maybe a couple more). One thing that's specific to AWS Lambda is that we might spawn an additional thread here -- I'm wondering if that can get out of hand. Could you try setting the AWS Lambda integration option `timeout_warning` to `False` as shown here?
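For reference, turning the timeout warning off looks roughly like this (a sketch -- the DSN and any other init options are placeholders):

```python
import sentry_sdk
from sentry_sdk.integrations.aws_lambda import AwsLambdaIntegration

sentry_sdk.init(
    dsn="...",  # your DSN
    integrations=[
        # timeout_warning=False keeps the SDK from spawning the extra
        # thread that watches for an approaching Lambda timeout.
        AwsLambdaIntegration(timeout_warning=False),
    ],
)
```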
Hi @sentrivana @kerenkhatiwada, it's not easy to replicate the issue without a lot of Lambda invocations running. I see the issue surface when at least 400-500 invocations are active. Our application is not kicking off any extra threads; I can confirm that.
We just started our Sentry integration last month with 1.39.2. That's where we saw the issue. Do you think we need to roll further back?
OK, dug in a little bit more: the Sentry error seems to pop up when S3 activity is happening. It looks like S3 internally uses futures to download things: https://github.com/boto/boto3/blob/9a2673e78018169340db4b85b5ec09906dc380c1/boto3/s3/transfer.py#L383
So with Sentry and S3 both kicking off their own threads, at high invocation counts I can see it tripping over Lambda thread limits. @sentrivana You said earlier that a new thread is kicked off only in 1.40.0+, but I can confirm that we see this behavior in 1.39.2 as well. Any chance Sentry kicks off a thread in 1.39 as well?
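If boto3's internal transfer threads turn out to be part of the problem, one possible mitigation (a sketch only -- nothing in this thread has confirmed it fixes the error, and the bucket/key names are placeholders) is to cap or disable the transfer manager's thread pool via `TransferConfig`:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# boto3's default max_concurrency is 10 threads per transfer; cap it low,
# or pass use_threads=False to do the transfer on the calling thread.
config = TransferConfig(max_concurrency=2)
# config = TransferConfig(use_threads=False)  # no extra threads at all

s3.download_file("my-bucket", "my-key", "/tmp/my-file", Config=config)
```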
@gksb88 The SDK spawns an additional thread in ~1.40, but we were already utilizing threads before that, see https://github.com/getsentry/sentry-python/issues/2741#issuecomment-2058833103:
> The SDK itself should normally spawn about ~5 threads max depending on what you have enabled (transport worker, profiler, metrics, backpressure monitor, maybe couple more).
So the one new additional thread shouldn't make that much of a difference. I assumed it might have been the straw that broke the camel's back, but that doesn't seem to be the case here.
Can you please try out the things I mentioned here? There's one thread that's only spawned in an AWS Lambda context that would be especially interesting to turn off and see whether that makes a difference:
> One thing that's specific for AWS Lambda is that we might spawn an additional thread here -- I'm wondering if that can get out of hand. Could you try setting the AWS Lambda integration option `timeout_warning` to `False` as shown here?
Hard to say whether going back further will help; my hunch is that it won't. What you can try is turning off the features that use threads one by one (unset `profiles_sample_rate` if set, to disable the profiler; set `enable_backpressure_handling=False`; set `_experiments={"enable_metrics": False}`) so that we can figure out whether there's a point where you stop encountering this.
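Concretely, an init call with all of those thread-using features turned off might look like this (a sketch; the DSN is a placeholder):

```python
import sentry_sdk

sentry_sdk.init(
    dsn="...",  # your DSN
    # profiles_sample_rate is deliberately left unset: no profiler thread.
    enable_backpressure_handling=False,      # no backpressure monitor thread
    _experiments={"enable_metrics": False},  # no metrics flusher thread
)
```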
Facing the same issue after upgrading our AWS Lambdas to python3.12. We're using `sentry-sdk==1.45.0`.
```
ERROR happened during Sentry log msg forming: (<class 'AttributeError'>) 'str' object has no attribute 'copy'
Traceback (most recent call last):
  File "/var/lang/lib/python3.12/site-packages/awslambdaric/lambda_runtime_client.py", line 85, in wait_next_invocation
    future = executor.submit(runtime_client.next)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lang/lib/python3.12/concurrent/futures/thread.py", line 179, in submit
    self._adjust_thread_count()
  File "/var/lang/lib/python3.12/concurrent/futures/thread.py", line 202, in _adjust_thread_count
    t.start()
  File "/var/task/sentry_sdk/integrations/threading.py", line 56, in sentry_start
    return old_start(self, *a, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lang/lib/python3.12/threading.py", line 992, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/runtime/bootstrap.py", line 63, in <module>
    main()
  File "/var/runtime/bootstrap.py", line 60, in main
    awslambdaricmain.main([os.environ["LAMBDA_TASK_ROOT"], os.environ["_HANDLER"]])
  File "/var/lang/lib/python3.12/site-packages/awslambdaric/__main__.py", line 21, in main
    bootstrap.run(app_root, handler, lambda_runtime_api_addr)
  File "/var/lang/lib/python3.12/site-packages/awslambdaric/bootstrap.py", line 493, in run
    event_request = lambda_runtime_client.wait_next_invocation()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lang/lib/python3.12/site-packages/awslambdaric/lambda_runtime_client.py", line 88, in wait_next_invocation
    raise FaultException(
awslambdaric.lambda_runtime_exception.FaultException: ('Runtime.LambdaRuntimeClientError', "LAMBDA_RUNTIME Failed to get next invocation: can't start new thread", None)
```
Hey @anton-demydov-zoral, SDK 1.x is not developed anymore outside of security fixes. Can you try with the latest 2.x release to see if this is also an issue in 2.x? If yes, can you try my suggestions from above?
See also our migration guide.