AWS Lambda failing w/ RuntimeError: can't start new thread on 1.40.3
### How do you use Sentry?

Self-hosted/on-premise

### Version

1.40.3

### Steps to Reproduce
Recently upgraded from 1.39.2 to 1.40.3.
Our AWS Lambda executors started failing with:
```
Traceback (most recent call last):
< ... omitted ... >
  File "/var/task/sentry_sdk/integrations/threading.py", line 56, in sentry_start
    return old_start(self, *a, **kw)
  File "/var/lang/lib/python3.9/threading.py", line 899, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
```
Rolling back to 1.39.2 prevented the `RuntimeError: can't start new thread` error.

### Expected Result

The Lambda executes as expected.

### Actual Result

A runtime exception raised from the `sentry_sdk/integrations/threading.py` module:
```
  File "/var/task/sentry_sdk/integrations/threading.py", line 56, in sentry_start
    return old_start(self, *a, **kw)
  File "/var/lang/lib/python3.9/threading.py", line 899, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
```
So, as said in the original thread, I'm thinking this has to do with the SDK spawning a new thread by default in 1.40+ and possibly hitting some thread limit on AWS. We've already seen something similar in this issue, also AWS related.
If my suspicion is correct, then this should make it go away:

```python
sentry_sdk.init(
    ...  # your usual stuff
    _experiments={
        "enable_metrics": False,
    }
)
```
@thedanfields Could you give this a shot and see if it makes a difference?
Hi @sentrivana
We upgraded to 1.43.0 and are still seeing this issue, even with the `_experiments={"enable_metrics": False}` fix applied.
The issue happens specifically when a lot of Lambdas are writing to S3 or reading from it.
We are on 1.43.0.
Hi @sentrivana, can you please also confirm which previous version of the SDK this was working in?
Hey folks, thanks for following up.
@gksb88 Which SDK version did you upgrade from? Are you creating your own threads in your app?
@kerenkhatiwada According to the OP,
> Rolling back to 1.39.2 prevented the `RuntimeError: can't start new thread` error.
In this case, the most likely culprit among the changes between the two versions (1.39.2 and 1.40.3) was turning metrics on by default and, by extension, starting the background metrics flusher thread. In @gksb88's case, though, this might be a different issue, since turning metrics off doesn't make a difference.
@sentrivana This was the beginning of our implementation, so we started with 1.40.3, I believe. On threading in our code: we don't kick off any threads ourselves, but like I mentioned, the failures happen to correlate with boto3 S3 calls, so I think those might be kicking off threads under the covers. I haven't had a chance to look into that yet. Replicating it outside of production is tough for us because it requires a lot of concurrent Lambdas running.

A couple of questions:
- Do you suggest we go back to 1.39.2 for now?
- Is there a testing method you recommend to catch this earlier?
@gksb88 Since you said you started seeing this after upgrading to 1.43.0, can you confirm that you didn't encounter the issue on 1.40.3? If that's the case, going back to 1.40.3 should be sufficient.
If you have the capacity, bisecting the exact SDK version where this starts happening would help us a lot in localizing the issue. As far as I can tell, the only additional thread we added around that time was the metrics thread, which should not even get spawned with `_experiments={"enable_metrics": False}`, so I'm at a loss why turning metrics off wouldn't work. It could also be that the issue was always there and was just recently exacerbated by some change in traffic/your setup/etc., which is my current working theory.
Based on what I've read, the thread limit per Lambda function is 1024. It'd be very interesting to know what threads are actually running -- is there any way you can see that?
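One way to see that (a minimal stdlib-only sketch; the function name is just for illustration) is to log `threading.enumerate()` from inside the Lambda handler:

```python
import threading

def dump_threads():
    # List every thread the interpreter currently knows about; calling this
    # from the handler shows how close the function is to the thread limit.
    threads = threading.enumerate()
    print(f"{len(threads)} active threads:")
    for t in threads:
        print(f"  name={t.name!r} daemon={t.daemon}")
    return threads
```

The thread names usually make it obvious who spawned them (SDK worker threads, `ThreadPoolExecutor` workers, etc.).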
The SDK itself should normally spawn ~5 threads max, depending on what you have enabled (transport worker, profiler, metrics, backpressure monitor, maybe a couple more). One thing that's specific to AWS Lambda is that we might spawn an additional thread here -- I'm wondering if that can get out of hand. Could you try setting the AWS Lambda integration option `timeout_warning` to `False` as shown here?
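For reference, turning the timeout warning off looks roughly like this (a sketch -- the DSN and any other init options are placeholders):

```python
import sentry_sdk
from sentry_sdk.integrations.aws_lambda import AwsLambdaIntegration

sentry_sdk.init(
    dsn="...",  # your DSN
    integrations=[
        # timeout_warning=False keeps the SDK from spawning the extra
        # thread that watches for an approaching Lambda timeout.
        AwsLambdaIntegration(timeout_warning=False),
    ],
)
```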
Hi @sentrivana @kerenkhatiwada, it's not easy to replicate the issue without a lot of Lambda invocations running. I see the issue surface when at least 400-500 invocations are active. Our application is not kicking off any extra threads; I can confirm that.
We just started our Sentry integration last month with 1.39.2. That's where we saw the issue. Do you think we need to roll further back?
OK, dug in a little bit more: the Sentry error seems to pop up when S3 activity is happening. It looks like S3 internally uses futures to download things: https://github.com/boto/boto3/blob/9a2673e78018169340db4b85b5ec09906dc380c1/boto3/s3/transfer.py#L383
So with Sentry and S3 both kicking off their own threads, at high invocation counts I can see it tripping over Lambda thread limits. @sentrivana You said earlier that a new thread is kicked off only in 1.40.0+, but I can confirm that we see this behavior in 1.39.2 as well. Any chance Sentry kicks off a thread in 1.39 as well?
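If boto3's internal transfer threads turn out to be part of the problem, one possible mitigation (a sketch only -- nothing in this thread has confirmed it fixes the error, and the bucket/key names are placeholders) is to cap or disable the transfer manager's thread pool via `TransferConfig`:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# boto3's default max_concurrency is 10 threads per transfer; cap it low,
# or pass use_threads=False to do the transfer on the calling thread.
config = TransferConfig(max_concurrency=2)
# config = TransferConfig(use_threads=False)  # no extra threads at all

s3.download_file("my-bucket", "my-key", "/tmp/my-file", Config=config)
```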
@gksb88 The SDK spawns an additional thread in ~1.40, but we were already utilizing threads before that, see https://github.com/getsentry/sentry-python/issues/2741#issuecomment-2058833103:
> The SDK itself should normally spawn about ~5 threads max depending on what you have enabled (transport worker, profiler, metrics, backpressure monitor, maybe couple more).
So the one new additional thread shouldn't make that much of a difference. I assumed it might have been the straw that broke the camel's back, but that doesn't seem to be the case here.
Can you please try out the things I mentioned here? There's one thread that's only spawned in an AWS Lambda context that would be especially interesting to turn off and see whether that makes a difference:
> One thing that's specific for AWS Lambda is that we might spawn an additional thread here -- I'm wondering if that can get out of hand. Could you try setting the AWS Lambda integration option `timeout_warning` to `False` as shown here?
Hard to say whether going back further will help; my hunch is that it won't. What you can try is turning off the features that use threads one by one (unset `profiles_sample_rate` if set, to disable the profiler; set `enable_backpressure_handling=False`; set `_experiments={"enable_metrics": False}`) so that we can figure out whether there's a point where you stop encountering this.
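Concretely, an init call with all of those thread-using features turned off might look like this (a sketch; the DSN is a placeholder):

```python
import sentry_sdk

sentry_sdk.init(
    dsn="...",  # your DSN
    # profiles_sample_rate is deliberately left unset: no profiler thread.
    enable_backpressure_handling=False,      # no backpressure monitor thread
    _experiments={"enable_metrics": False},  # no metrics flusher thread
)
```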
Facing the same issue after upgrading our AWS Lambdas to python3.12. We're using `sentry-sdk==1.45.0`.
```
ERROR happened during Sentry log msg forming: (<class 'AttributeError'>) 'str' object has no attribute 'copy'
Traceback (most recent call last):
  File "/var/lang/lib/python3.12/site-packages/awslambdaric/lambda_runtime_client.py", line 85, in wait_next_invocation
    future = executor.submit(runtime_client.next)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lang/lib/python3.12/concurrent/futures/thread.py", line 179, in submit
    self._adjust_thread_count()
  File "/var/lang/lib/python3.12/concurrent/futures/thread.py", line 202, in _adjust_thread_count
    t.start()
  File "/var/task/sentry_sdk/integrations/threading.py", line 56, in sentry_start
    return old_start(self, *a, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lang/lib/python3.12/threading.py", line 992, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/runtime/bootstrap.py", line 63, in <module>
    main()
  File "/var/runtime/bootstrap.py", line 60, in main
    awslambdaricmain.main([os.environ["LAMBDA_TASK_ROOT"], os.environ["_HANDLER"]])
  File "/var/lang/lib/python3.12/site-packages/awslambdaric/__main__.py", line 21, in main
    bootstrap.run(app_root, handler, lambda_runtime_api_addr)
  File "/var/lang/lib/python3.12/site-packages/awslambdaric/bootstrap.py", line 493, in run
    event_request = lambda_runtime_client.wait_next_invocation()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lang/lib/python3.12/site-packages/awslambdaric/lambda_runtime_client.py", line 88, in wait_next_invocation
    raise FaultException(
awslambdaric.lambda_runtime_exception.FaultException: ('Runtime.LambdaRuntimeClientError', "LAMBDA_RUNTIME Failed to get next invocation: can't start new thread", None)
```
Hey @anton-demydov-zoral, SDK 1.x is not developed anymore outside of security fixes. Can you try with the latest 2.x release to see if this is also an issue in 2.x? If yes, can you try my suggestions from above?
See also our migration guide.