google-cloud-dotnet
google-cloud-dotnet copied to clipboard
Using built-in DI registration functions like `AddPublisherServiceApiClient` can potentially cause thread starvation during high load at startup.
Using the built-in DI registration functions like AddPublisherServiceApiClient
can potentially cause thread starvation during high load at startup.
Environment details
- OS: Debian 11 in Docker
- .NET version: 7.0
- Package name and version:
Google.Cloud.PubSub.V1->3.2.0
- Running on Google Cloud Run
Steps to reproduce
Unfortunately there is no way to reliably reproduce it, we've encountered this issue only twice in the past two months
-
Create a basic ASP.NET Core with the DI registration code in
Program.cs
:builder.Services.AddPublisherServiceApiClient();
-
Add a basic controller that accepts an
PublisherServiceApiClient
in the constructor and publishes Pub/Sub messages usingawait PublisherServiceApiClient.PublishAsync
-
Send a ton of traffic over a longer period of time.
When a new instance starts (e.g due to scaling), under rare circumstances this can cause the exact same issue as described here https://github.com/googleapis/google-api-dotnet-client/issues/1807, where the fetching of the Application Default Credentials might cause thread starvation.
To summarize this previous issue:
- Soon after application startup, you start receiving requests that try to obtain the Application Default Credentials from DI.
- All but one of the request thread blocks until the singleton is ready, which will be ready once the default credential is fetched.
- More requests keep coming in, and keep blocking on either of the DI singletons. The thread pool starts to deplete.
- The "lucky" thread that went on to fetch the credential gets to a point where it needs to check whether we are running on Compute, and for that an HTTP request is made to the metadata server. This is an async operation that doesn't block the thread, which is freed to do other things (probably attent to yet another incoming request).
- Still more requests keep coming in and more threads keep getting blocked waiting for the credential to be fetched, stressing the thread pool even more. The thread pool is probably growing at this point, but very slowly at first. Requests start being queued up, and some may start timing out.
- When the async HTTP request to the metadata server ends it needs a thread to resume back into, and it's now competing for a thread from a starved thread pool. Basically, we need a thread to finish fetching the credential, while most threads are blocked waiting for the credential to be fetched.
- Eventually the thread pool grows enough, the credential fetching can resume (if under the time we timeout at else, you get the error you initially reported), and the rest of the threads unblock with time.
As you can see in that issue, we've been hit by this exact issue 3 years ago, but in a different service of ours, under different circumstances using some manual initialization logic.
The reason I'm reporting this again is that this is now happening to us using the build-in DI functions like AddPublisherServiceApiClient
, which one would assume to the right thing™ already and don't run into this issue.
Not sure what the best way to solve this is, as the async nature of the credentials fetching makes this inherently harder with DI. Maybe the library should have an internal synchronization mechanism when fetching the Application Default Credentials? Or should the documentation mention that we should do this manually beforehand?
Thanks for the detailed report.
I'll assign this to Amanda who looked into this previously... I suspect there are no "easy" fixes for this, but we'll see. (If we could tell ASP.NET Core to throttle requests for a little while, that could potentially help.)
The aspect I think I'd focus on is why threads are blocking rather than using async operations. If we're doing something there that could become async (or offer a new async overload) that might be helpful. But if it's the DI system itself which is blocking threads, that could be tricky.
OK, I might I have a fix for this. It's exactly the same issue as before, sorry we dindn't realize earlier and re-introduced it ourselves.
We need to discuss some things internally, but you can implement this workaround while we do so.
The workaround is basically to fetch the application default credential yourself so it has already been obtained by the time the clients are built. Note that you don't have to do anything with it, as the application default credential is cached the first time is fetched.
Your DI code would look like this:
// Workaround for https://github.com/googleapis/google-cloud-dotnet/issues/11092
GoogleCredential.GetApplicationDefaultCredential();
builder.Services.AddPublisherServiceApiClient();
I'll get back on this thread when we know more.
We are going to document the workaround (#11142) while we evaluate our options. So I'll downgrade this issue to a P3 not because we are not looking into it, but because there's a straightforward documented workaround.
I'll come back when I know more.