AspNetCore.Diagnostics.HealthChecks

Too-frequent requests to the health check endpoints "/health/live" and "/health/ready" intermittently cause Unhealthy error logs

Open terryhuangbo opened this issue 2 years ago • 11 comments

Please fill in the following sections to help us fix the issue

What happened: When I use AspNetCore.HealthChecks.AzureServiceBus 5.1.1, AspNetCore.HealthChecks.AzureStorage 5.0.1 and AspNetCore.HealthChecks.AzureKeyVault 5.0.1 for health checking, requests to the health check endpoints "/health/live" and "/health/ready" return healthy results, but too-frequent requests intermittently produce Unhealthy error logs like the following:

Health check AzureServiceBus-High completed after 5001.3016ms with status Unhealthy and description '(null)'
System.Threading.Tasks.TaskCanceledException: The operation was canceled.
   at Azure.Core.CancellationHelper.ThrowOperationCanceledException(Exception innerException, CancellationToken cancellationToken)
   at Azure.Core.CancellationHelper.ThrowIfCancellationRequested(CancellationToken cancellationToken)
   at Azure.Core.Pipeline.ResponseBodyPolicy.ThrowIfCancellationRequestedOrTimeout(CancellationToken originalToken, CancellationToken timeoutToken, Exception inner, TimeSpan timeout)
   at Azure.Core.Pipeline.ResponseBodyPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async)
   at Azure.Core.Pipeline.RedirectPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async)
   at Azure.Core.Pipeline.RetryPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async)
   at Azure.Core.Pipeline.RetryPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async)
   at Azure.Core.Pipeline.HttpPipeline.SendRequestAsync(Request request, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Administration.HttpRequestAndResponse.SendHttpRequestAsync(Request request, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Administration.HttpRequestAndResponse.GetEntityAsync(String entityPath, String query, Boolean enrich, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Administration.ServiceBusAdministrationClient.GetQueueRuntimePropertiesAsync(String name, CancellationToken cancellationToken)
   at HealthChecks.AzureServiceBus.AzureServiceBusQueueHealthCheck.CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken)

What you expected to happen: No error logs should appear.

How to reproduce it (as minimally and precisely as possible): As described above. Source code sample:

services.AddHealthChecks()
    .AddAzureBlobStorage(
        configuration["EmailTemplateBlobConnectionStr"],
        name: "AzureStorage",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { HealthCheckConstants.ReadinessTag })
    .AddAzureServiceBusQueue(
        name: "AzureServiceBus-High",
        connectionString: configuration["EmailDeliveryServiceBusConnection"],
        queueName: "emaildelivery-high",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { HealthCheckConstants.ReadinessTag })
    .AddAzureServiceBusQueue(
        name: "AzureServiceBus-Normal",
        connectionString: configuration["EmailDeliveryServiceBusConnection"],
        queueName: "emaildelivery-normal",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { HealthCheckConstants.ReadinessTag })
    .AddAzureServiceBusQueue(
        name: "AzureServiceBus-Low",
        connectionString: configuration["EmailDeliveryServiceBusConnection"],
        queueName: "emaildelivery-low",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { HealthCheckConstants.ReadinessTag })
    .AddAzureKeyVault(
        name: "AzureKeyVault",
        keyVaultServiceUri: new Uri($"https://{configuration["KeyVaultName"]}.vault.azure.net/"),
        credential: new DefaultAzureCredential(),
        setup: options => { options.AddSecret(configuration["KeyVaultSecrectName"]); },
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { HealthCheckConstants.ReadinessTag });

Anything else we need to know?: Nothing

Environment:

  • .NET Core version: net5.0
  • Healthchecks version: 5.x.x
  • Operating system: Windows 10
  • Others: No

(Screenshot attached: WeChat Screenshot_20210903155428)
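For context on how endpoints like "/health/live" and "/health/ready" are typically mapped in .NET 5 (this mapping is not part of the report; the predicate and tag usage below are my assumptions based on the registration code above):

using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;

// Hypothetical endpoint mapping (inside Startup.Configure):
// liveness runs no registered checks, readiness runs only the checks
// tagged with the readiness tag used in the registrations above.
app.UseEndpoints(endpoints =>
{
    endpoints.MapHealthChecks("/health/live", new HealthCheckOptions
    {
        Predicate = _ => false // only confirms the process responds
    });

    endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
    {
        Predicate = r => r.Tags.Contains(HealthCheckConstants.ReadinessTag)
    });
});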

terryhuangbo avatar Sep 03 '21 07:09 terryhuangbo

I'm also seeing this. Any solution?

thisispaulsmith avatar Sep 23 '21 14:09 thisispaulsmith

Hi

I can't reproduce the issue! After seeing the call stack, this is an error on the retry in the Service Bus connection, i.e. the check is working and reports an issue after 5 seconds of retries!

unaizorrilla avatar Sep 23 '21 15:09 unaizorrilla

We don't see any of these errors under normal use, only on the health check endpoints. We're at the point where it's happening so often that we're going to disable the specific health check.

thisispaulsmith avatar Sep 27 '21 11:09 thisispaulsmith

We're seeing this issue as well, since we run regular liveness and health checks against the Service Bus, using AddAzureServiceBusQueue() to register the actual check. If you need more information, please let me know.

Meandron avatar Sep 28 '21 13:09 Meandron

When I run the following code from a local console app, no exception is raised. The code more or less corresponds to the one in the AzureServiceBusQueueHealthCheck:

static async System.Threading.Tasks.Task Main(string[] args)
{
    try
    {
        while (true)
        {
            ServiceBusAdministrationClient managementClient = new ServiceBusAdministrationClient(MYENDPOINT);
            var x = await managementClient.GetQueueRuntimePropertiesAsync(MYQUEUE, CancellationToken.None);
            Thread.Sleep(5000);
        }
    }
    catch (Exception ex)
    {
        throw ex;
    }
}

I took this code as a custom implementation, added it to my .NET Core application and deployed it to k8s, where a liveness probe with a period of 30s is set. It raised the same exceptions @terryhuangbo described.

The infrastructure being checked was the same, i.e. the connection string pointed to the same Service Bus queue in the cloud.

@unaizorrilla: Could it be that a CancellationToken passed/present on the live system / .NET Core side when performing the check somehow causes the issue? We're stuck... We see no option other than disabling the liveness probe, as the pods get into a recreation loop while the probe keeps failing.
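To test that theory locally, here is a minimal sketch that mimics what a short probe timeout would effectively do to the check: pass a token that cancels after one second instead of CancellationToken.None. The one-second value and the helper name are illustrations only, not anything from the library:

using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus.Administration;

// Same call as the console repro above, but with a short timeout token.
static async Task ProbeOnceAsync(string connectionString, string queueName)
{
    var client = new ServiceBusAdministrationClient(connectionString);

    // Cancel the call after 1 second (hypothetical probe timeout).
    using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(1));

    // On a slow round trip this throws TaskCanceledException, which is
    // exactly what the health check then surfaces as Unhealthy.
    var props = await client.GetQueueRuntimePropertiesAsync(queueName, cts.Token);
    Console.WriteLine($"{queueName}: {props.Value.TotalMessageCount} messages");
}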

Meandron avatar Oct 01 '21 12:10 Meandron

(Quoting my earlier comment above.)

It turned out that increasing the liveness check's timeout (1 sec -> 20 sec) makes it better, but the error still shows up from time to time...

Meandron avatar Oct 01 '21 14:10 Meandron

while (true)
{
    ServiceBusAdministrationClient managementClient = new ServiceBusAdministrationClient(MYCON);
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    var x = await managementClient.GetQueueRuntimePropertiesAsync(MYQUEUE, CancellationToken.None);
    stopwatch.Stop();
    Console.WriteLine("Elapsed Time is {0} ms", stopwatch.ElapsedMilliseconds);
    Thread.Sleep(1000);
}

Executed locally, using the infrastructure from the cloud. The results are quite telling:

Elapsed Time is 90 ms
Elapsed Time is 54 ms
Elapsed Time is 39 ms
Elapsed Time is 147 ms
Elapsed Time is 289 ms
Elapsed Time is 46 ms
Elapsed Time is 541 ms
Elapsed Time is 42 ms
Elapsed Time is 234 ms
Elapsed Time is 1059 ms
Elapsed Time is 47 ms
Elapsed Time is 215 ms
Elapsed Time is 1514 ms
Elapsed Time is 5314 ms
Elapsed Time is 1442 ms

With spikes above one second and even above five seconds, any probe or check timeout in that range will intermittently cancel the call, which matches the TaskCanceledException above.

Meandron avatar Oct 04 '21 12:10 Meandron

We've had to remove this health check. Will watch this thread for updates. Even increasing the timeout didn't help.

thisispaulsmith avatar Oct 27 '21 13:10 thisispaulsmith

Is this issue still a thing @thisispaulsmith @Meandron?

molinch avatar Jan 26 '22 06:01 molinch

Yes, it is still there. We worked around it by creating our own HealthCheck class which does pretty much the same but only logs warnings, as otherwise our logs get spammed with "false" error logs.
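For reference, a minimal sketch of that kind of workaround under my assumptions (the class name is made up, and returning Degraded instead of Unhealthy is just one way to keep the endpoint green while the failure is logged as a warning); it is not the library's implementation:

using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus.Administration;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Microsoft.Extensions.Logging;

// Hypothetical replacement check: same ServiceBusAdministrationClient call as
// the built-in check, but failures are logged as warnings and reported as Degraded.
public class WarnOnlyServiceBusQueueHealthCheck : IHealthCheck
{
    private readonly string _connectionString;
    private readonly string _queueName;
    private readonly ILogger<WarnOnlyServiceBusQueueHealthCheck> _logger;

    public WarnOnlyServiceBusQueueHealthCheck(
        string connectionString,
        string queueName,
        ILogger<WarnOnlyServiceBusQueueHealthCheck> logger)
    {
        _connectionString = connectionString;
        _queueName = queueName;
        _logger = logger;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            var client = new ServiceBusAdministrationClient(_connectionString);
            await client.GetQueueRuntimePropertiesAsync(_queueName, cancellationToken);
            return HealthCheckResult.Healthy();
        }
        catch (Exception ex)
        {
            // Swallow the exception here so the framework does not emit an
            // error-level entry; record it ourselves as a warning instead.
            _logger.LogWarning(ex, "Service Bus queue check for {QueueName} failed", _queueName);
            return HealthCheckResult.Degraded("Service Bus check failed", ex);
        }
    }
}

One way to register it is AddTypeActivatedCheck, which resolves the logger from DI and forwards the connection string and queue name as constructor arguments.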

We also receive those kinds of error logs for other health check types, like BlobStorage (see the attached screenshot; notice the completion time).

Meandron avatar Jan 26 '22 07:01 Meandron

I'm experiencing the same thing... It almost seems like we are getting throttled when using the management client to hook into Azure with multiple replicas against the same Service Bus.

By default you have 1000 credits, and every management call (which is what this health check does) costs 10 credits. If you have a bunch of replicas, it can add up to throttling. See: https://docs.microsoft.com/en-us/azure/service-bus-messaging/service-bus-throttling#what-are-the-credit-limits
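As a rough worked example (my own numbers, not from this report): the registration above adds three Service Bus queue checks, so one probe hit costs 3 × 10 = 30 credits. With 10 replicas each probed once per second, that is 10 × 30 = 300 credits per second spent on health checks alone, already a large chunk of the 1000 credits available per second according to the linked doc; more replicas or shorter probe intervals push it over.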

When using the premium Service Bus tier, there are no issues...

ferryferry avatar Apr 13 '22 14:04 ferryferry

I have the same issue with the NpgSqlHealthCheck:

Health check Database with status Unhealthy completed after 18.4543ms with message '(null)'
System.Threading.Tasks.TaskCanceledException:
   at Npgsql.TaskExtensions+<WithCancellation>d__1`1.MoveNext (Npgsql, Version=6.0.6.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Npgsql.TaskExtensions+<WithTimeout>d__0`1.MoveNext (Npgsql, Version=6.0.6.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Npgsql.Internal.NpgsqlConnector+<ConnectAsync>d__198.MoveNext (Npgsql, Version=6.0.6.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Npgsql.Internal.NpgsqlConnector+<RawOpen>d__196.MoveNext (Npgsql, Version=6.0.6.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Npgsql.Internal.NpgsqlConnector+<<Open>g__OpenCore|191_1>d.MoveNext (Npgsql, Version=6.0.6.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Npgsql.Internal.NpgsqlConnector+<Open>d__191.MoveNext (Npgsql, Version=6.0.6.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Npgsql.UnpooledConnectorSource+<Get>d__6.MoveNext (Npgsql, Version=6.0.6.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Npgsql.NpgsqlConnection+<<Open>g__OpenAsync|45_0>d.MoveNext (Npgsql, Version=6.0.6.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at HealthChecks.NpgSql.NpgSqlHealthCheck+<CheckHealthAsync>d__4.MoveNext (HealthChecks.NpgSql, Version=6.0.2.0, Culture=neutral, PublicKeyToken=null: /_/src/HealthChecks.NpgSql/NpgSqlHealthCheck.cs:42)

ddegasperi avatar Dec 23 '22 09:12 ddegasperi

As @ferryferry said, the core issue is (likely) rate limiting on the server side, so there is nothing to fix on the caller side - you can just choose appropriate intervals and timeouts for the checks, or use the premium Service Bus tier.
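If adjusting the probe interval isn't enough, one caller-side knob is the per-registration timeout. A minimal sketch using the standard HealthCheckRegistration.Timeout property (the 30-second value is just an example; services is the IServiceCollection, and this should run after AddHealthChecks so the registrations exist):

using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Give every registered check up to 30 seconds before its CancellationToken
// is cancelled, so transient latency spikes are less likely to surface as Unhealthy.
services.Configure<HealthCheckServiceOptions>(options =>
{
    foreach (var registration in options.Registrations)
    {
        registration.Timeout = TimeSpan.FromSeconds(30);
    }
});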

@ddegasperi Let's track npgsql in another issue.

sungam3r avatar Dec 23 '22 12:12 sungam3r

@ddegasperi Let's track npgsql in another issue.

@sungam3r ok, I've created a new issue #1611

ddegasperi avatar Jan 02 '23 08:01 ddegasperi

@sungam3r I have the same issue with the health check and checked the logs for throttled requests, and there are none. Also, the limit is 1000 requests per second (where management requests count as 10 and regular reads and writes as 1), so I highly doubt that the issue is related to throttling.

isaevdan avatar Mar 20 '23 12:03 isaevdan

Was anyone ever able to find a solution for this?

robinmanuelthiel avatar Dec 23 '23 13:12 robinmanuelthiel

I'm having this issue with Npgsql as well. .NET 8.0.2.

rfink avatar Feb 16 '24 13:02 rfink