AspNetCore.Diagnostics.HealthChecks icon indicating copy to clipboard operation
AspNetCore.Diagnostics.HealthChecks copied to clipboard

The DocumentDbHealthCheck makes calls to OpenAsync() for each health check that should only happen once

Open nickkarpuk opened this issue 4 years ago • 6 comments

What happened: DocumentDbHealthCheck makes a call to OpenAsync() on every check but that causes metadata requests to query databases and collections. Those queries go against internal metadata RU limit which users cannot control. So at some point with a bigger volume of requests when users breach this internal RU limit for metadata the health checks start failing even though the regular calls to query or insert data are fine. Here is explanation from Microsoft support: "To provide more clarity, there are two types of operations in Cosmos DB: On one hand operations on documents/data such as Create/Delete/Update/Read/Query are standard/transactional operations and their utilizations are measured against the RU/throughput provisioned on collections . On the other hand, operations on databases, collections, permissions and offers are considered master resource/metadata operations and the throughput for this type of operations is significantly smaller compare to the throughput for standard operations. Examples of master resources queries/operations are CreateDatabaseQuery and CreateDocumentCollectionQuery. "

OpenAsync method is just like that. You can see on the following screenshot where from Azure app insights we saw that a single health check that only uses DocumentDbHealthCheck is making multiple metadata queries: image

How to reproduce it (as minimally and precisely as possible): To get to a point where healthcheck fails you will need to call a health check very frequently or have a large number of services using the same cosmos DB container. We had about 3 dozen services each running health checks every 10 seconds when we started having the issues. It results in getting more and more 429 error responses from Cosmos DB which first increases the time it takes to perform a healthcheck and then starts making those checks fail. To see the queries going to cosmos DB you can run health check in VS and see requests going to cosmos DB in diagnostic tools.

Environment:

  • .NET Core version 2.2
  • Healthchecks version Any
  • Operative system: Any
  • Others:

nickkarpuk avatar Mar 26 '20 19:03 nickkarpuk

Hi @nickkarpuk

I try to address this, tomorrow! thanks for fill the issue!

unaizorrilla avatar Mar 31 '20 17:03 unaizorrilla

I'm interested in the solution to this, I don't think the default retry policy built into the Cosmos SDK client should be in play here. If there are retries involved, that's indicative of degraded performance at best, and even if eventually successful, it should be reported as degraded. That should be a signal to the ops team that additional scale is needed in cosmos to keep the instance healthy

kensykora avatar Oct 21 '20 20:10 kensykora

Seems this issue is hanging since 2020, and coming across this today. The same issue was opened against Microsoft's Cosmos DB SDK, and they closed and advised bringing this issue to this library. https://github.com/MicrosoftDocs/azure-docs/issues/50080

As the OP indicated, the documentation for OpenAsync() guides towards calling once to avoid a slow startup, but the method should not periodically be called to check health. As a recommendation, a SELECT 1 query would be a better health probe.

Further, be advised that CosmosDB guidance recommends to use a single client instance - the DocumentDbClient should not be created new in each health check. The design should allow for injection of the DocumentDbClient as a dependency so that it can be a singleton. Creation/disposal of DocumentDbClient objects can lead to ephemeral port exhaustion:

https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/performance-tips?tabs=trace-net-core#networking

robertbaumann avatar Nov 23 '22 23:11 robertbaumann

@robertbaumann PR is welcome. It seems to be rather easy to move OpenAsync out of each call and place some read/ping call instead.

sungam3r avatar Nov 25 '22 14:11 sungam3r

Hi everyone,

There has been a recent PR merged (#1985), which allows specyfing a database and collection name. If those arguments are provided, then a different check is executed:

documentDbClient.ReadDocumentCollectionAsync(...)

instead of documentDbClient.OpenAsync(...).

It's actually not feasible to change from OpenAsync(...) to query SELECT 1 FROM because you'd need a context of database and collection to execute a query.

So perhaps, the approach without database and collection name should not be a recommended way to check health from Cosmos DB? What do you think?

cieciurm avatar Aug 08 '23 06:08 cieciurm

Hi everyone,

There has been a recent PR merged (#1985), which allows specyfing a database and collection name. If those arguments are provided, then a different check is executed:

documentDbClient.ReadDocumentCollectionAsync(...)

instead of documentDbClient.OpenAsync(...).

It's actually not feasible to change from OpenAsync(...) to query SELECT 1 FROM because you'd need a context of database and collection to execute a query.

So perhaps, the approach without database and collection name should not be a recommended way to check health from Cosmos DB? What do you think?

That's what I've done here. We just check to see if the database id and container ids are not null or empty and then run a select 1 query for each container. I'm happy to raise a PR for it?

LeeAtCfc avatar May 10 '24 16:05 LeeAtCfc