
kvserver: improve tenant_id CPU observability

Open tbg opened this issue 7 months ago • 2 comments

Motivated by https://github.com/cockroachlabs/support/issues/3297.

This commit

  • adds a new replicas.cpunanospersecond metric, which aggregates the Replica ReqCPUNanosPerSecond at the tenant level.
  • adds a tenant_id tag to CPU profiles.

This should simplify investigations into tenant-induced overload: the new metric should often help pinpoint the set of hot tenants, and CPU profiles can then help dig into the specific code paths a hot tenant is exercising. The existing per-tenant request count metrics (both received by KV and sent by the tenant Pod) round this out for a comprehensive picture.
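To make the tenant-level aggregation concrete, here is a small sketch of summing per-replica CPU rates by tenant and ranking the hottest tenants. The types and function names (`replicaLoad`, `cpuByTenant`, `hottestTenants`) are assumptions for illustration only and do not reflect the actual kvserver code.

```go
package main

import "sort"

// replicaLoad is a hypothetical stand-in for one replica's CPU rate.
type replicaLoad struct {
	tenantID             uint64
	reqCPUNanosPerSecond float64
}

// cpuByTenant sums the per-replica CPU rates for each tenant.
func cpuByTenant(replicas []replicaLoad) map[uint64]float64 {
	byTenant := make(map[uint64]float64)
	for _, r := range replicas {
		byTenant[r.tenantID] += r.reqCPUNanosPerSecond
	}
	return byTenant
}

// hottestTenants returns up to n tenant IDs ordered by descending CPU rate,
// breaking ties by ascending tenant ID so the result is deterministic.
func hottestTenants(byTenant map[uint64]float64, n int) []uint64 {
	ids := make([]uint64, 0, len(byTenant))
	for id := range byTenant {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		a, b := byTenant[ids[i]], byTenant[ids[j]]
		if a != b {
			return a > b
		}
		return ids[i] < ids[j]
	})
	if n < 0 {
		n = 0
	}
	if n < len(ids) {
		ids = ids[:n]
	}
	return ids
}
```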

Release note: the replicas.cpunanospersecond metric was added. Notably, when child labels are enabled, it exposes evaluation-related Replica CPU usage by tenant.

Epic: none

tbg avatar May 12 '25 09:05 tbg

This change is Reviewable

cockroach-teamcity avatar May 12 '25 09:05 cockroach-teamcity

Had to add some preliminary cleanup commits to address data races - the correct use of tenant metrics is notoriously tricky because the metrics must not be used once the replica is destroyed (which could drop the tenant's refcount to zero, which would destroy the metrics object as well). It's really hard to synchronize cheaply with replica destruction, since we haven't made it straightforward.

I simplified the metrics lifecycle management so that it's now (hopefully) more obvious how this works. The new metric is updated by acquiring a reference on the metrics object, applying the update, and releasing the reference. If an update arrives after the replica's own metrics reference has been released, then instead of the race (which thankfully was caught in CI) we now acquire a new reference, make one bogus update, and release it again. This is all well-defined behavior that "looks" exactly like the normal lifecycle of the metrics, and it's a good way to handle this annoying special case.
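The acquire/update/release pattern described above might look roughly like the following sketch. It is a simplified, hypothetical model (the `tenantRegistry` type and its methods are invented for illustration), not the actual kvserver implementation.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// tenantMetrics is a hypothetical per-tenant metrics object.
type tenantMetrics struct {
	cpuNanos atomic.Int64 // accumulated CPU nanos; safe for concurrent updates
}

type tenantEntry struct {
	refs int
	m    *tenantMetrics
}

// tenantRegistry hands out refcounted metrics objects keyed by tenant ID.
type tenantRegistry struct {
	mu      sync.Mutex
	tenants map[uint64]*tenantEntry
}

func newTenantRegistry() *tenantRegistry {
	return &tenantRegistry{tenants: make(map[uint64]*tenantEntry)}
}

// acquire returns the tenant's metrics, creating them if needed, and takes
// a reference that the caller must later drop with release.
func (r *tenantRegistry) acquire(id uint64) *tenantMetrics {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.tenants[id]
	if !ok {
		e = &tenantEntry{m: &tenantMetrics{}}
		r.tenants[id] = e
	}
	e.refs++
	return e.m
}

// release drops one reference and destroys the metrics at refcount zero.
func (r *tenantRegistry) release(id uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.tenants[id]
	if !ok {
		return
	}
	e.refs--
	if e.refs == 0 {
		delete(r.tenants, id)
	}
}

// recordCPU holds its own reference for the duration of the update, so it
// never touches a destroyed metrics object. If the tenant's last reference
// is already gone, this creates a fresh object, applies one (bogus) update,
// and destroys it again.
func (r *tenantRegistry) recordCPU(id uint64, nanos int64) {
	m := r.acquire(id)
	defer r.release(id)
	m.cpuNanos.Add(nanos)
}

// liveTenants reports how many tenants currently have a metrics object.
func (r *tenantRegistry) liveTenants() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.tenants)
}
```

The design choice here is that every updater pays for a short lock acquisition instead of trying to synchronize with replica destruction; the late update is wasted work, but it follows the same lifecycle as any other use of the metrics.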

I'll request a review once CI is happy.

tbg avatar Jun 13 '25 10:06 tbg

Tftr!

bors r+

tbg avatar Jun 24 '25 12:06 tbg