kvserver: improve tenant_id CPU observability
Motivated by https://github.com/cockroachlabs/support/issues/3297.
This commit
- adds a new `replicas.cpunanospersecond` metric, which aggregates the Replica ReqCPUNanosPerSecond at the tenant level.
- adds a `tenant_id` tag to CPU profiles.
This should simplify investigations into tenant-induced overload: the new metric should often help pinpoint the set of hot tenants, and CPU profiles can then help dig into the specific code paths those tenants are exercising. This can be rounded out with the existing per-tenant request count metrics (both received by KV and sent by the tenant Pod) for a comprehensive picture.
Release note: the replicas.cpunanospersecond metric was added. Notably, when
child labels are enabled, it exposes evaluation-related Replica CPU usage by
tenant.
Epic: none
Had to add some preliminary cleanup commits to address data races: the correct use of tenant metrics is notoriously tricky because the metrics must not be used once the replica is destroyed (which could drop the tenant's refcount to zero, which would destroy the metrics object as well). Synchronizing cheaply with replica destruction is hard, and the existing code doesn't make it straightforward.
I simplified the metrics lifecycle management so that it's now (hopefully) more obvious how this works. The new metric is updated by acquiring and then releasing a reference on the metrics object. Instead of the race (which thankfully was caught in CI), an update that arrives after the replica's metrics reference has been released now acquires a new reference, makes one bogus update, and releases that reference again. This is all well-defined behavior that "looks" exactly like the normal lifecycle of the metrics, and it's a good way to handle this annoying special case.
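To illustrate the acquire/update/release pattern described above, here is a simplified sketch. All names (`registry`, `tenantMetrics`, `recordCPU`, and so on) are hypothetical and this is not the actual kvserver code; it only assumes that a tenant's metrics object is destroyed when its refcount drops to zero.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// tenantMetrics is a hypothetical stand-in for the per-tenant metrics object.
type tenantMetrics struct {
	cpuNanosPerSecond atomic.Int64
}

type entry struct {
	refs    int
	metrics *tenantMetrics
}

// registry hands out ref-counted tenantMetrics. An entry is destroyed
// when its refcount drops to zero.
type registry struct {
	mu      sync.Mutex
	tenants map[uint64]*entry
}

func newRegistry() *registry {
	return &registry{tenants: make(map[uint64]*entry)}
}

// acquire returns the tenant's metrics, creating them if needed,
// and takes a reference.
func (r *registry) acquire(id uint64) *tenantMetrics {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.tenants[id]
	if !ok {
		e = &entry{metrics: &tenantMetrics{}}
		r.tenants[id] = e
	}
	e.refs++
	return e.metrics
}

// release drops a reference and destroys the entry at zero.
func (r *registry) release(id uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.tenants[id]
	if !ok {
		return
	}
	e.refs--
	if e.refs == 0 {
		delete(r.tenants, id)
	}
}

// refCount reports the current number of references (0 if no entry exists).
func (r *registry) refCount(id uint64) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	if e, ok := r.tenants[id]; ok {
		return e.refs
	}
	return 0
}

// recordCPU updates the metric while holding its own short-lived reference,
// so it never touches an object that has already been destroyed.
func (r *registry) recordCPU(id uint64, nanos int64) {
	m := r.acquire(id)
	defer r.release(id)
	m.cpuNanosPerSecond.Store(nanos)
}

func main() {
	r := newRegistry()
	r.acquire(5)              // the replica's long-lived reference
	r.recordCPU(5, 1000)      // normal update; the object stays alive
	fmt.Println(r.refCount(5)) // 1
	r.release(5)              // replica destroyed; refcount hits zero
	r.recordCPU(5, 2000)      // late update: fresh object, bogus update, destroyed again
	fmt.Println(r.refCount(5)) // 0
}
```

In this sketch the late update lands on a short-lived fresh object rather than on the destroyed one, which mirrors the "bogus update" case: wasteful, but indistinguishable from a normal create/update/destroy cycle.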
I'll request a review once CI is happy.
Tftr!
bors r+