
Provide E2E SQL and/or KV latency & error rate time-series metrics that indicate issues with CRDB rather than poor workload given cluster provisioning


TLDR: Skip to the "Describe the solution you'd like" section!!!

Is your feature request related to a problem? Please describe. In the case of our cloud product:

  • We want to page CRL eng including SRE in case of urgent & actionable issues with CRDB.
  • Corollary to above: We do NOT want to page CRL eng including SRE in case a customer writes a poor workload given how the cluster is provisioned (which in dedicated the customer controls).

The need for the above is clear in the case of our cloud product, but on-prem would benefit too. Being able to quickly tell issues with CRDB apart from issues with a customer workload makes DB eng, SRE, TSE, and customers more efficient.

Describe alternatives you've considered The alternative is the current CC monitoring stack. We alert on various signals of definite trouble. For example:

  1. If the sqlprober can't SELECT a test row, page SRE.
  2. If the kvprober can't read & shadow write a key dedicated to probing on a "randomly" chosen range, page SRE.
  3. If proposals are "stuck in raft", that is, they don't get committed in a timely manner, page SRE.

This is fine. If the above signals indicate trouble, there is trouble, as the above signals don't fire on contention or other workload issues (in the case of the probers, the workload is controlled by CRL eng and thus isn't problematic). That is, the false positive rate is low [1].

[1] modulo that until admission control rolls out we will be paged when a dedicated cluster is out of CPU

What's the problem then? The false negative rate is too high! This means we miss outages in production. It also means we can't be confident that, if no alerts are firing, any remaining trouble is a workload issue. The latter possibly matters more to operational efficiency than the former, though both are important.

Examples of outages the above alerting config might miss:

  1. Issues with the SQL table leasing machinery (unless issue happened to affect the tiny sqlprober tables).
  2. Badly shaped LSMs (unless so badly shaped that kvprober writes slowed down).
  3. kvserver deadlocks (could be caught by kvprober but only slowly given randomized range selection)
  4. There are surely many more...

Describe the solution you'd like What matters to users is that SQL is available and served at low latencies. Many, many different concrete production issues lead to high SQL latency or error rates. Why don't we alert on SQL latency & error rate then? Doing this would lead to a low false negative rate, since any issue that actually matters to a customer translates into SQL being either unavailable or served at high latencies. Alerting on the symptoms the user cares about is great. Why alert on anything else? If we are not providing to users what they care about, of course we should page CRL eng including SRE!!!

The obvious issue is that doing this naively will lead to alerting on poor workload given how the cluster is provisioned, which will push ops load way way up & burn us all out. For example:

  1. If a cluster is out of CPU, latencies will shoot up.
  2. If a workload includes significant contention, latencies will shoot up.
  3. If a workload does lots of full table scans, latencies will shoot up.

The crux is: How can we measure E2E SQL latency & error rate via time-series metrics in a way that doesn't include the above workload issues? For now let's just consider latency.

For 1 & 2, we need to measure latency while excluding the portion of latency contributed by certain components:

  1. Don't include latency incurred sitting in admission control queues.
  2. Don't include latency incurred sitting in any of the KV concurrency machinery (lock table? etc.).

3 is harder! One idea is bucketing queries by their expected cost and setting different alerting thresholds on the buckets (e.g. full table scans wouldn't have an SLA but single-row reads would). Another idea is normalizing the measured latency by expected cost, e.g. you can imagine dividing each full table scan's latency by the number of rows read or similar. These are only half-formed ideas. To the best of my knowledge, the bucketing approach has worked well on a Google time-series database. That database does not provide an interface as flexible, and thus as easy to foot-gun yourself with, as SQL. But still I contend that 3 is possible for CRDB too.
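To make those half-ideas slightly more concrete, here is a minimal Go sketch of the bucketing and normalize-by-cost approaches; the bucket names, thresholds, and queryStats fields are all invented for illustration and don't correspond to any existing CRDB code:

```go
package main

import (
	"fmt"
	"time"
)

// Cost buckets for finished queries. Names and thresholds are illustrative.
type costBucket int

const (
	bucketPointReadWrite costBucket = iota // single-row reads/writes by primary key
	bucketSmallScan                        // bounded scans of a few hundred rows
	bucketLargeScan                        // full table scans: deliberately no latency SLA
)

// queryStats is what we'd know about a query after it has run.
type queryStats struct {
	latency  time.Duration
	rowsRead int64
	fullScan bool
}

func bucketFor(q queryStats) costBucket {
	switch {
	case q.fullScan:
		return bucketLargeScan
	case q.rowsRead <= 1:
		return bucketPointReadWrite
	default:
		return bucketSmallScan
	}
}

// Per-bucket alerting thresholds; large scans are exempt on purpose.
var thresholds = map[costBucket]time.Duration{
	bucketPointReadWrite: 50 * time.Millisecond,
	bucketSmallScan:      500 * time.Millisecond,
}

func main() {
	q := queryStats{latency: 120 * time.Millisecond, rowsRead: 1}
	if limit, ok := thresholds[bucketFor(q)]; ok && q.latency > limit {
		fmt.Printf("latency %s exceeds bucket threshold %s\n", q.latency, limit)
	}

	// The alternative "normalize by expected cost" idea: divide latency by rows
	// read so a large scan and a point read become roughly comparable.
	rows := q.rowsRead
	if rows < 1 {
		rows = 1
	}
	fmt.Println("per-row latency:", q.latency/time.Duration(rows))
}
```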

CC @JuanLeon1 @bdarnell @tbg @andreimatei @jordanlewis @jreut @udnay @sumeerbhola for thoughts!

Additional context If anyone wants links to a CC specific design doc about monitoring as extra context, plz say so.

Jira issue: CRDB-10421

Epic CC-7894

joshimhoff avatar Oct 05 '21 19:10 joshimhoff

Some convo with @sumeerbhola, @JuanLeon1, and @tbg:

Sumeer says:

I just read https://github.com/cockroachdb/cockroach/issues/71169. This is typically hard, and I don’t know any storage/db service in google that managed to do this (but my knowledge there is incomplete). It would be worth looking around for documentation/literature on how other systems have done this.

Josh says:

i appreciate that you know this is hard

i will look for lit

but also Juan says monarch had this

granted not a SQL DB.. no transactionality … interface considerably less rich than SQL .. etc.

?

Sumeer says:

Most of the data in monarch had shared and well understood schemas, so monarch SRE could craft read-only queries against certain “user data” that were guaranteed to be simple. This won’t scale in the context of 100K+ different clusters, each with wildly different data. One could bucket query fingerprints based on past latency behavior (assuming that the system is healthy most of the time), and when that changes significantly without being correlated with high resource utilization, there is a potential problem. I am not sure how many false positives that will still create. Even if we don’t alert on this, such bucketing and being able to see timeseries graphs for some of the more common buckets would help both SRE if they get escalated to, and users for self-diagnosis.

Josh says:

ack i thought user traffic was truly what was being alerted on but this is sort of probing, just clever probing of user data. gotcha. ill talk to Juan too to better understand

ya +1. the project can be thought of as a sorta risky research project...

this is why i think we started off with the prober heavy alerting stack we have, which doesnt create false positives modulo when out of CPU (and that will be fixed by admission control)

Josh talks to Juan. Josh says:

ok i was wrong about what Juan said. he said SRE did two things with monarch:

  1. lots of probing, on totally synthetic data. like sqlprober roughly but with higher quantity and richer synthetic data
  2. alerting on user-traffic latencies

For 2 tho, it's a shared service and so there was a lot of workload in the alerting window. That is, no one user could change the p99 latency by changing the workload they sent. He also mentions they classified queries as small, medium, or large and then had separate alerts for each class

so for 2, they did not do any complex thing to disentangle bad workload from bad monarch. they just counted on having enough workload in the alerting window for it to be a non-issue

we could consider doing 2 for serverless. maybe it would work fine. maybe you need to solve only part of https://github.com/cockroachdb/cockroach/issues/71169, e.g. you can filter out latency waiting in SQL admission control queues and concurrency machinery but not actually properly cost a SQL query; to handle that you just rely on a large number of queries in the alerting window

if serverless is our focus, maybe we should focus on serverless….

joshimhoff avatar Oct 07 '21 13:10 joshimhoff

I think it's going to be quite hard to "exclude" the right latencies to make a clear signal here. SQL is such a complicated language that it makes it very difficult to know when a query is operating at expected latencies or not. The closest thing that we have to this idea is the coster, but the coster isn't designed to have units that are closely related to latency - costs will not nicely compare with latencies using some linear scale or anything like that.

From SQL's perspective, if "SQL processing" itself is ever slower than some constant factor of the number of input rows, that's probably unexpected, modulo disk spilling, which is expected and makes everything many times slower (but is desirable nonetheless).

I'd say that the most fruitful thing we could do would be to measure the KV request latencies and try to pay attention to whether they're slower than expected. KV is our simple interface. Maybe we ought to consider enhancing KV request latency metrics or tracing, rather than trying to do it for all of SQL, which would be extremely complex and error prone.

Basically what I'm trying to say - if KV requests are slow, SQL queries will be slow. If KV requests are fast, SQL queries will probably be fast, modulo things like disk spilling or inter-datacenter latencies.

jordanlewis avatar Oct 08 '21 04:10 jordanlewis

I'd say that the most fruitful thing we could do would be to measure the KV request latencies and try to pay attention to whether they're slower than expected. KV is our simple interface.

Ya I agree with this idea. I like both of these ideas as ways to work around the difficulty of costing SQL:

  1. Measure KV latencies instead.
  2. (as per https://github.com/cockroachdb/cockroach/issues/71169#issuecomment-937818376) Measure aggregate SQL latency on host clusters, excluding stuff like sitting in admission control queues & being blocked in KV concurrency queues. Don't cost SQL. Simply rely on having a lot of load on a host cluster implying a fairly stable p90 or p99 latency even without costing it.

I think I like 2 more than 1 tho 2 won't work for dedicated or on-prem obvi. Also 1 is prob needed to do 2 anyway.
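For idea 2, here is a rough sketch of what the alerting arithmetic could look like, assuming we already have "adjusted" latency samples (admission-queue and concurrency-wait time excluded) and relying purely on sample volume for stability; the 100ms SLO and all names are illustrative:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// p99 returns the 99th-percentile sample; plenty of samples are assumed, which
// is the whole premise of the host-cluster approach.
func p99(samples []time.Duration) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := (99 * len(sorted)) / 100
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	return sorted[idx]
}

func main() {
	// Stand-in for adjusted latencies observed on a host cluster over the
	// alerting window. With this much aggregate traffic, no single tenant's
	// workload choices should be able to move the p99 on their own.
	var window []time.Duration
	for i := 0; i < 10000; i++ {
		window = append(window, 4*time.Millisecond)
	}

	const slo = 100 * time.Millisecond
	if v := p99(window); v > slo {
		fmt.Println("page SRE: p99 adjusted SQL latency", v, "exceeds", slo)
	} else {
		fmt.Println("ok: p99 =", v)
	}
}
```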

SQL is such a complicated language that it makes it very difficult to know when a query is operating at expected latencies or not. The closest thing that we have to this idea is the coster, but the coster isn't designed to have units that are closely related to latency - costs will not nicely compare with latencies using some linear scale or anything like that.

Mhm. The coster does need to cost AHEAD of running the query, right? For the use case in this ticket, we can cost AFTER running the query. A bit easier, no? Still hard; I hear you.

joshimhoff avatar Oct 08 '21 13:10 joshimhoff

At the KV level, what I would like to achieve is to be able to account for the time spent in each "stage" via metrics. Anything that is recorded there can (for the most part) also be returned via structured trace payloads (for requests for which this is desired). What is not covered here are tight SLAs, but clearly you can't enforce an SLA when you don't even measure the underlying quantities. However, structured metadata are a natural stepping stone towards SLAs as they allow a programmatic construction of an expectation that can then be compared against the timings of the actual operation.
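As a strawman of that per-stage accounting (not the actual apparatus explored in #72092), here is a small Go sketch in which a request's time is attributed to named stages and only the CRDB-owned stages count toward an alertable service time; the stage list and the ownership split are assumptions:

```go
package main

import (
	"fmt"
	"time"
)

// Stages a KV request might pass through. Which stages count as CRDB's
// responsibility vs. workload-dependent is an assumption of this sketch.
type stage string

const (
	stageAdmission   stage = "admission-queue"  // workload-dependent wait
	stageConcurrency stage = "concurrency-wait" // workload-dependent wait (latches, locks)
	stageEvaluation  stage = "evaluation"       // CRDB's responsibility
	stageReplication stage = "replication"      // CRDB's responsibility
)

// timings accumulates time spent per stage; the same breakdown could feed
// per-stage metrics or be attached to a structured trace payload.
type timings map[stage]time.Duration

func (t timings) record(s stage, d time.Duration) { t[s] += d }

// serviceTime sums only the stages CRDB is on the hook for, leaving out the
// waits whose duration the workload itself controls.
func (t timings) serviceTime() time.Duration {
	return t[stageEvaluation] + t[stageReplication]
}

func main() {
	t := timings{}
	t.record(stageConcurrency, 800*time.Millisecond) // e.g. blocked behind a long-running txn
	t.record(stageEvaluation, 2*time.Millisecond)
	t.record(stageReplication, 5*time.Millisecond)

	total := t[stageAdmission] + t[stageConcurrency] + t.serviceTime()
	fmt.Println("total:", total, "service (alertable):", t.serviceTime())
}
```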

tbg avatar Oct 13 '21 10:10 tbg

Could some of the work done for the "Predictability of foreground performance" initiative help build out metrics for when foreground latency starts to tick up, as a data point to correlate with SQL queries?

udnay avatar Oct 13 '21 14:10 udnay

Returning to this ahead of 22.2 planning. The idea is to measure E2E latency and error rate while excluding latency and errors caused by workload choices. For example, latency in replication should be tracked, but latency waiting for latches should not be. Metrics of this kind would allow for some great SRE-facing alerts and would also make CRDB more observable.

The discussion here suggests doing this at the KV layer rather than the SQL layer, mainly because costing SQL queries is hard given the complexity of the interface. I buy this.

I think we should consider executing on this ticket (at the KV layer) during the 22.2 cycle. @tbg, I never saw https://github.com/cockroachdb/cockroach/pull/72092/files. It's so cool!!! Do you have a sense of scope for this ticket given your explorations?

CC @JeffSwenson as we have been talking about KV alerting in the context of a recent serverless incident.

joshimhoff avatar Feb 17 '22 14:02 joshimhoff

There are a few things that make good performance monitoring extra valuable to Serverless:

  1. We are mixing multiple workloads on a single storage cluster. We need a way to determine if a Serverless cluster is impacted by noisy neighbors.
  2. Having good performance metrics allows us to evaluate how well auto scaling is working.
  3. I expect many Serverless clusters will be nearly idle. We need to keep probing rates low to support large numbers of mostly idle clusters. Low probing rates increase the probability of probers missing issues that impact customer clusters.

JeffSwenson avatar Feb 17 '22 17:02 JeffSwenson

@andy-kimball makes a great point that we don't need to do this perfectly for it to be very useful, def for cloud & prob for on-prem.

  • Concretely, we can select a subset of SQL and/or kvclient queries with a known & mostly fixed latency profile. For example, we can export E2E SQL latency metrics re: the latency of single row reads and writes by primary key (maybe even with secondary indexes allowed). Or we can export E2E kvclient latency metrics re: the latency of calls to Get.
  • For both examples above, we measure latency while taking care to exclude time waiting on workload-dependent stuff, e.g. we exclude time waiting for the concurrency control machinery to sequence requests.
  • At the end of doing all this, for an important subset of SQL and KV, we have time-series metrics that track E2E latency degradations that are not expected regardless of the customer's workload choices.
  • We can expand the instrumented subset over time.

What do you think about this, @jordanlewis? This "choose an important subset" idea is an answer to your point that costing SQL queries is hard. It is hard to do in the general case! But it's not so bad, if we only "cost" SQL queries that are easy to cost.

I think it's going to be quite hard to "exclude" the right latencies to make a clear signal here. SQL is such a complicated language that it makes it very difficult to know when a query is operating at expected latencies or not. The closest thing that we have to this idea is the coster, but the coster isn't designed to have units that are closely related to latency - costs will not nicely compare with latencies using some linear scale or anything like that.

joshimhoff avatar Feb 23 '22 18:02 joshimhoff

One thing that comes up as a desire from time to time is the ability to have time series metrics labeled by app. This is cost prohibitive in general because people can put arbitrary app names, and you can't have a system that generates arbitrarily many metrics.

But what if we implemented this with an allowlist that permitted configuration of which apps to emit bucketed metrics for? Then you could allowlist queries emitted by systems that we understand, like the probers, and we'd have metrics that we already understand getting exported, but scoped just to those apps.

This doesn't accomplish the goal of automatically finding simple queries from user workloads, but maybe it would be helpful anyway?
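A minimal sketch of that allowlist gate, assuming some operator-configurable, comma-separated list of app names; the setting format, types, and method names here are invented:

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// appLatencyRecorder only keeps samples for allowlisted app names, so arbitrary
// customer-chosen app names cannot blow up the number of exported time series.
type appLatencyRecorder struct {
	allowlist map[string]struct{} // e.g. {"sqlprober", "kvprober"}
	observed  map[string][]time.Duration
}

func newRecorder(allowlistSetting string) *appLatencyRecorder {
	r := &appLatencyRecorder{
		allowlist: map[string]struct{}{},
		observed:  map[string][]time.Duration{},
	}
	for _, app := range strings.Split(allowlistSetting, ",") {
		if app = strings.TrimSpace(app); app != "" {
			r.allowlist[app] = struct{}{}
		}
	}
	return r
}

// observe drops samples from apps that are not on the allowlist.
func (r *appLatencyRecorder) observe(app string, latency time.Duration) {
	if _, ok := r.allowlist[app]; !ok {
		return
	}
	r.observed[app] = append(r.observed[app], latency)
}

func main() {
	r := newRecorder("sqlprober, kvprober")
	r.observe("sqlprober", 3*time.Millisecond)
	r.observe("some-random-customer-app", 900*time.Millisecond) // ignored
	fmt.Println(r.observed)
}
```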

I feel like getting metrics just for simple queries is possibly interesting. There are definitely some challenges with it, because some of the work that the system does happens before we know whether a query is "simple" or not (the whole optimizer has to run), and the optimizer needs to call the leasing system, which can do lookups and stuff. Then by the time we know the query is simple, we also know that the only real work to be done is starting in kvclient and below...

Speaking of leasing though, are we already collecting metrics on the leasing / namespace / descriptor table retrieval times? If not that would be a very low hanging fruit thing to add and I do think it'd help SRE understand whether the system is working well or not. cc @postamar @ajwerner

jordanlewis avatar Feb 23 '22 18:02 jordanlewis

But what if we implemented this with an allowlist that permitted configuration of which apps to emit bucketed metrics for?

I think it could be useful! OTOH I don't see how it would help with SRE alerting.

There are definitely some challenges with it, because some of the work that the system does happens before we know whether a query is "simple" or not (the whole optimizer has to run), and the optimizer needs to call the leasing system, which can do lookups and stuff. Then by the time we know the query is simple, we also know that the only real work to be done is starting in kvclient and below...

I'm not seeing where the challenge is yet. We can just always measure latency, but write the measured latency to metrics only if the query is simple (e.g. a single row read / write). We only need to know whether to write to metrics once the query finishes executing, not before. So it's fine that the optimizer needs to run first. Can you explain a bit more if I'm missing the point you are making?
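A tiny sketch of that flow, with invented names: the timer starts when the query arrives, but the decision to record is made only after execution, once the plan shape is known:

```go
package main

import (
	"fmt"
	"time"
)

// planInfo is what we'd know about the query once planning/execution is done.
type planInfo struct {
	isSingleRowReadOrWrite bool
}

// histogram is a stand-in for a real latency histogram metric.
type histogram struct{ samples []time.Duration }

func (h *histogram) RecordValue(d time.Duration) { h.samples = append(h.samples, d) }

var simpleQueryLatency = &histogram{}

// runQuery always measures; it only records the sample if the finished query
// turned out to be simple.
func runQuery(exec func() planInfo) {
	start := time.Now()
	plan := exec() // optimizer + execution; by now we know whether the query was simple
	elapsed := time.Since(start)
	if plan.isSingleRowReadOrWrite {
		simpleQueryLatency.RecordValue(elapsed)
	}
}

func main() {
	runQuery(func() planInfo {
		time.Sleep(2 * time.Millisecond) // stand-in for real execution
		return planInfo{isSingleRowReadOrWrite: true}
	})
	fmt.Println("recorded samples:", len(simpleQueryLatency.samples))
}
```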

Speaking of leasing though, are we already collecting metrics on the leasing / namespace / descriptor table retrieval times? If not that would be a very low hanging fruit thing to add and I do think it'd help SRE understand whether the system is working well or not.

Ya good idea. But I really think we should pay the cost to get some nice E2E measurement of system performance that mostly doesn't depend on workload choices made by the customer. It's a really powerful thing to have in my experience.

joshimhoff avatar Feb 23 '22 21:02 joshimhoff

I'm not seeing where the challenge is yet. We can just always measure latency, but write the measured latency to metrics only if the query is simple (e.g. a single row read / write). We only need to know whether to write to metrics once the query finishes executing, not before. So it's fine that the optimizer needs to run first. Can you explain a bit more if I'm missing the point you are making?

Oh, yeah, that's a good point! I was confused and misunderstanding something. I like this idea.

We could have a latency metric that gets recorded for user-emitted queries under a certain cost. It seems reasonable that we could define an SLO on that. I don't know how we'd identify what that cost would be. @rytaft what do you think about this idea?

jordanlewis avatar Feb 23 '22 22:02 jordanlewis

I'll summarize for @rytaft since this ticket is long. We want latency metrics that track what our customers are experiencing E2E, as E2E behavior is what users actually care about (KV is an implementation detail of CRDB). In a cloud context, we want those metrics to indicate problems when CRDB is broken and/or needs CRL eng including SRE to take action but NOT when a customer simply needs to change their workload.

Here is a strawman metric that has these properties:

  1. If SQL query is not single row read or write, do nothing.
  2. If SQL query is a single row read or write, measure latency.
    2a. When measuring latency, exclude sources of latency that are expected to be high depending on workload choices, e.g. exclude latency waiting on the concurrency manager to sequence requests (since contention is (often) a function of workload) and exclude latency waiting in admission control queues (since hardware overload is (often) a function of workload).

Via something like the above, we have a measure of single row read & write latency that should be stable, regardless of what workload a customer throws at a cluster.
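Sketched in Go (the queryOutcome struct and the particular waits subtracted are illustrative assumptions, not existing CRDB structures):

```go
package main

import (
	"fmt"
	"time"
)

// queryOutcome is what we'd know about a finished query for this strawman.
type queryOutcome struct {
	singleRow     bool
	totalLatency  time.Duration
	lockWait      time.Duration // time blocked in the concurrency manager
	admissionWait time.Duration // time queued in admission control
}

// adjustedLatency returns the portion of latency CRDB is accountable for,
// and whether a sample should be recorded at all.
func adjustedLatency(q queryOutcome) (time.Duration, bool) {
	if !q.singleRow {
		return 0, false // step 1: not a single-row read/write, do nothing
	}
	// step 2/2a: subtract the workload-driven waits before recording
	return q.totalLatency - q.lockWait - q.admissionWait, true
}

func main() {
	q := queryOutcome{
		singleRow:     true,
		totalLatency:  950 * time.Millisecond,
		lockWait:      900 * time.Millisecond, // contention caused by the workload
		admissionWait: 20 * time.Millisecond,
	}
	if d, ok := adjustedLatency(q); ok {
		fmt.Println("record to metric:", d) // 30ms: stable regardless of contention
	}
}
```

With numbers like the ones above, a 900ms contention stall still records only a ~30ms sample, which is the property we want for paging.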

A quick footnote!

since contention is (often) a function of workload
since hardware overload is (often) a function of workload

This is an assumption! Bugs in CRDB can lead to contention and hardware overload. Automatic alerting that faces SREs must make such assumptions. The idea is that we don't track sources of latency that most often are caused by workload choices rather than CRDB misbehavior, as we need our alerting false positive rate to be very very low.

joshimhoff avatar Feb 23 '22 22:02 joshimhoff

Via something like the above, we have a measure of single row read & write latency that should be stable, regardless of what workload a customer throws at a cluster.

Not really; for example, if the customer has a long-running txn open, that's really their fault, but it can completely block single-row operations. So we need to measure only some components of these single-row operations, notably excluding contention time, which I started exploring a building block for in https://github.com/cockroachdb/cockroach/pull/72092.

tbg avatar Feb 24 '22 09:02 tbg

Apologies, you point out that exact thing, not sure how I skimmed past that. Anyway, I like the idea of applying this to a suitable subset of customer traffic since that's what the customer truly cares about. It also seems to decompose cleanly into a SQL and KV component. Within KV, with something like https://github.com/cockroachdb/cockroach/pull/72092 in place, we could apply your strategy to all non-range operations (or even range operations that didn't exceed a certain cost, perhaps measured in evaluation cost such as iterator movements, etc), so if an E2E measurement reports high latency, it could quickly be pinned to either the KV or SQL side.

tbg avatar Feb 24 '22 09:02 tbg

On the SQL side, it's actually not that hard to know if a query is "simple" (e.g., affecting a single row). We already do such a thing to determine whether a bounded staleness read is allowed: https://github.com/cockroachdb/cockroach/blob/c518952e739509c8cfe8e3a93c2b6ffb3002b866/pkg/sql/opt/exec/execbuilder/relational.go#L601-L623

I would not recommend using the optimizer cost for this. It is not designed to be meaningful outside the context of a single query.

Assuming the query is "simple" according to our definition, we could potentially enable the use of @tbg's metric.Timing apparatus throughout both SQL and KV. Is that what you were thinking as well, @tbg?

rytaft avatar Feb 24 '22 13:02 rytaft

we could potentially enable the use of @tbg's metric.Timing apparatus throughout both SQL and KV

It's been a while since I wrote the Timing thing down, so I don't want to advocate too hard for it, but yes, hopefully there is something that we can share between layers.

tbg avatar Feb 24 '22 14:02 tbg

Anyone have a rough idea of how much eng work this one is? Obviously scope is not decided yet, and so we have some flexibility. Just looking for a rough rough idea.

joshimhoff avatar Feb 24 '22 15:02 joshimhoff

Probably a few weeks of work for one engineer on SQL Queries + whatever is needed from kv. (This is assuming that we can reuse the Timing thing in SQL)

rytaft avatar Feb 24 '22 16:02 rytaft

cc @vy-ton

yuzefovich avatar Mar 22 '22 18:03 yuzefovich

I filed https://github.com/cockroachdb/cockroach/issues/82200 to cover the KV portion of this work (or at least get us to a place where it's not a big lift to integrate with whatever is happening "above KV" to do this issue). If my understanding is correct (cc @nkodali), KV Observability is going to work on #82200 this cycle (and I will advise).

I'll add that to me it is pretty unclear what the commitment to this issue (E2E latency) is. #82200 is independently valuable for the KV area, so we'd like to do it regardless. I don't see anything scheduled in SQL, so my expectation is that this issue will sit dormant but hopefully #82200 can independently be of use for the SREs.

tbg avatar Jun 01 '22 07:06 tbg

A much smaller work item related to this ticket is tracked at https://github.com/cockroachdb/cockroach/issues/98305. I hope to implement the linked ticket in the next six months as part of work on disaggregated storage. Understanding what read latencies we achieve on CC will help us deliver that project.

joshimhoff avatar Mar 09 '23 15:03 joshimhoff