introduction of the key visualizer, for internal use only.

Open zachlite opened this issue 3 years ago • 1 comments

This PR introduces the Key Visualizer 🔥 🪳

[Screenshots: Screen Shot 2022-09-20 at 5.27.05 PM, Screen Shot 2022-09-20 at 5.52.34 PM]

The Key Visualizer is designed for multi-tenancy, but is currently only implemented for use by the system tenant. Enabling usage by secondary tenants is on the KV-Observability team's roadmap for 23.1. Read more about current limitations below. In the interim, we'd like to make it available to internal teams for testing and evaluation.

Usage and configuration

The Key Visualizer is disabled by default. It can be enabled via:

SET CLUSTER SETTING keyvisualizer.job.enabled = true;

The default sample period is 10 seconds, and the sample retention period is 2 weeks. The sample period can be configured via:

SET CLUSTER SETTING keyvisualizer.job.sample_interval = <DURATION>;

The default sample period was kept short for faster debugging cycles. The default we settle on in the future is subject to feedback. If the default sample period is kept short, we'll need to lower the retention period.
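To make the trade-off between sample period and retention concrete, here is a rough back-of-the-envelope sketch in Go. The helper `retainedSamples` and the constant names are hypothetical and not part of this PR; only the 10-second and 2-week values come from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// Defaults quoted in the PR description.
const (
	defaultSamplePeriod = 10 * time.Second
	defaultRetention    = 14 * 24 * time.Hour // 2 weeks
)

// retainedSamples returns how many samples fit in the retention window at
// the given sample period. Hypothetical helper, for illustration only.
func retainedSamples(period, retention time.Duration) int64 {
	if period <= 0 {
		return 0
	}
	return int64(retention / period)
}

func main() {
	fmt.Println(retainedSamples(defaultSamplePeriod, defaultRetention)) // 120960
	fmt.Println(retainedSamples(time.Minute, defaultRetention))         // 20160
}
```

At the current defaults that is 120,960 samples per retention window, versus 20,160 at a one-minute period, which is why a short default period would push the retention period down.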

Attention needed

Can I have some eyes on the following bits of this PR:

What should happen if there's a rangefeed error? Referring to this, I'm not sure how to handle this error. An error means, for whatever reason, tenants can't successfully communicate their desired boundaries to the collector. Should the collector turn itself off? Should the collector try to restart the rangefeed?
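As a starting point for discussion, the "restart the rangefeed" option could look roughly like the sketch below: restart with capped exponential backoff, and give up after a bounded number of attempts so the caller can decide whether to turn the collector off. It uses only the Go standard library. `runWithRestart`, `demoRestart`, and the `run` callback are hypothetical names, and the simulated feed stands in for whatever actually establishes the rangefeed in this PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWithRestart calls run until it returns nil, ctx is done, or maxAttempts
// calls have failed. The wait between attempts doubles, capped at maxBackoff.
// It returns the number of calls made and the last error.
func runWithRestart(
	ctx context.Context,
	run func(context.Context) error, // hypothetical: blocks while the feed is healthy
	initial, maxBackoff time.Duration,
	maxAttempts int,
) (int, error) {
	if maxAttempts < 1 {
		return 0, errors.New("maxAttempts must be at least 1")
	}
	backoff := initial
	for attempt := 1; ; attempt++ {
		err := run(ctx)
		if err == nil {
			return attempt, nil
		}
		if attempt == maxAttempts {
			// Give up; the caller decides whether to turn the collector off.
			return attempt, err
		}
		select {
		case <-ctx.Done():
			return attempt, ctx.Err()
		case <-time.After(backoff):
		}
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

// demoRestart simulates a feed that fails twice and then exits cleanly.
func demoRestart() (int, error) {
	failuresLeft := 2
	feed := func(context.Context) error {
		if failuresLeft > 0 {
			failuresLeft--
			return errors.New("simulated rangefeed error")
		}
		return nil
	}
	return runWithRestart(context.Background(), feed, time.Millisecond, 10*time.Millisecond, 5)
}

func main() {
	attempts, err := demoRestart()
	fmt.Println(attempts, err) // 3 <nil>
}
```

Bounding the attempts covers both questions at once: transient errors are retried, and a persistent failure surfaces to the caller instead of looping forever.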

Did I write sane SQL queries?

  • Referring to this package, here's where the tenant interacts with its system tables to read and update samples.
  • Here are the table definitions

Does the collector need a mutex?

  • Referring to this, where the collector serves samples when requested.
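Whether a mutex is needed comes down to whether the sample state is touched from more than one goroutine, for example one goroutine recording while an RPC handler serves. If it is, a minimal guarded version might look like the sketch below. The `collector` and `sample` types here are hypothetical stand-ins, not the types in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// sample is a hypothetical stand-in for one collected sample.
type sample struct {
	timestampNanos int64
	requests       int64
}

// collector guards its sample buffer with a mutex so that recording and
// serving can happen on different goroutines.
type collector struct {
	mu      sync.Mutex
	samples []sample
}

// record appends a sample under the lock.
func (c *collector) record(s sample) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.samples = append(c.samples, s)
}

// flush hands the buffered samples to the caller and resets the buffer.
// The returned slice is no longer referenced by the collector.
func (c *collector) flush() []sample {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := c.samples
	c.samples = nil
	return out
}

// demoCollector records from 100 goroutines and returns how many samples
// a subsequent flush sees.
func demoCollector() int {
	c := &collector{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.record(sample{timestampNanos: int64(i), requests: 1})
		}(i)
	}
	wg.Wait()
	return len(c.flush())
}

func main() {
	fmt.Println(demoCollector()) // 100
}
```

If the collector's state is only ever accessed from a single goroutine, the lock is unnecessary; running the tests with the race detector (`go test -race`) is one way to check.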

Known issues, and improvements for 23.1

  • Support for secondary tenants, as discussed.

  • Downsampling strategy. The current downsampling strategy is implemented here. It's not bad, but it can be better. There's currently no guarantee of boundary stability between samples, and there are other heuristics to explore to prioritize preserving resolution in "important" parts of the keyspace.

  • Improved fault tolerance in the collector. Requests from tenants may fail, and secondary tenants can disappear altogether.

  • UI improvements

    • Zoom improvements
    • Prevent overlapping of X and Y axis labels
    • The time scrubbing function is buggy. It works for all intervals except the 10-minute interval.
  • No handling of cluster version changes
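On the downsampling item above, one heuristic for preserving resolution in "important" parts of the keyspace is to repeatedly merge the adjacent pair of buckets with the lowest combined request count until the bucket budget is met. The sketch below illustrates that idea only. It is not the strategy implemented in this PR, it does nothing for boundary stability between samples, and the `bucket` type and function names are hypothetical.

```go
package main

import "fmt"

// bucket is a hypothetical span [start, end) with its request count.
type bucket struct {
	start, end string
	requests   int64
}

// downsample merges the coldest adjacent pair of buckets until at most max
// buckets remain. Input buckets are assumed sorted and contiguous.
// This naive version is O(n^2); it is meant to show the heuristic only.
func downsample(buckets []bucket, max int) []bucket {
	if max < 1 {
		max = 1
	}
	out := append([]bucket(nil), buckets...) // copy; leave the input untouched
	for len(out) > max {
		best := 0
		for i := 1; i < len(out)-1; i++ {
			if out[i].requests+out[i+1].requests < out[best].requests+out[best+1].requests {
				best = i
			}
		}
		out[best] = bucket{
			start:    out[best].start,
			end:      out[best+1].end,
			requests: out[best].requests + out[best+1].requests,
		}
		out = append(out[:best+1], out[best+2:]...)
	}
	return out
}

// demoDownsample reduces five buckets with one hot span to three.
func demoDownsample() []bucket {
	in := []bucket{
		{"a", "b", 1}, {"b", "c", 2}, {"c", "d", 100}, {"d", "e", 3}, {"e", "f", 1},
	}
	return downsample(in, 3)
}

func main() {
	for _, b := range demoDownsample() {
		fmt.Printf("[%s, %s): %d\n", b.start, b.end, b.requests)
	}
}
```

In this example the hot span `[c, d)` keeps its own bucket while the cold neighbors on either side are merged, and the total request count is unchanged.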

zachlite, Sep 21 '22 15:09

This change is Reviewable

cockroach-teamcity, Sep 21 '22 15:09

bors r+

zachlite, Jan 10 '23 23:01

bors single on

rail, Jan 11 '23 16:01

bors r-

rail, Jan 11 '23 16:01

Canceled.

craig[bot], Jan 11 '23 16:01

bors r+

zachlite, Jan 24 '23 20:01

Build succeeded:

craig[bot], Jan 24 '23 22:01