cockroach
cockroach copied to clipboard
introduction of the key visualizer, for internal use only.
This PR introduces the Key Visualizer π π₯ πͺ³
The Key Visualizer is designed for multi-tenancy, but is currently only implemented for use by the system tenant. Enabling usage by secondary tenants is on the KV-Observability team's roadmap for 23.1. Read more about current limitations below. In the interim, we'd like to make it available to internal teams for testing and evaluation.
Usage and configuration
The Key Visualizer is disabled by default. It can be enabled via:
SET CLUSTER SETTING keyvisualizer.job.enabled = true;
The default sample period is 10 seconds, and the sample retention period is 2 weeks. The sample period can be configured via:
SET CLUSTER SETTING keyvisualizer.job.sample_interval = <DURATION>;
The default sample period was kept short for faster debugging cycles. The default we settle on in the future is subject to feedback. If the default sample period is kept short, we'll need to lower the retention period.
Attention needed
Can I have some eyes on the following bits of this PR:
What should happen if there's a rangefeed error? Referring to this, I'm not sure how to handle this error. An error means, for whatever reason, tenants can't successfully communicate their desired boundaries to the collector. Should the collector turn itself off? Should the collector try to restart the rangefeed?
Did I write sane SQL queries?
- Referring to this package, here's where the tenant interacts with its system tables to read and update samples.
- Here are the table definitions
Does the collector need a mutex?
- Referring to this, where the collector serves samples when requested.
Known issues, and improvements for 23.1
-
Support for secondary tenants, as discussed.
-
Downsampling strategy. The current downsampling strategy is implemented here. It's not bad, but it can be better. There's currently no guarantee of boundary stability between samples, and there are other heuristics to explore to prioritize preserving resolution in "important" parts of the keyspace.
-
Improved fault tolerance in the collector. Requests from tenants may fail, and secondary tenants can disappear altogether.
-
UI improvements
- Zoom improvements
- Prevent overlapping of X and Y axis labels
- The time scrubbing function is buggy. It works for all intervals except the 10 minute interval.
-
No handling of cluster version changes
bors r+
bors single on
bors r-
Canceled.
bors r+