XTDB Monitoring Dashboard
Wider Context
We want to make the overarching story of monitoring XTDB a better experience for users. Based on observed prior art and what we currently have, I believe we should have a monitoring stack split into a few distinct parts:
- An overarching cluster operation monitoring dashboard
- Used for health checking across the cluster.
- The first place a user should go to understand their deployment.
- Similar to those observed for other databases, with the sort of info a user would expect.
- The "XTDB Debugging" dashboard
- Sort of like our existing one - which we've been using when running auctionmark.
- Less of a "cluster overview", more of a per node thing for finding useful debugging metrics/monitors.
- We could include some JVM monitors in that second dashboard, or just recommend a separate (pre-existing) JVM monitoring dashboard.
- Could also copy the contents of a JVM dashboard into our debugging dashboard, but it might be better kept as a separate dashboard.
This card concerns adding the top-level XTDB monitoring dashboard, and what it will include.
XTDB Cluster Monitoring Dashboard details
A board with ideas captured for this monitoring dashboard based on prior art - noting what expectations I would have for a "monitoring dashboard" based on some of the others we've looked at:
- Should have graphs/stats that make sense cluster wide.
- With an ability to filter on a per node basis
- In terms of what we've seen from other similar dashboards
- Count of members in the cluster
- Transaction Lag.
- Query / Transaction (read/write) throughput (Queries/Transactions Per Second)
- Summed over cluster, unless filtered by node.
- Query / Transaction (read/write) latency (average, 99th percentile)
- Summed over cluster, unless filtered by node.
- CPU and memory usage
- JVM should provide most of this for us.
- Disk I/O usage
- Ops/second, time spent on reads/writes
- One we'll need to make sense of - for XTDB, we're probably far more interested in "what's on object storage?" than in local disk itself.
- Wouldn't really want this to be cloud specific/require outside datasources, perhaps gathered by the nodes somehow.
- Connection counts
- Very common to other monitoring dashboards - how do we make sense of active connections? PGwire/Jetty perhaps?
- Error info:
- Seen a few dashboards with a quick view over some errors - seems fairly immediately useful.
- What error logs have we seen?
- Probably based on logs? - Would need to gather error logs somehow? (Perhaps via Loki?)
- Error counts
- Query failures/errors?
- Transaction failures/errors/timeouts?
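The panel ideas above could be expressed as Grafana panels backed by PromQL along these lines. Note that the metric names (`query_duration_seconds`, `query_failed_total`) and the `node` label are assumptions for illustration only - the real names depend on what the nodes end up exporting:

```promql
# Cluster-wide query throughput (QPS), filterable per node via a dashboard variable
sum(rate(query_count_total{node=~"$node"}[5m]))

# 99th-percentile query latency, assuming a timer exporting histogram buckets
histogram_quantile(0.99, sum by (le) (rate(query_duration_seconds_bucket{node=~"$node"}[5m])))

# Error rate: query failures per second across the cluster
sum(rate(query_failed_total{node=~"$node"}[5m]))
```

A Grafana template variable (`$node`, defaulting to all nodes) would give the "summed over cluster, unless filtered by node" behaviour described above.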
Extra Meters/Gauges required?
Meters we would need to add to support all of the pieces in here:
- Tx lag / tx id monitoring.
- "Query_failed" and "Transaction_failed", essentially, whenever either fails for whatever reason, can mark an error count.
- "Query_Timeout" and "Transaction_Timeout" same as the above but for timeouts.
- Could potentially be part of the "failed" count, with a label marked against it, i.e. "tx_failed{type: timeout}".
- Do queries time out?
- Buffer pool meters - these form our "Disk Usage" monitoring:
- "Bytes_Written" / "Bytes_Read" (from putBuffer/getBuffer) for getting disk usage rates.
- "GetBuffer" timer for understanding "get time from blob storage".
- "PutBuffer" timer (also for multipart) for understanding "put time from blob storage"
- Some kind of "Compaction Job timer"
- Can use this to get an idea of how long compaction jobs take.
- Can also use this to get a count of compaction jobs.
- We realize it's a slightly lower-level detail, though we've seen similar on other dashboards, and compaction is a pretty large source of memory usage by XTDB.
- Cache hit/miss metrics (want these split by cache)
- Might be provided these by Caffeine itself.
- Arguable whether these should be in the monitoring board, but they ARE somewhat important for performance and show up in a lot of other examples.
- Active/Total Connections metrics.
- May be a bit bespoke - we'd mark these on Jetty/pgwire connections.
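If the proposed meters land with roughly these names (hypothetical - nothing here is exported yet), the corresponding dashboard queries might look like:

```promql
# Timeouts as a labelled subset of the overall failure counter,
# per the tx_failed{type: timeout} idea above
sum(rate(tx_failed_total{type="timeout"}[5m]))

# Disk/object-storage I/O rates from the proposed buffer pool counters
sum(rate(bytes_written_total[5m]))
sum(rate(bytes_read_total[5m]))

# Cache hit ratio, split by cache (assumes per-cache hit/request counters)
sum by (cache) (rate(cache_hits_total[5m]))
  / sum by (cache) (rate(cache_requests_total[5m]))
```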
Previously done work
For each of the above, particularly for Queries, Transactions, JVM metrics and overall health, we will want to look at prior art/pre-existing dashboards for inspiration on what people want to see/understand when running their database application.
- [x] Look at prior art for monitoring databases.
- [x] #3871
- [x] Gathering application logs to Grafana? (Perhaps using Loki) - @Akeboshiwind
- [x] #3882
TODOs:
- [x] #3896
- [ ] #3897
- Adding/supporting new gauges & visualizations to the cluster monitoring dashboard:
- [x] #3898
- [x] #3899
- [x] #3900
- [x] #3901
- [ ] Add monitoring of Cache hit/miss metrics, (split by cache).
- [ ] Add monitoring of the Compaction process
We can make a separate card for this, but as well as our monitoring dashboard I believe we should have a separate debugging dashboard (based on what we currently have). Previous thinking on contents there would include:
- Queries (Read)
- (timers, counts, rates, etc)
- On top of this, some ability to find/save/examine what the slow running queries are
- Transactions (write)
- (timers, counts, rates, etc)
- JVM metrics.
- Direct memory usage graphs.
- Disk usage metrics (for all non memory-only bufferpools)
- Buffer pool ops.
- Indexer ops.
- Compactor ops.
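For the debugging dashboard sketch above, a couple of candidate queries. The JVM buffer pool gauge is the one Micrometer's standard JVM memory binder exports; the compaction timer name is the hypothetical one proposed earlier:

```promql
# JVM direct memory usage per node
jvm_buffer_memory_used_bytes{id="direct", node=~"$node"}

# Compactor: job rate, and average job duration derived from the timer's sum/count
sum(rate(compaction_job_seconds_count[5m]))
sum(rate(compaction_job_seconds_sum[5m])) / sum(rate(compaction_job_seconds_count[5m]))
```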
Just noting for later reference: Micrometer has a number of other useful MeterBinder implementations we could use: https://www.javadoc.io/doc/io.micrometer/micrometer-core/1.1.0/io/micrometer/core/instrument/binder/MeterBinder.html
Of particular interest:
- CaffeineCacheMetrics - for cache hits/misses, other metrics?
- DiskSpaceMetrics - potentially useful for measuring local disk cache usage?
- JettyStatisticsMetrics - for measuring HTTP connections et al? Probably better with something bespoke, since "connections" are shared between HTTP and PGWire.