XTDB Monitoring Dashboard
Wider Context
We want to make the overarching story of monitoring XTDB a better experience for users. Based on observed prior art and what we currently have, I believe we should have a monitoring stack split into a few distinct parts:
- An overarching cluster operation monitoring dashboard
- Used for health checking across the cluster.
- The first place a user should go to understand their deployment.
- Similar to those observed for other databases, with the sort of info a user would expect.
- The "XTDB Debugging" dashboard
- Sort of like our existing one - which we've been using when running auctionmark.
- Less of a "cluster overview", more of a per node thing for finding useful debugging metrics/monitors.
- We could include some JVM monitors in that second dashboard, or just recommend a separate (pre-existing) JVM monitoring dashboard.
- Could also copy the contents of a JVM dashboard into our debugging dashboard, but it might be better kept as a separate dashboard.
This card concerns adding the top-level XTDB monitoring dashboard, and what it will include.
XTDB Cluster Monitoring Dashboard details
A board with ideas captured for this monitoring dashboard based on prior art - noting what expectations I would have for a "monitoring dashboard" based on some of the others we've looked at:
- Should have graphs/stats that make sense cluster wide.
- With an ability to filter on a per node basis
- In terms of what we've seen from other similar dashboards
- Count of members in the cluster
- Transaction Lag.
- Query / Transaction (read/write) throughput (Queries/Transactions Per Second)
- Summed over cluster, unless filtered by node.
- Query / Transaction (read/write) latency (average, 99th percentile)
- Summed over cluster, unless filtered by node.
- CPU and memory usage
- JVM should provide most of this for us.
- Disk I/O usage
- Ops/second, time spent on reads/writes
- One we'll need to make sense of - for XTDB, we're probably far more interested in "what's on object storage?" than in local disk itself.
- Wouldn't really want this to be cloud specific/require outside datasources, perhaps gathered by the nodes somehow.
- Connection counts
- Very common to other monitoring dashboards - how do we make sense of active connections? PGwire/Jetty perhaps?
- Error info:
- Seen a few dashboards with a quick view over some errors - seems fairly immediately useful.
- What error logs have we seen?
- Probably based on logs? - Would need to gather error logs somehow? (Perhaps via Loki?)
- Error counts
- Query failures/errors?
- Transaction failures/errors/timeouts?
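The panel ideas above could be expressed as Grafana panels backed by PromQL along these lines. Note that the metric names (`query_duration_seconds`, `query_failed_total`) and the `node` label are assumptions for illustration only - the real names depend on what the nodes end up exporting:

```promql
# Cluster-wide query throughput (QPS), filterable per node via a dashboard variable
sum(rate(query_count_total{node=~"$node"}[5m]))

# 99th-percentile query latency, assuming a timer exporting histogram buckets
histogram_quantile(0.99, sum by (le) (rate(query_duration_seconds_bucket{node=~"$node"}[5m])))

# Error rate: query failures per second across the cluster
sum(rate(query_failed_total{node=~"$node"}[5m]))
```

A Grafana template variable (`$node`, defaulting to all nodes) would give the "summed over cluster, unless filtered by node" behaviour described above.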
Extra Meters/Gauges required?
Meters we would need to add to support all of the pieces in here:
- Tx lag / tx id monitoring.
- "Query_failed" and "Transaction_failed", essentially, whenever either fails for whatever reason, can mark an error count.
- "Query_Timeout" and "Transaction_Timeout" same as the above but for timeouts.
- Could potentially be part of the "failed" count, with a label marked against it, i.e. "tx_failed{type: timeout}".
- Do queries time out?
- Buffer pool meters - these form our "Disk Usage" monitoring:
- "Bytes_Written" / "Bytes_Read" (from putBuffer/getBuffer) for getting disk usage rates.
- "GetBuffer" timer for understanding "get time from blob storage".
- "PutBuffer" timer (also for multipart) for understanding "put time from blob storage"
- Some kind of "Compaction Job timer"
- Can use this to get an idea of how long compaction jobs take.
- Can also use this to get a count of compaction jobs.
- We realize it's a slightly lower-level detail, though we've seen similar on other dashboards, and compaction is a pretty large source of memory usage by XTDB.
- Cache hit/miss metrics (want these split by cache)
- Might be provided these by Caffeine itself.
- Arguable whether these should be in the monitoring board, but they ARE somewhat important for performance and show up in a lot of other examples.
- Active/Total Connections metrics.
- May be a bit bespoke - we'd mark these on Jetty/pgwire connections.
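If the proposed meters land with roughly these names (hypothetical - nothing here is exported yet), the corresponding dashboard queries might look like:

```promql
# Timeouts as a labelled subset of the overall failure counter,
# per the tx_failed{type: timeout} idea above
sum(rate(tx_failed_total{type="timeout"}[5m]))

# Disk/object-storage I/O rates from the proposed buffer pool counters
sum(rate(bytes_written_total[5m]))
sum(rate(bytes_read_total[5m]))

# Cache hit ratio, split by cache (assumes per-cache hit/request counters)
sum by (cache) (rate(cache_hits_total[5m]))
  / sum by (cache) (rate(cache_requests_total[5m]))
```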
Previously done work
For each of the above, particularly for Queries, Transactions, JVM metrics and overall health, we will want to look at prior art/pre-existing dashboards for inspiration on what people want to see/understand when running their database application.
- [x] Look at prior art for monitoring databases.
- [x] #3871
- [x] Gathering application logs to Grafana? (Perhaps using Loki) - @Akeboshiwind
- [x] #3882
TODOs:
- [x] #3896
- [ ] #3897
- Adding/supporting new gauges & visualizations to the cluster monitoring dashboard:
- [x] #3898
- [x] #3899
- [x] #3900
- [x] #3901
- [ ] Add monitoring of Cache hit/miss metrics, (split by cache).
- [ ] Add monitoring of the Compaction process
We can make a separate card for this, but as well as our monitoring dashboard I believe we should have a separate debugging dashboard (based on what we currently have). Previous thinking on contents there would include:
- Queries (Read)
- (timers, counts, rates, etc)
- On top of this, some ability to find/save/examine what the slow running queries are
- Transactions (write)
- (timers, counts, rates, etc)
- JVM metrics.
- Direct memory usage graphs.
- Disk usage metrics (for all non memory-only bufferpools)
- Buffer pool ops.
- Indexer ops.
- Compactor ops.
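For the debugging dashboard sketch above, a couple of candidate queries. The JVM buffer pool gauge is the one Micrometer's standard JVM memory binder exports; the compaction timer name is the hypothetical one proposed earlier:

```promql
# JVM direct memory usage per node
jvm_buffer_memory_used_bytes{id="direct", node=~"$node"}

# Compactor: job rate, and average job duration derived from the timer's sum/count
sum(rate(compaction_job_seconds_count[5m]))
sum(rate(compaction_job_seconds_sum[5m])) / sum(rate(compaction_job_seconds_count[5m]))
```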
Just noting for later reference: Micrometer has a number of other useful MeterBinder implementations we could use: https://www.javadoc.io/doc/io.micrometer/micrometer-core/1.1.0/io/micrometer/core/instrument/binder/MeterBinder.html
Of particular interest:
- CaffeineCacheMetrics - for cache hits/misses, other metrics?
- DiskSpaceMetrics - potentially useful for measuring local disk cache usage?
- JettyStatisticsMetrics - for measuring HTTP connections et al? Probably better with something bespoke, since "connections" are shared between HTTP and PGWire.