
XTDB Monitoring Dashboard

Open danmason opened this issue 1 year ago • 2 comments

Wider Context

We want to make the overarching story of monitoring XTDB a better experience for users. Based on observed prior art and what we currently have, I believe we should have a monitoring stack split into a few distinct parts:

  • An overarching cluster operation monitoring dashboard
    • Used for health checking across the cluster.
    • The first place a user should go to understand their deployment.
    • Similar to dashboards observed for other databases, with the sort of info a user would expect.
  • The "XTDB Debugging" dashboard
    • Sort of like our existing one - which we've been using when running auctionmark.
    • Less of a "cluster overview", more of a per node thing for finding useful debugging metrics/monitors.
  • We could include some JVM monitors in that second dashboard, or just recommend a separate (pre-existing) JVM monitoring dashboard.
    • Could also include/copy contents of one into our debugging dashboard, but might be better to have as a separate dashboard.

This card concerns us adding in the top level XTDB monitoring dashboard, and what that will include.

XTDB Cluster Monitoring Dashboard details

A board with ideas captured for this monitoring dashboard, based on prior art (see attached image).

Noting what expectations I would have for a "Monitoring dashboard" based on some of the others we've looked at:

  • Should have graphs/stats that make sense cluster wide.
    • With an ability to filter on a per node basis
  • In terms of what we've seen from other similar dashboards
    • Count of members in the cluster
    • Transaction Lag.
    • Query / Transaction (read/write) throughput (Queries/Transactions Per Second)
      • Summed over cluster, unless filtered by node.
    • Query /  Transaction (read/write) latency (average, 99th percentile)
      • Summed over cluster, unless filtered by node.
    • CPU and memory usage
      • JVM should provide most of this for us.
    • Disk I/O usage
      • Ops/second, read/write rates.
      • One we'll need to make sense of - for XTDB, we're probably far more interested in "what's on object storage?" than local disk itself.
      • Wouldn't really want this to be cloud specific/require outside datasources, perhaps gathered by the nodes somehow.
    • Connection counts
      • Very common to other monitoring dashboards - how do we make sense of active connections? PGwire/Jetty perhaps?
    • Error info:
      • Seen a few dashboards with a quick view over some errors - seems fairly immediately useful.
      • What error logs have we seen?
      • Probably based on logs? - Would need to gather error logs somehow? (Perhaps via Loki?)
      • Error counts
        • Query failures/errors?
        • Transaction failures/errors/timeouts?
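The "summed over cluster, unless filtered by node" behaviour above could be sketched roughly as follows. This is a hypothetical Python sketch of the aggregation semantics only (in practice this would be something like a `sum by (node)` query in Grafana; the function and sample names here are illustrative):

```python
# Hypothetical sketch: aggregate per-node throughput readings cluster-wide,
# with an optional per-node filter (illustrative names, not XTDB APIs).

def throughput(samples, node=None):
    """samples: dict of node name -> queries-per-second reading.
    Returns the cluster-wide sum, or a single node's value if filtered."""
    if node is not None:
        return samples.get(node, 0.0)
    return sum(samples.values())

qps = {"node-1": 120.0, "node-2": 95.0, "node-3": 110.0}
print(throughput(qps))            # cluster-wide sum: 325.0
print(throughput(qps, "node-2"))  # single-node view: 95.0
```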

Extra Meters/Gauges required?

Meters we would need to add to support all of the pieces in here:

  • Tx lag / tx id monitoring.
  • "Query_failed" and "Transaction_failed", essentially, whenever either fails for whatever reason, can mark an error count.
  • "Query_Timeout" and "Transaction_Timeout" same as the above but for timeouts.
    • Could potentially be part of the "failed" count, and we could mark a label against it, ie, "tx_failed{type: timeout}".
    • Do queries time out??
  • Buffer pool meters - these form our "Disk Usage" monitoring:
    • "Bytes_Written" / "Bytes_Read" (from putBuffer/getBuffer) for getting disk usage rates.
    • "GetBuffer" timer for understanding "get time from blob storage".
    • "PutBuffer" timer (also for multipart) for understanding "put time from blob storage"
  • Some kind of "Compaction Job timer"
    • Can use this to get an idea of how long compaction jobs take.
    • Can also use this to get a count of compaction jobs.
    • Realize it's a slightly lower-level detail, though we've seen similar on other dashboards, and compaction is a pretty large source of memory usage in XTDB.
  • Cache hit/miss metrics (want these split by cache)
    • Might be provided these by Caffeine itself.
    • Arguable whether these should be in the monitoring board, but they ARE somewhat important for performance and show up in a lot of other examples.
  • Active/Total Connections metrics.
    • May be a bit bespoke we mark these on Jetty/Pgwire connections.
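As a rough sketch of the labelled error counter ("tx_failed{type: timeout}") and the compaction job timer ideas from the list above, here is a minimal, Micrometer-free Python mock. All names are illustrative; the real implementation would use Micrometer's Counter/Timer with tags:

```python
import time
from collections import defaultdict

# Minimal mock of labelled counters and timers (illustrative only; the real
# implementation would use Micrometer Counter/Timer with tags).
counters = defaultdict(int)
timers = defaultdict(list)  # metric name -> list of durations (seconds)

def mark_failure(metric, failure_type):
    # One counter per (metric, label) pair, e.g. tx_failed{type=timeout},
    # so timeouts can live under the "failed" count with a label.
    counters[f"{metric}{{type={failure_type}}}"] += 1

def time_job(metric, job):
    # Records duration; a job *count* falls out for free (len of the list).
    start = time.perf_counter()
    try:
        return job()
    finally:
        timers[metric].append(time.perf_counter() - start)

mark_failure("tx_failed", "timeout")
mark_failure("tx_failed", "error")
mark_failure("tx_failed", "timeout")
time_job("compaction_job", lambda: sum(range(1000)))

print(counters["tx_failed{type=timeout}"])  # 2
print(len(timers["compaction_job"]))        # 1 job recorded
```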

Previously done work

For each of the above, particularly for Queries, Transactions, JVM metrics and overall health, we will want to look at prior art/pre-existing dashboards for inspiration on what people want to see/understand when running their database application.

  • [x] Look at prior art for monitoring databases.
  • [x] #3871
  • [x] Gathering application logs to Grafana? (Perhaps using Loki) - @Akeboshiwind
  • [x] #3882

TODOs:

  • [x] #3896
  • [ ] #3897
  • Adding/supporting new gauges & visualizations to the cluster monitoring dashboard:
    • [x] #3898
    • [x] #3899
    • [x] #3900
    • [x] #3901
    • [ ] Add monitoring of Cache hit/miss metrics, (split by cache).
    • [ ] Add monitoring of the Compaction process

danmason avatar Nov 14 '24 16:11 danmason

We can make a separate card for this, but as well as our monitoring dashboard I believe we should have a separate debugging dashboard (based on what we currently have). Previous thinking on contents there would include:

  • Queries (Read)
    • (timers, counts, rates, etc)
    • On top of this, some ability to find/save/examine what the slow running queries are
  • Transactions (write)
    • (timers, counts, rates, etc)
  • JVM metrics.
  • Direct memory usage graphs.
  • Disk usage metrics (for all non memory-only bufferpools)
  • Buffer pool ops.
  • Indexer ops.
  • Compactor ops.
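The "find/save/examine slow running queries" point above could work roughly like a capped slow-query log. A hypothetical Python sketch (class name, capacity, and queries are all illustrative, not anything XTDB ships):

```python
import heapq

# Hypothetical capped slow-query log: keep only the N slowest queries seen.
class SlowQueryLog:
    def __init__(self, capacity=5):
        self.capacity = capacity
        # Min-heap of (duration, query): the fastest of the kept
        # entries sits at the root and is evicted first.
        self._heap = []

    def record(self, query, duration_s):
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, (duration_s, query))
        elif duration_s > self._heap[0][0]:
            heapq.heapreplace(self._heap, (duration_s, query))

    def slowest(self):
        return sorted(self._heap, reverse=True)

log = SlowQueryLog(capacity=2)
log.record("SELECT 1", 0.001)
log.record("SELECT * FROM big", 4.2)
log.record("SELECT count(*) FROM big", 2.7)
print(log.slowest())
# [(4.2, 'SELECT * FROM big'), (2.7, 'SELECT count(*) FROM big')]
```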

danmason avatar Nov 18 '24 15:11 danmason

Just noting for later reference, micrometer has a number of other useful MeterBinder implementations we could use: https://www.javadoc.io/doc/io.micrometer/micrometer-core/1.1.0/io/micrometer/core/instrument/binder/MeterBinder.html

Of particular interest:

  • CaffeineCacheMetrics - for cache hits/misses, other metrics?
  • DiskSpaceMetrics - potentially useful for measuring local disk cache usage?
  • JettyStatisticsMetrics - for measuring HTTP connections et al?
    • Probably better with something bespoke, since "connections" are shared between HTTP and PGWire?
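For reference, the hit/miss numbers that something like CaffeineCacheMetrics exposes boil down to counting per-cache hits and misses on lookup. A minimal Python mock of that idea (illustrative only; Caffeine records these itself and CaffeineCacheMetrics binds them to Micrometer):

```python
# Minimal mock of per-cache hit/miss metrics (illustrative only; Caffeine
# records these itself and CaffeineCacheMetrics exposes them to Micrometer).
class InstrumentedCache:
    def __init__(self, name):
        self.name = name        # lets metrics be split per cache
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key, loader):
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = loader(key)
        return self._store[key]

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

buffer_cache = InstrumentedCache("buffer-pool")
for k in ["a", "b", "a", "a"]:
    buffer_cache.get(k, lambda key: key.upper())

print(buffer_cache.hits, buffer_cache.misses)  # 2 hits, 2 misses
print(buffer_cache.hit_ratio())                # 0.5
```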

danmason avatar Nov 18 '24 17:11 danmason