metrics: Cache Performance
Cache Performance Metrics
Right now we expose little information about the new caches and their performance, so this ticket adds some basic metrics to track.
These names follow the convention we currently use for Parquet cache metrics.
Update the /metrics endpoint to serve the following metrics:
New Last Value Cache Metrics:
- [ ] influxdb3_last_values_cache_response_duration_seconds_bucket: Time to complete LVC queries bucketed into 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 10, inf seconds, per database, per cache (https://github.com/influxdata/influxdb/pull/26388)
- [ ] influxdb3_last_values_cache_response_duration_seconds_sum: Total time spent waiting on response from LVC, per database, per cache (https://github.com/influxdata/influxdb/pull/26388)
- [ ] influxdb3_last_values_cache_response_duration_seconds_count: Number of queries to the LVC, per database, per cache (https://github.com/influxdata/influxdb/pull/26388)
- [ ] influxdb3_last_values_cache_size_bytes: Total size of LVC in bytes, per database, per cache
- [ ] influxdb3_last_values_cache_refresh_count: Total number of times the LVC has been refreshed
New Distinct Value Cache Metrics:
- [ ] influxdb3_distinct_values_cache_response_duration_seconds_bucket: Time to complete DVC queries bucketed into 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 10, inf seconds, per database, per cache
- [ ] influxdb3_distinct_values_cache_response_duration_seconds_sum: Total time spent waiting on response from DVC, per database, per cache
- [ ] influxdb3_distinct_values_cache_response_duration_seconds_count: Number of queries to the DVC, per database, per cache
- [ ] influxdb3_distinct_values_cache_size_bytes: Total size of DVC in bytes, per database, per cache
- [ ] influxdb3_distinct_values_cache_refresh_count: Total number of times the DVC has been refreshed
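For reference, here is a rough sketch of how the `_bucket`, `_sum`, and `_count` series relate to each other in a cumulative, Prometheus-style histogram. It is illustrative only and not the influxdb3 implementation: the `Histogram` class and its `render` helper are hypothetical, and the label set is passed in by the caller, so it makes no assumption about which labels end up being used. The bucket bounds are copied from the list above.

```python
import math

# Bucket upper bounds in seconds, copied from the metric descriptions above.
BUCKETS = [0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 10, math.inf]


class Histogram:
    """Hypothetical cumulative histogram, for illustration only."""

    def __init__(self, bounds=BUCKETS):
        self.bounds = list(bounds)
        self.bucket_counts = [0] * len(self.bounds)
        self.sum = 0.0   # backs the _sum series
        self.count = 0   # backs the _count series

    def observe(self, seconds):
        self.sum += seconds
        self.count += 1
        # Buckets are cumulative: an observation is counted in every
        # bucket whose upper bound is >= the observed value.
        for i, bound in enumerate(self.bounds):
            if seconds <= bound:
                self.bucket_counts[i] += 1

    def render(self, name, labels):
        """Return exposition-style lines; assumes `labels` is a non-empty dict."""
        base = ",".join(f'{k}="{v}"' for k, v in labels.items())
        lines = []
        for bound, n in zip(self.bounds, self.bucket_counts):
            le = "+Inf" if math.isinf(bound) else str(bound)
            lines.append(f'{name}_bucket{{{base},le="{le}"}} {n}')
        lines.append(f"{name}_sum{{{base}}} {self.sum}")
        lines.append(f"{name}_count{{{base}}} {self.count}")
        return lines
```

Each observed query duration updates all three series at once, which is why the `+Inf` bucket always equals `_count`.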
@peterbarnett03 can you clarify what the _refresh_count metric is referring to? I'm not sure what you mean by, e.g., "Total number of times the LVC has been refreshed".
The number of times the LVC and DVC are updated. The point is that we can then assess the rate of refresh, which will (presumably) be useful for understanding performance and setup, plus how well the LVC/DVC are being used.
I see, so, if a write comes in and leads to a cache being updated, then we increment the counter.
I figure it would be helpful to also have the refresh metrics labelled by database.
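To make that concrete, here is a minimal sketch of a refresh counter labelled by database, plus how a refresh rate could be derived from two scrapes. The `RefreshCounter` class and `rate_per_second` helper are hypothetical names for illustration, and the sketch assumes the counter is incremented once per cache update.

```python
from collections import Counter


class RefreshCounter:
    """Hypothetical monotonically increasing counter, keyed by database."""

    def __init__(self):
        self.by_database = Counter()

    def record_refresh(self, database):
        # Assumption: called once each time a write causes a cache update.
        self.by_database[database] += 1


def rate_per_second(previous, current, interval_seconds):
    """Refresh rate between two scrapes of the same counter."""
    return (current - previous) / interval_seconds
```

For example, if a scrape reads 100 and the next one 60 seconds later reads 160, `rate_per_second(100, 160, 60)` gives one refresh per second for that database.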
I am hesitant to label metrics for individual caches. The cache name is unique within a table. So, if we want the metrics to be labelled at the cache level, there needs to be a database, table, and cache label. The cost of that could be non-trivial in a setup with thousands of caches, and would make the /metrics output explode.
I think for a starting point, we should only label by database, as we have done for ingest metrics.
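A back-of-the-envelope sketch of the difference in /metrics output size, assuming the metric set proposed in this issue (13 histogram buckets, `_sum`, `_count`, `_size_bytes`, and `_refresh_count` per label set). The database and cache counts below are made-up numbers for illustration, not measurements.

```python
# Series emitted per label set, per cache type, under the proposal above:
# 13 buckets + _sum + _count + _size_bytes + _refresh_count.
SERIES_PER_LABEL_SET = 13 + 2 + 2

CACHE_TYPES = 2  # LVC and DVC metric families


def series_labelled_by_database(databases):
    """One label set per database for each cache type."""
    return databases * CACHE_TYPES * SERIES_PER_LABEL_SET


def series_labelled_by_cache(total_caches):
    """One label set per (database, table, cache) triple."""
    return total_caches * SERIES_PER_LABEL_SET


# Hypothetical deployment: 5 databases, 3000 caches in total.
per_database = series_labelled_by_database(5)
per_cache = series_labelled_by_cache(3000)
```

With these hypothetical numbers, labelling by database yields 170 series, while labelling by database, table, and cache yields 51,000.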
Alright, I can get on board with that. If we need to expand over time we can, but it'll be a good starting point and will help us know how much usage of caches in general there is.
While working on https://github.com/influxdata/influxdb/pull/26388, I realized that it is probably better to track duration only for successful queries. Failed queries to the cache will likely fail very quickly, so I don't think tracking their duration is meaningful.
Furthermore, queries to the LVC only fail if:
- user provides invalid query filter predicates
- some internal error from invalid catalog or database state occurs (if this happens, they likely have other problems)
Given that, I could add an influxdb3_last_values_cache_failed_queries_total if we want to track the count of failures, but the majority of these would be due to user error (invalid filter predicates), so I'm not sure if it's worth adding.
One might argue, if a user is looking at their dashboard and going, "for Pete's sake, why are there no LVC query metrics?" then a metric tracking failures would quickly indicate that their LVC queries are happening, but failing, vs. not happening at all 🤷
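A minimal sketch of the behaviour under discussion: duration is recorded only when the query succeeds, and failures just bump a counter. The `CacheQueryMetrics` class and `timed_query` method are hypothetical names, not code from the PR, and `failed_queries_total` stands in for the proposed influxdb3_last_values_cache_failed_queries_total.

```python
import time


class CacheQueryMetrics:
    """Hypothetical recorder: durations for successes, a count for failures."""

    def __init__(self):
        self.durations = []            # seconds, successful queries only
        self.failed_queries_total = 0  # stands in for the proposed counter

    def timed_query(self, run_query, clock=time.perf_counter):
        start = clock()
        try:
            result = run_query()
        except Exception:
            # Failed queries are counted but their duration is not recorded.
            self.failed_queries_total += 1
            raise
        self.durations.append(clock() - start)
        return result
```

With this shape, a dashboard showing zero recorded durations but a non-zero failure count would indicate that queries are happening but failing.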