
Reuse cache files within single query operation

Open gabrielreid opened this issue 5 years ago • 11 comments

Take advantage of the fact that SearchResults are written to a local file to reuse the SearchResult within a single REST query (consisting of multiple QueryMetrics) for improved performance and correctness.

The general use case is that multiple aggregations would be queried over the same timeline data. For example, consider that min, max, mean, p50, p95, and p99 are requested for a given metric over the past week grouped by hour.

Before this PR, such a query would perform the same query six times against the underlying datastore. This is sub-optimal for performance, but it is also somewhat incorrect, as it doesn't give a consistent view of the underlying data: new data points could arrive between the queries for mean and p50, meaning that min, max, and mean would be calculated from different data than p50, p95, and p99.
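A toy sketch in plain Python (not KairosDB code) of the consistency issue: two aggregations that each re-read the datastore can end up seeing different data.

```python
import statistics

# Simulated datastore: three points exist when the first query runs.
datastore = [10.0, 20.0, 30.0]

mean_snapshot = list(datastore)   # first query (mean) reads the data
datastore.append(100.0)           # a new point arrives in between
p50_snapshot = list(datastore)    # second query (p50) reads more data

mean = statistics.mean(mean_snapshot)   # 20.0, computed over 3 points
p50 = statistics.median(p50_snapshot)   # 25.0, computed over 4 points

# With a single shared (cached) result set, both aggregations would
# have been computed over the same 3 points.
print(mean, p50)
```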

After this commit, the above correctness issue is resolved, and the performance of the above query improves by close to 6x (in cases where there are many data points involved and querying the underlying datastore is the bottleneck).

gabrielreid avatar Oct 25 '18 07:10 gabrielreid

KairosDB already does this. Take a look at http://kairosdb.github.io/docs/build/html/restapi/QueryMetrics.html and search for cache_time.

Basically, if you have it set on a query, it will reuse previous searches that match the same time range of data, and it does exactly what you are after.
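For illustration, a minimal query using cache_time (in seconds) might look roughly like this; the metric name and aggregator are placeholders, and the linked QueryMetrics docs have the full schema:

```json
{
  "cache_time": 60,
  "start_relative": { "value": "48", "unit": "hours" },
  "metrics": [
    {
      "name": "abc.123",
      "aggregators": [
        { "name": "sum", "sampling": { "value": 10, "unit": "minutes" } }
      ]
    }
  ]
}
```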

brianhks avatar Oct 01 '19 18:10 brianhks

@brianhks I believe I understand what @gabrielreid was trying to fix as I have the same use-case and potentially the same problem today. I haven't tried out his change though.

The local cache works well for me IF a query for the data has already completed. The queries must be issued sequentially. For example:

00:00:00 Query A: sum of data for timeframe X
00:00:04 Query A completes in 4 seconds
00:00:06 Query B: avg of data for timeframe X, using a cache_time of, say, 1 minute
00:00:07 Query B completes in 1 second because it used the local cache

The key point was that query A completes, and then query B is issued to KairosDB. If I do a single JSON query asking for both at the same time, they take 4+4 seconds to complete.

What I want to be able to do is the following. Note that the tags and name are exactly the same; only the aggregation differs.

   "metrics": [
       {
           "tags": {
               "host": ["foo"],
               "customer": ["bar"]
           },
           "name": "abc.123",
           "aggregators": [
               {
                   "name": "sum",
                   "sampling": {
                       "value": 10,
                       "unit": "minutes"
                   }
               }
           ]
       },
       {
           "tags": {
               "host": ["foo"],
               "customer": ["bar"]
           },
           "name": "abc.123",
           "aggregators": [
               {
                   "name": "avg",
                   "sampling": {
                       "value": 10,
                       "unit": "minutes"
                   }
               }
           ]
       }
   ]

In my testing, if I do the above JSON there is no performance benefit from the local cache. But if I split it up into two sequential queries one after another, then the second query (and third, fourth etc) is much faster.

rdzimmer-zz avatar Oct 01 '19 19:10 rdzimmer-zz

What you described is exactly what I designed this for. What you're missing is the cache_time parameter on the query. Queries within a single request are executed sequentially, so your second query will benefit from the cache if you set a cache_time. I've tested exactly what you are doing, and it did work at one time; it could be broken, but I doubt it.

brianhks avatar Oct 01 '19 19:10 brianhks

Sorry, I just modified the metric part of http://kairosdb.github.io/docs/build/html/restapi/QueryMetrics.html to be an example of what I was testing. I left off the bottom part of the query. Looking back at my test cases I see I was using:

  "plugins": [],
  "cache_time": 60,
  "start_relative": {
    "value": "48",
    "unit": "hours"
  }
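Putting those top-level fields together with the two-metric request from the earlier comment gives a complete POST body, roughly (same placeholder metric name and tags as above):

```json
{
  "cache_time": 60,
  "plugins": [],
  "start_relative": { "value": "48", "unit": "hours" },
  "metrics": [
    {
      "name": "abc.123",
      "tags": { "host": ["foo"], "customer": ["bar"] },
      "aggregators": [
        { "name": "sum", "sampling": { "value": 10, "unit": "minutes" } }
      ]
    },
    {
      "name": "abc.123",
      "tags": { "host": ["foo"], "customer": ["bar"] },
      "aggregators": [
        { "name": "avg", "sampling": { "value": 10, "unit": "minutes" } }
      ]
    }
  ]
}
```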

I definitely see the benefit when I split the metrics requests into two sequential POSTs, so the cache_time flag is working there.
I will go back and revisit the tests as soon as I'm able to double-check things. Is there any logging or tracing that might help verify what it's doing? I'm on the latest release.

The only difference I see between the example I posted above and my actual test data is that my request has no tags: "tags": {}. I wonder if that could be exercising a code path that breaks it?

rdzimmer-zz avatar Oct 01 '19 20:10 rdzimmer-zz

To clarify, there are some tags in the test data, just not in the queries. For example, tags for CPUs 0, 1, 2, and 3 being turned into an avg across all 4 CPUs.

rdzimmer-zz avatar Oct 01 '19 20:10 rdzimmer-zz

OK, now I remember. People have had issues with the cache growing out of control, and most don't use it, so it was turned off by default in a recent release. You need to set kairosdb.query_cache.keep_cache_files: true

Then it will work. The speedup you are seeing now is actually just C* (Cassandra) caching the results, but I guarantee that if you set cache_time and turn on the cache files it will be a lot faster.

brianhks avatar Oct 01 '19 21:10 brianhks

Just to clarify, the underlying intention of this PR was twofold:

  • to make use of the cache file storage that is done regardless of caching settings and cache_time (i.e. to run queries more efficiently without requiring non-default settings or request parameters)
  • to provide a consistent view of the underlying data within the scope of a single query (for the case where multiple aggregations are performed on the same set of data within a single query)

I understand the underlying mechanism of turning on query_cache.keep_cache_files and providing a cache_time parameter in queries, but I see the general intention of this PR as something different.

gabrielreid avatar Oct 02 '19 06:10 gabrielreid

Yes, I was setting kairosdb.query_cache.keep_cache_files: true. I was watching the /tmp/kairos_cache/ (default) dir and confirmed the files were being created. I did some tests both with and without cache_time in the query, and I could see a difference in performance compared to the expected Cassandra speedup when the data is fresh in its memory. It was actually just a few weeks ago that I was doing this, but let me go back and double-check everything.

@gabrielreid "make use of the cache file storage that is being done regardless of caching settings and cache_time": that's exactly what I was hoping could be done! "Within the scope of a single query" is the key qualifier.

So I think there are 2 separate things:

  1. Allowing "caching" to help within the scope of a single query (a single JSON POST with multiple aggregations), regardless of cache settings. That sounds like what this pull request handles.
  2. Why I was not seeing a benefit within a single query when I had caching enabled (both in the KairosDB properties and via the query's cache_time flag). I did not open an issue, since I did get a benefit from sequential queries and assumed it was working as designed. If I can recreate it, I'll open a separate issue for tracking that, since it is not the same as this pull request.

rdzimmer-zz avatar Oct 02 '19 13:10 rdzimmer-zz

Sorry, based on what I'm seeing now, you're right @brianhks. I'm not sure what I was doing wrong before that made me think it wasn't working that way; maybe I didn't understand or know what to expect.

However, I do think that the pull request should still be considered. Within the scope of a single query, using the cache makes sense to me, regardless of cache settings. At least for my personal use-case, that's the only time I'd use the cache, so I don't want to have to worry about the files and pruning.

rdzimmer-zz avatar Oct 02 '19 20:10 rdzimmer-zz

Yes, I agree that within the context of a single query it makes the most sense. The other use I had thought of is several TVs showing the same dashboards: they can use cache_time to reduce load on the cluster. I like the idea, just not as comfortable with how it was done. I'll need to think it over a bit more.

brianhks avatar Oct 02 '19 22:10 brianhks

For the historical record, here's the test I did.
I generated data for metrics with 1000 tag-value combinations each, over several hours. Maybe not the most realistic combination, but for cache-testing purposes it works (it generated a lot of data quickly).

I then queried the last 4 hours of data for a metric, with 10 individual aggregations in a single JSON POST, so each aggregation was working against 240,000 datapoints. These averaged 4-5 seconds to return. I ran them for a while, so Cassandra caching would also have been improving the non-cached results. I used a 10-second cache_time to make sure each pass through the loop had to repopulate the cache.

I then ran the same 10 aggregation requests one after another as individual calls to KairosDB (still with caching). The response times looked like this:

0.922
0.392
0.353
0.410
0.386
0.385
0.883
0.459
0.392
0.412

For a total of 4.994 seconds. The first request was always around 1 second, with the others usually in the 0.4-second range. So IF the cache wasn't working, the single request to KairosDB should have taken roughly 10 × 1 second = 10 seconds. But it was taking 4-5 seconds, which mirrors the total for the 10 individual requests. So all good! Sorry again for the scare; not sure what I did wrong before!
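As a quick arithmetic check, the ten response times listed above can be summed:

```python
# Response times (seconds) for the 10 sequential cached requests above.
times = [0.922, 0.392, 0.353, 0.410, 0.386,
         0.385, 0.883, 0.459, 0.392, 0.412]

total = round(sum(times), 3)
print(total)  # just under 5 seconds, in line with the single 10-aggregation request
```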

rdzimmer-zz avatar Oct 03 '19 14:10 rdzimmer-zz