Gaps aggregator queries can produce huge (GB-sized) cache files
Running KairosDB 1.1.1, I noticed that a query with the gaps aggregator and a small sampling interval (such as one millisecond) over a time series with many data points (say, a week's worth of data or so with data points every 10 seconds) produces a GB-sized cache file which eventually consumes my entire disk, since once I have submitted the query there is no way of stopping it.
I used this Python script to populate my database with 10 days' worth of dummy data (one data point every ten seconds).
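For illustration, a minimal sketch of such a population script (not the exact script referenced above; the endpoint, metric name, tag and random values are assumptions, and it presumes a local KairosDB on its default port 8080 plus the Python requests package) could look like this:

# Illustrative sketch only: push 10 days of dummy data, one point every
# 10 seconds, to a local KairosDB via its REST write endpoint.
import random
import time

import requests

KAIROS_URL = "http://localhost:8080/api/v1/datapoints"
METRIC = "my.metric"
STEP_MS = 10 * 1000                        # one data point every 10 seconds
END_MS = int(time.time() * 1000)
START_MS = END_MS - 10 * 24 * 3600 * 1000  # 10 days back
BATCH = 5000                               # points per HTTP request

points = []
for ts in range(START_MS, END_MS, STEP_MS):
    points.append([ts, random.random()])
    if len(points) >= BATCH:
        payload = [{"name": METRIC, "tags": {"host": "dummy"}, "datapoints": points}]
        requests.post(KAIROS_URL, json=payload).raise_for_status()
        points = []

if points:  # flush the remainder
    payload = [{"name": METRIC, "tags": {"host": "dummy"}, "datapoints": points}]
    requests.post(KAIROS_URL, json=payload).raise_for_status()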
After that I ran a query spanning the entire time interval and using the gaps aggregator with a small sampling interval:
{
  "metrics": [
    {
      "tags": {},
      "name": "my.metric",
      "aggregators": [
        {
          "name": "gaps",
          "align_sampling": true,
          "sampling": {
            "value": "1",
            "unit": "milliseconds"
          }
        }
      ]
    }
  ],
  "cache_time": 0,
  "start_absolute": 1456786800000,
  "end_absolute": 1457564400000
}
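For completeness, a query body like the one above can be submitted to KairosDB's REST query endpoint roughly as follows (an illustrative sketch assuming a local instance on the default port 8080 and the Python requests package; query.json is a placeholder file containing the JSON shown above):

# Submit the query above to a local KairosDB instance.
import json

import requests

with open("query.json") as f:
    query = json.load(f)

resp = requests.post("http://localhost:8080/api/v1/datapoints/query", json=query)
resp.raise_for_status()
print(len(resp.content), "bytes of JSON returned")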
While the query is being processed I watch the kairosdb cache file grow, and grow, and GROW:
ls -alh /tmp/kairos_cache/1457619772264/
total 5.2G
...
-rw-r--r-- 1 root root 5.2G Mar 10 14:39 kairos7222588542162754088.json
The attached thread dump shows what was going on in each of the KairosDB threads.
I'm not sure if it's a proper bug (maybe this behavior is expected and/or I was using the aggregator in a stupid fashion); however, the consequences of running such a query are quite severe, as it can bring down the entire server.
Hi, interesting case. Why are you doing this?
The gaps aggregator creates one null sample per selected time period where there's no data.
With a period of 1 ms and a 10-day query, you are asking KairosDB to return results with roughly 860,000,000 data points, 99.99% of which will be null.
This is why your cache file becomes so big. No surprise this takes a lot of time and is likely to kill KairosDB. Even if the query succeeded, no client would be able to receive and parse the JSON results in an acceptable amount of time.
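To put concrete numbers on that for the exact query above (a simple back-of-the-envelope calculation):

# Back-of-the-envelope count for the query above: the gaps aggregator emits
# one output point per 1 ms sampling interval across the requested range.
start_ms = 1456786800000
end_ms = 1457564400000

intervals = end_ms - start_ms        # 777,600,000 one-millisecond periods
stored_points = intervals // 10_000  # one real point every 10 s -> 77,760
null_fraction = 1 - stored_points / intervals

print(f"{intervals:,} output points, {null_fraction:.2%} of them null")
# -> 777,600,000 output points, 99.99% of them null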
This aggregator is designed so that drops in the data can be plotted in a web UI such as Grafana. Nobody would ever want to plot 900M points on a screen.
What is your use case?

On 10 March 2016 at 3:52 PM, "Peter Gardfjäll" [email protected] wrote:
thread-dump.txt https://github.com/kairosdb/kairosdb/files/167340/thread-dump.txt
push-data.txt https://github.com/kairosdb/kairosdb/files/167345/push-data.txt
I can't say that I have a real use case for using the gaps aggregator in that fashion. I kind of discovered this issue by coincidence while trying to figure out how to write a query that produces a certain outcome.
I have also experienced similar behavior earlier (i.e., a KairosDB query consuming practically all my disk -- or at least it would have, had I not killed it), and although I can't say I'm 100% certain, I believe that those queries did not make use of the gaps aggregator. So my thought is that maybe this issue is just a symptom of a bigger problem. Are there perhaps other aggregators that can produce similar disk-consuming black holes?
I understand that the sample query above doesn't use the gaps aggregator in a sensible fashion. But still, it bothers me that a single badly chosen query (be it malicious or unintended) can bring a server to its knees (when the disk runs out, all bets are off with respect to system behavior).
Could perhaps some kind of warning/prevention system be considered to reject potentially "dangerous" queries? For example, reject a query that would produce a number of data points higher than a certain (configurable) threshold, unless the client query includes a force: true parameter or similar. I don't know if this is feasible with the current architecture; I'm just thinking out loud here.
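Purely to illustrate the idea, a rough sketch of such a pre-flight check (this is not existing KairosDB code; the threshold value and the force flag are hypothetical) might look like:

# Conceptual sketch only: estimate how many points a fixed-interval
# aggregator would emit and reject the query unless a hypothetical
# "force" flag is set.
MAX_DATAPOINTS = 1_000_000  # imagined configurable threshold

def check_query(start_ms, end_ms, sampling_ms, force=False):
    estimated = (end_ms - start_ms) // sampling_ms
    if estimated > MAX_DATAPOINTS and not force:
        raise ValueError(
            f"query would emit ~{estimated:,} data points "
            f"(limit {MAX_DATAPOINTS:,}); resubmit with force=true to override")
    return estimated

# The query from this issue would be rejected up front:
# check_query(1456786800000, 1457564400000, 1)  ->  ValueError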
What do you think?
I think the same. That is why, on our custom system, we have added limit aggregators: one for the number of points, another for the number of groups. They throw an exception if the count is above X points or N groups, and they can be customized or deactivated. I did not make any PR for that because they are very limiting and can become cumbersome: users need to deactivate them for large queries, which can also be frustrating. But that can be a solution for you as well.
Even though we did not share their source code, those aggregators are very simple to code.
This solves most possible issues of the kind. But it doesn't solve the fact that if you request too many points from Cassandra (e.g. GBs to TBs of raw data), you may also bring the system to its knees before aggregation can even occur.
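As a conceptual sketch of the point-count limit described above (in Python for brevity, whereas a real KairosDB aggregator would be a Java plugin; the threshold value is made up):

# Conceptual sketch only: pass data points through untouched and abort
# once a configurable count is exceeded, as the limit aggregator does.
def limit_points(datapoints, max_points=1_000_000):
    count = 0
    for dp in datapoints:
        count += 1
        if count > max_points:
            raise RuntimeError(
                f"result exceeds {max_points:,} data points; "
                "narrow the query or raise/disable the limit")
        yield dp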
Regards, Loic COULET
I think there should be a server-side configuration that puts limits on queries and kills them if they go too far. In large environments where many people are looking at the data, you don't want one person killing the system with a dumb query.