
fill should force a return of values, even if the query has none

Open beckettsean opened this issue 7 years ago • 39 comments

Feature Request

Could be considered a bug, really a matter of perspective.

Proposal: [Description of the feature] If there is a fill() clause attached to the query, the query should return all buckets specified, even if there are no matching points.

Current behavior: [What currently happens] If the query has nothing but null results, the fill() clause is ignored.

Simple measurement with one point from yesterday:

> select * from fillfoo
name: fillfoo
-------------
time            value
2016-07-05T20:20:00Z    42

> 

Selecting a SUM with a GROUP BY time() and fill() clauses returns all expected buckets, with the fill() clause applied to buckets with no data:

> select sum(value) from fillfoo where time > now() - 21h group by time(4h) fill(0)
name: fillfoo
-------------
time            sum
2016-07-05T16:00:00Z    0
2016-07-05T20:00:00Z    42
2016-07-06T00:00:00Z    0
2016-07-06T04:00:00Z    0
2016-07-06T08:00:00Z    0
2016-07-06T12:00:00Z    0
2016-07-06T16:00:00Z    0

However, if every bucket has no data, fill() is not invoked.

> select sum(value) from fillfoo where time > now() - 20h group by time(4h) fill(0)
> 

Desired behavior: [What you would like to happen] fill() should be used to populate all buckets that don't otherwise have data, including when all buckets are null.

> select sum(value) from fillfoo where time > now() - 20h group by time(4h) fill(0)
name: fillfoo
-------------
time            sum
2016-07-06T00:00:00Z    0
2016-07-06T04:00:00Z    0
2016-07-06T08:00:00Z    0
2016-07-06T12:00:00Z    0
2016-07-06T16:00:00Z    0
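Until something changes server-side, the desired behavior above can be emulated on the client. The sketch below is illustrative only (the function name and row format are assumptions, not part of any InfluxDB client library): when a GROUP BY time() query comes back empty, synthesize one zero-valued bucket per interval, flooring the window start to the interval boundary the way InfluxDB does.

```python
from datetime import datetime, timedelta, timezone

def fill_empty_result(rows, start, end, interval, fill_value=0):
    """Emulate fill() client-side: if the query returned no rows,
    synthesize one filled bucket per GROUP BY time() interval."""
    if rows:
        return rows
    # Floor the window start to the interval boundary, as InfluxDB does.
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    secs = int(interval.total_seconds())
    t = epoch + timedelta(seconds=((start - epoch).total_seconds() // secs) * secs)
    out = []
    while t < end:
        out.append({"time": t.isoformat(), "sum": fill_value})
        t += interval
    return out
```

A non-empty result passes through untouched, so the wrapper is safe to apply to every query.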

Use case: [Why is this important (helps with prioritizing requests)]

This is not expected behavior and leads to confusion for users who expect fill() to always be applied:

https://groups.google.com/d/msgid/influxdb/6b233752-edd2-4523-9022-5c83dbcae344%40googlegroups.com
https://github.com/influxdata/influxdb/issues/6412
https://github.com/influxdata/influxdb/issues/6953

beckettsean avatar Jul 06 '16 16:07 beckettsean

Whenever this comes up, the problem is that we don't know what series to return. In the use case given by the example, fill seems intuitive because you have no tags so there is only one series. In most use cases, there's more than one series. We can always use the index to determine what series exist, but then fill() may return filled-in values for series that are long since dead. If you use AWS autoscaling, Mesos containers, Docker, etc. you might have many different series for all of the different instances you have. If you only want the last five minutes, should we fill in values for the last 5 minutes on a series that hasn't been relevant in a year? This seems unintuitive, but we would be required to do that for this to work. If we only use series that have a value within the last 5 minutes, you get the current behavior.

Until there is a way to determine what the user wants, I'm not sure there's anything we can do for this. My ears are open if there's some good way to do this, because the workflow that's currently missing is a necessary one, but I just haven't been able to come up with anything.

jsternberg avatar Jul 06 '16 18:07 jsternberg

@jsternberg thanks for reminding us of the tags issue. Obviously if we could return all the tags that the user intends, even though there are no tags in the interval, that would be ideal. However, InfluxDB cannot (yet) read minds, so it's also impractical.

That leaves us with two choices, as I see it:

  1. return nothing
  2. return buckets for a single series

I suspect that the latter is less surprising to most people. At least it returns something, rather than nothing. It may not be the entirety of what is desired, but it does respect the intent of fill().

Is there a reason we can't just return one series, using only the measurement name and with no tag set, but respecting the fill() parameters?

beckettsean avatar Jul 06 '16 18:07 beckettsean

I think the hard part would be determining which series to use. If they have a condition and it matches two series, should we output those or return nothing? This also becomes more complicated when you add multiple tags. I can constrain the query to only look at one tag through a WHERE condition, but maybe there's another tag that is contained within that series. I guess we could try allowing it only if there's one series that matches the conditions, but it seems like it might be unintuitive. I'm also not 100% certain the code is in a way where we can physically do that so I'm not promising anything yet.

If either of these is worth exploring though, we can add it as an exploratory option for 1.1 and see what it looks like to determine if it fits our users' needs. I just have a lingering concern that it will create an extra edge case that will be confusing to explain.

jsternberg avatar Jul 06 '16 18:07 jsternberg

@jsternberg I'm advocating for not even trying to determine the series. In essence, execute the query. If there are no buckets in the return, artificially create them with the fill() condition for the series with no tags. Just return it with measurement only as the series.

E.g.

> select max(value) from fillfill where time > now() - 10h group by time(1h), * fill(7)
name: fillfill
tags: tag1=foo, tag2=bar
time            max
----            ---
2016-07-06T09:00:00Z    7
2016-07-06T10:00:00Z    42
2016-07-06T11:00:00Z    7
2016-07-06T12:00:00Z    7
2016-07-06T13:00:00Z    7
2016-07-06T14:00:00Z    7
2016-07-06T15:00:00Z    7
2016-07-06T16:00:00Z    7
2016-07-06T17:00:00Z    7
2016-07-06T18:00:00Z    7
2016-07-06T19:00:00Z    7

> select max(value) from fillfill where time > now() - 8h group by time(1h), * fill(7)
> 

becomes

> select max(value) from fillfill where time > now() - 10h group by time(1h), * fill(7)
name: fillfill
tags: tag1=foo, tag2=bar
time            max
----            ---
2016-07-06T09:00:00Z    7
2016-07-06T10:00:00Z    42
2016-07-06T11:00:00Z    7
2016-07-06T12:00:00Z    7
2016-07-06T13:00:00Z    7
2016-07-06T14:00:00Z    7
2016-07-06T15:00:00Z    7
2016-07-06T16:00:00Z    7
2016-07-06T17:00:00Z    7
2016-07-06T18:00:00Z    7
2016-07-06T19:00:00Z    7

> select max(value) from fillfill where time > now() - 8h group by time(1h), * fill(7)
name: fillfill
time            max
----            ---
2016-07-06T11:00:00Z    7
2016-07-06T12:00:00Z    7
2016-07-06T13:00:00Z    7
2016-07-06T14:00:00Z    7
2016-07-06T15:00:00Z    7
2016-07-06T16:00:00Z    7
2016-07-06T17:00:00Z    7
2016-07-06T18:00:00Z    7
2016-07-06T19:00:00Z    7

beckettsean avatar Jul 06 '16 19:07 beckettsean

:+1:

enumag avatar Aug 23 '16 12:08 enumag

jsternberg: I guess we could try allowing it only if there's one series that matches the conditions, but it seems like it might be unintuitive.

 

beckettsean: I'm advocating for not even trying to determine the series. In essence, execute the query. If there are no buckets in the return, artificially create them with the fill() condition for the series with no tags. Just return it with measurement only as the series.

 

I don't view these 2 ideas as exclusive of each other. In fact I think they support each other, and the combination is completely intuitive. The way I look at it, there is always at least one series. When you GROUP BY something, a grouping is all the available permutations of the different fields in the GROUP BY clause. If all fields are null (such as when they don't exist because there's no data), that's still one permutation (a.k.a. a bucket), thus you can do a fill() on the selected column for that permutation. And in our case, we will always have a minimum of one non-null field, the time field, as it's artificially created by influxdb.

phemmer avatar Sep 03 '16 17:09 phemmer

Hm, I don't necessarily think so, but I want to explore the idea a bit more before saying for certain. The first thing is I think you may have some misunderstandings or may be using the wrong words, so I'm just going to clarify a few things so that we can be on the same page.

Fields aren't a part of a series. Tags are a part of a series. GROUP BY doesn't operate on fields. There is also no time field except one that is implicitly created from a field being written. If there are no fields, there is no time since we don't have a record of any data being inserted at that time. Each field is put into its own columnar section and they are merged when multiple fields are requested. This is useful for selecting a single field since you don't have to decode all of the fields at a certain time to get one field.

But I do think you bring up a good point. I just don't think it works when I try to make it more generic to all circumstances. If we assume that a measurement always has at least one series (the one where all tags are empty), this works well when you don't use GROUP BY tagkey. Imagine that I have one point written.

cpu,host=server01 value=2 10000

If I use the following query:

SELECT sum(value) FROM cpu WHERE time >= now() - 30m GROUP BY time(5m), host

Does this have one or two series? I would say that this has one series (cpu,host=server01), but if we always assume there is at least one series that has an empty value for every tag key, we would get two series (cpu and cpu,host=server01) and would have to return values for both. I guess we could say, "If there are any values for this GROUP BY tagkey and there are none for the empty, drop the series with an empty tag key", but then we're starting to get to special exceptions in behavior which is the part I find unintuitive.

Is this the behavior you mean? If this is what you mean, I'll think about it a little more. This may be possible, but I need to think if there are any other mitigating circumstances and if the special exception for when there are no series is worth it.

jsternberg avatar Sep 03 '16 17:09 jsternberg

I am also running into the same issue - trying to use FILL(0) on a time range that has no data. I was expecting to have all buckets with a value of 0, but instead I get nothing.

As far as the issue that @jsternberg is bringing up, to me it would seem intuitive to still return the buckets, albeit with no tags i.e.:

cpu,host=server01 value=2 10000
SELECT sum(value) FROM cpu WHERE time >= now() - 30m GROUP BY time(5m), host

If there is data in this time range with a "host" tag, then we can safely assume a single series with host=server01, and we can fill each bucket accordingly. This is the current normal behavior.

If there is no data in this time range, then there are no values for the "host" tag, so we can safely assume a single series with host=null and FILL each bucket accordingly.

I guess we could say, "If there are any values for this GROUP BY tagkey and there are none for the empty, drop the series with an empty tag key", but then we're starting to get to special exceptions in behavior which is the part I find unintuitive.

I would say you wouldn't drop the series with the empty tag key, rather you wouldn't create it to begin with. You would only create the empty tag series if the query turned up no results. It seems pretty straightforward and intuitive to me. Just my 2 cents.

cnelissen avatar Oct 06 '16 21:10 cnelissen

Same problem here. Fixing this would help a lot.

jimaek avatar Oct 19 '16 12:10 jimaek

Hi guys, is there any news on this?

We are collecting metrics from SNMP devices; most of the error-related metrics are permanently '0', e.g. dot3Stats, ifInErrors/Discards.

In order to save a lot of disk space we implemented a filter that only sends and stores data to InfluxDB when the value is different from zero, and always reports one nonzero metric per measurement to let InfluxDB create those field names.

The expected behaviour after that implementation was that InfluxDB would fill those null values with '0' via the fill(0) statement, but as explained in the comments above, when there is no data it returns nothing.

sbengo avatar Jan 25 '17 13:01 sbengo

Hi.

Same question here. This is something we need, as our alerts in Kapacitor are never closed when the failure-detection query returns no points. The join() in Kapacitor is not executed and the alert is never closed.

Rgds.

AlexGrs avatar Feb 13 '17 15:02 AlexGrs

We have no progress on this. If you have any kind of idea that gets around the problem listed earlier in this thread for why we can't do this, we can debate if it works or not.

jsternberg avatar Feb 13 '17 15:02 jsternberg

We are having the same issue. Our trades measurement, which includes individual stock trades with trade size/volume, price, etc., doesn't have entries for all time periods we query:

select
    sum(volume) as trade_volume,
    count(price) as trade_count
from trades
where
    instrument_ticker = '%s' and exchange_ticker = '%s' and
    '%s' <= time and time < '%s'
group by time(%s) fill(0);

(Here instrument_ticker and exchange_ticker are tags)

We expect trade_volume and trade_count to be filled with zero for time bins/bars which have no trades (say every 5 minute "bar"). This works fine when there is at least one datapoint in the queried period. However, if the instrument and day we're looking at for some reason don't have any trades, this returns no entries. This is just plain unintuitive and forces the client-side application to account for it. As it stands, we now have to implement a special check in our client for when InfluxDB doesn't return anything, so that we can emulate a response.

I understand that there is a generalization issue, as described above; however, it should be solved somehow.

In terms I'm familiar with, my understanding of the issue is as follows: If in my query above, I didn't include the tag instrument_ticker, thus asking for many instruments in one exchange, what should it return for instruments which are long gone and closed shop, i.e., doesn't have any trades anymore?

An example of this could be as follows:

select sum(volume) as trade_volume
from trades 
where 
    exchange_ticker = '%s' and 
    '%s' <= time and time < '%s' 
group by instrument_ticker, time(%s) fill(0)

This produces output for plenty of instruments, but only those with at least one datapoint (i.e., trade) in the period.

My suggestion is to add a flag 'missing_option' to the fill syntax:

(Scroll right.)

 SELECT <function>(<field_key>) FROM_clause WHERE <time_range> GROUP BY time(<time_interval>[,<offset_interval>])[,<tag_key>] [fill(<fill_option>[,<missing_option>])]

(edit: Most of the above syntax was taken from documentation. The time(...) and tag_key parts seem to be wrong. I tried to fix it here, but it should probably be updated in the docs. In any case, the rightmost part is what I am suggesting.)

The 'missing_option' takes either the value 'ignore' or 'keep'. 'ignore' is the default, matching the current behavior where tag values that aren't represented by datapoints are simply ignored. 'keep', on the other hand, emits filled values for all tag values (in my case, instruments) that have ever been produced. We will handle filtering of the ones we don't care about on the client side.

Example usage:

select sum(volume) as trade_volume
from trades 
where 
    exchange_ticker = '%s' and 
    '%s' <= time and time < '%s' 
group by instrument_ticker, time(%s) fill(0, keep)

Also notice that the 'missing_option' is optional and defaults to the current behavior, thus this would be backward compatible.
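The proposed 'keep' behaviour can also be emulated client-side today by fetching the known tag values separately (e.g. via SHOW TAG VALUES) and merging them with the query result. A rough Python sketch, with illustrative function names and data shapes (not any real client API):

```python
def fill_missing_series(result_by_tag, known_tags, buckets, fill_value=0):
    """Emulate the proposed fill(<fill_option>, keep): produce one series per
    known tag value, filling every bucket that has no aggregated point."""
    filled = {}
    for tag in known_tags:
        # Start from whatever the server returned for this tag (possibly nothing).
        series = dict(result_by_tag.get(tag, {}))
        for t in buckets:
            series.setdefault(t, fill_value)
        filled[tag] = series
    return filled
```

Series the server never returned (instruments with zero trades in the window) come back fully zero-filled, which is exactly the 'keep' semantics described above.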

Anyway. Those are my two cents. Whatever is done, something should be done with this.

AndreCAndersen avatar Mar 28 '17 13:03 AndreCAndersen

I'd also like to see this fixed. @AndreCAndersen's suggestion seems like a good solution.

IvorNL avatar May 09 '17 09:05 IvorNL

I'd be happy with either of the suggestions from @beckettsean or @AndreCAndersen. 👍 I'm running into many situations where reliable results can't be calculated (easily) because of this. This seems most obvious with count() queries, where you'll never get a 0 back for a single time range query, and only get 0's back in a data series if some elements have data to count. I think if #6412 weren't locked from all the +1 silliness, it would show a lot of support. But allowing fill() to always fill values would fix things and do what (I think) most people expect.

Codelica avatar Jun 09 '17 17:06 Codelica

What about this fix? #996 Was it broken?

dev-e avatar Apr 13 '18 11:04 dev-e

I haven't tested it, but seeing as it is a fix from 2014 and we're talking about an issue still reproducible in 2017, I'm sure it was not fixed. From our Puppet file it seems our version is at least v1.0.0, while the reported fix affected v0.8.7.

AndreCAndersen avatar Apr 17 '18 18:04 AndreCAndersen

I agree with @AndreCAndersen's solution, with an additional possible value for the fill() missing_option argument.

I propose something like this:

fill(<fill_option>[,<missing_option>])

Where <missing_option> is one of:

  • 'all' - fills values for all tags that ever existed (as described by @AndreCAndersen above)
  • time.Duration - similar to 'all' but only includes tags that have been created within some time interval. So if you passed 24h here, it would only fill values for tags that existed within the last 24 hours.

This gives you complete control over fill behavior. If you want the current behavior that only fills tags that exist in the current window, you can omit the missing_option argument. If you want to fill values for all tags that ever existed (which seems like what people are asking for) pass the 'all' option. If you want to control which tags are included based on the last time they have a value (to prevent unused tags from being filled) pass a time.Duration to set a bounds for which tags to include.
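The duration variant above amounts to a last-seen filter over tag values. A minimal Python sketch of that selection logic (names and data shapes are illustrative, not a real API):

```python
from datetime import datetime, timedelta, timezone

def tags_to_fill(last_seen, now, max_age=None):
    """Decide which tag values receive filled buckets.
    max_age=None mirrors the proposed 'all' option; a timedelta mirrors
    the duration form, keeping only tags seen within that window."""
    if max_age is None:
        return sorted(last_seen)
    return sorted(tag for tag, seen in last_seen.items() if now - seen <= max_age)
```

With max_age set to, say, 24 hours, a host that last reported a year ago would be excluded from filling, addressing the dead-series concern raised earlier in the thread.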

narciero avatar Aug 23 '18 21:08 narciero

I am facing the same issue. Is there any update on this?

Avtarsingh127 avatar Dec 19 '18 18:12 Avtarsingh127

I'm very sorry to break GitHub etiquette by writing a 'me too' reply, but this is the case. Is there any chance of seeing any of the proposals discussed here implemented?

s17t avatar Mar 06 '19 06:03 s17t

Can we get this fixed? It's breaking math formulas if the sum is "nothing"/"null"

+1

jeffhoang avatar Mar 14 '19 02:03 jeffhoang

Please provide a fix, dear dev team. It is really breaking calculations and there is no workaround yet. Thanks in advance. +1

seb-koch avatar Mar 15 '19 18:03 seb-koch

+1 for this too

sotona- avatar May 30 '19 09:05 sotona-

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 28 '19 10:08 stale[bot]

Mr. Stale Bot, do not reap this issue. If you do, I will come for you.

Codelica avatar Aug 28 '19 15:08 Codelica

The solution suggested by @AndreCAndersen and @narciero should be the default. When using InfluxDB with some tools that don't allow editing of the data, this feature would come really handy.

icosac avatar Oct 14 '19 13:10 icosac

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 12 '20 13:01 stale[bot]

Mr. Stale Bot was warned. No reason to close this out. There isn't much to talk about, it just needs a fix.

Codelica avatar Jan 12 '20 16:01 Codelica

Hi, you can also hit this problem when you want to show a column that has no data, for example when host 10.10.10.41 is down:


> show tag values with key = url
name: ping
key value
--- -----
url 10.10.10.41
url www.google.com

> select * from "ping" WHERE time >= now() - 2h and url='10.10.10.41'
name: ping
time                 average_response_ms errors host     maximum_response_ms minimum_response_ms packets_received packets_transmitted percent_packet_loss percent_reply_loss reply_received result_code url
----                 ------------------- ------ ----     ------------------- ------------------- ---------------- ------------------- ------------------- ------------------ -------------- ----------- ---
2020-02-28T09:02:18Z                            host                                         0                1                   100                 100                0              0           10.10.10.41
2020-02-28T09:02:24Z                            host                                         0                1                   100                 100                0              0           10.10.10.41
2020-02-28T09:02:34Z                            host                                         0                1                   100                 100                0              0           10.10.10.41
2020-02-28T09:02:44Z                            host                                         0                1                   100                 100                0              0           10.10.10.41
2020-02-28T09:02:54Z                            host                                         0                1                   100                 100                0              0           10.10.10.41
2020-02-28T09:03:04Z                            host                                         0                1                   100                 100                0              0           10.10.10.41
2020-02-28T09:03:14Z                            host                                         0                1                   100                 100                0              0           10.10.10.41
2020-02-28T09:03:24Z                            host                                         0                1                   100                 100                0              0           10.10.10.41
2020-02-28T09:03:31Z                     100    host                                                                                                                                    2           10.10.10.41
>
> SELECT last("average_response_ms") FROM "ping" WHERE time >= now() - 2h GROUP BY time(30m), url fill(null)
name: ping
tags: url=www.google.com
time                 last
----                 ----
2020-02-28T08:30:00Z
2020-02-28T09:00:00Z 138
2020-02-28T09:30:00Z
2020-02-28T10:00:00Z
2020-02-28T10:30:00Z
>

I use the "ping" input on Telegraf; 10.10.10.41 appears in the tag values and has measurements (but not in all fields), and so it does not appear in the result at the end.

depfryer avatar Feb 28 '20 11:02 depfryer

This is still an urgent problem that prevents the creation of alerts in Grafana based on lack of data.

M0rdecay avatar Jun 03 '20 12:06 M0rdecay