influxdb
influxdb copied to clipboard
`range()` does not filter correctly depending on year
Steps to reproduce: List the minimal actions needed to reproduce the behaviour.
I have a dataset where each point is years apart, the bucket was created in Version 2.0.7 with a shard group duration of 1000w. I am now on version 2.7.5
- The following query works as expected:
rx = /(Apfel)$/
from(bucket: "common_two_grams")
|> range(start: 1971-01-01, stop: 2020-01-01)
|> filter(fn: (r) => r["_measurement"] == "frequency")
|> filter(fn: (r) => r["_field"] == "match_count")
|> filter(fn: (r) => r["w1"] =~ rx or r["w2"] =~ rx)
|> group(columns: ["_time"])
|> sum()
|> group()
- The following query does not work as expected, notice that only the start year changed from 1971 to 1970.
rx = /(Apfel)$/
from(bucket: "common_two_grams")
|> range(start: 1970-01-01, stop: 2020-01-01)
|> filter(fn: (r) => r["_measurement"] == "frequency")
|> filter(fn: (r) => r["_field"] == "match_count")
|> filter(fn: (r) => r["w1"] =~ rx or r["w2"] =~ rx)
|> group(columns: ["_time"])
|> sum()
|> group()
Expected behaviour: This is the correct data I am expecting for the first query:
#group,false,false,false,false
#datatype,string,long,dateTime:RFC3339,unsignedLong
#default,_result,,,
,result,table,_time,_value
,,0,1971-11-29T23:00:00Z,9940
,,0,1972-11-29T23:00:00Z,9080
[....]
,,0,2018-11-29T23:00:00Z,69394
Actual behaviour: The second query does not filter the time correctly, as the complete time series is now in the dataset. Notice how it begins with 1725
#group,false,false,false,false
#datatype,string,long,dateTime:RFC3339,unsignedLong
#default,_result,,,
,result,table,_time,_value
,,0,1725-11-29T23:06:32Z,42
,,0,1726-11-29T23:06:32Z,28
,,0,1727-11-29T23:06:32Z,71
,,0,1728-11-29T23:06:32Z,49
,,0,1729-11-29T23:06:32Z,81
[...]
,,0,2015-11-29T23:00:00Z,87953
,,0,2016-11-29T23:00:00Z,90141
,,0,2017-11-29T23:00:00Z,74672
,,0,2018-11-29T23:00:00Z,69394
When adding aggregateWindow
into the query, the range seems to work again:
rx = /(Apfel)$/
from(bucket: "common_two_grams")
|> range(start: 1970-01-01, stop: 2020-01-01)
|> filter(fn: (r) => r["_measurement"] == "frequency")
|> filter(fn: (r) => r["_field"] == "match_count")
|> filter(fn: (r) => r["w1"] =~ rx or r["w2"] =~ rx)
|> aggregateWindow(every: 1y, fn: last)
|> group(columns: ["_time"])
|> sum()
|> group()
Environment info:
Running inside a docker container with version 2.7.5
I can confirm the behavior in 2.7. I believe it is related to that unix timestamp of zero is 1970-01-01T00:00:00Z. Timestamps before that are negative and it appears that is not handled correctly in the flux query (or execution).
If it is an issue with flux, you might be able to work around it with influxql. (I checked - influxql does better.)
We would welcome a pr fixing the issue if you are willing.
I just checked if its an issue with flux by modifying the range_test.flux
and it does not seem to be the case.
package universe_test
import "testing"
import "csv"
option now = () => 2030-01-01T00:00:00Z
inData =
"
#datatype,string,long,dateTime:RFC3339,double,string,string,string,string
#group,false,false,false,false,true,true,true,true
#default,_result,,,,,,,
,result,table,_time,_value,_field,_measurement,cpu,host
,,0,1701-01-01T00:00:01Z,0,usage_guest,cpu,cpu-total,host.local
,,0,1969-01-01T00:00:01Z,0,usage_guest,cpu,cpu-total,host.local
,,0,1970-01-01T00:00:01Z,0,usage_guest,cpu,cpu-total,host.local
,,0,1971-01-01T00:00:01Z,0,usage_guest,cpu,cpu-total,host.local
"
outData =
"
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,double,string,string,string,string
#group,false,false,true,true,false,false,true,true,true,true
#default,_result,,,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement,cpu,host
,,0,1970-01-01T00:00:00Z,2030-01-01T00:00:00Z,1970-01-01T00:00:01Z,0,usage_guest,cpu,cpu-total,host.local
,,0,1970-01-01T00:00:00Z,2030-01-01T00:00:00Z,1971-01-01T00:00:01Z,0,usage_guest,cpu,cpu-total,host.local
"
testcase range {
got =
csv.from(csv: inData)
|> testing.load()
|> range(start: 1970-01-01)
want = csv.from(csv: outData)
testing.diff(got, want)
}
Hi @philjb, as far as I understand, flux is doing the filtering, I found the go code for that in the flux repository. Looking at the test from my last comment tells me maybe there is something wrong in the database and not the query language.
Can you give me a general direction where to search for the possible issue in the influxdb repository?
Best regards Paul
I'd recommend reducing your flux query to the smallest query that shows the behavior - for example, is the aggregate window essential to reproducing it? Then I'd look at the implementations of the storage interface type Store interface
. All the requests include a Range *TimestampRange
which stores the query time range. I suspect there is special handling (that is "wrong") when the range start time is zero.
Quick update, I just used the FluxQL to select data between 1820 and 1830 and this works fine.
Hi @philjb, I think I found the issue. My only problem now is that Im unable to build the project because of dependencies from the flux repository. Mainly because the https://github.com/SAP/go-hdb v.0 does not seem to exist anymore.
I have a separate Issue already opened: https://github.com/influxdata/flux/issues/5462. Sadly I cant verify if my fix is correct.
I could verify the issue by running the binary on my local machine. I'm not sure as to why this check was done, but it seems to be the cause of the issue.
@paulheg - were you able to build a binary to test with locally? I can't tell if you were able to build based on your last comment.
Can you add a test for the scenario to your pr?
@philjb I was able to build it locally, run it, and verify.
Just found a work around:
GOPROXY=https://proxy.golang.org/cached-only go mod download github.com/SAP/[email protected]
I'll have a look, but I can't think of a useful test for this scenario right now. To check that the value I passed does not get modified?
Can you add a test for the scenario to your pr?
There is now a test in the pr.