datafusion
datafusion copied to clipboard
Using `date_bin` with a time zone in a time range that contains daylight savings does not work
Describe the bug
When using the date_bin
function, while providing a time zone to the column used in date_bin
, and using a range that contains daylight savings (i.e. part of the range would be for example with an offset of +01:00
, and another part with an offset of +02:00
), the querying fails with the following error:
rpc error: code = InvalidArgument desc = External error: Arrow error: Cast error: Cannot cast timezone to different timezone
The query that I used is:
select
date_bin(interval '1 hour', time at time zone 'Europe/Brussels')
from
raw_data
where
time >= '2021-03-27T22:00:00.000Z'
and
time <= '2021-03-29T00:00:00.000Z'
group by date_bin(interval '1 hour', time at time zone 'Europe/Brussels')
Notice that the interval used [2021-03-27T22:00:00.000Z, 2021-03-29T00:00:00.000Z]
contains daylight savings within the range, meaning that in this case part of the range will be with an offset of +01:00
(before the daylight savings), and another part will be with an offset of +02:00
(after the daylight savings)
~~I suspect the issue is that Arrow is attempting to case the times with a 2 hours offset to the times of an 1 hour offset~~
Update: The issue is the daylight savings hour. This query works fine:
select
date_bin(interval '1 hour', time at time zone 'Europe/Brussels')
from
raw_data
where
time >= '2021-03-27T22:00:00.000Z'
and
time <= '2021-03-29T00:00:00.000Z'
-- exclude the daylight savings hour
and time not between '2021-03-28T02:00:00Z' and '2021-03-28T03:00:00Z'
group by date_bin(interval '1 hour', time at time zone 'Europe/Brussels')
In the query above, we exclude the daylight savings hour of that year for that timezone.
To Reproduce
Execute the following query
select
date_bin(interval '1 hour', time at time zone 'Europe/Brussels')
from
raw_data
where
time >= '2021-03-27T22:00:00.000Z'
and
time <= '2021-03-29T00:00:00.000Z'
group by date_bin(interval '1 hour', time at time zone 'Europe/Brussels')
Expected behavior
I should get the data aggregated by time. Part of the interval will be with a 2 hours offset, another part will be with a 1 hour offset
Additional context
raw_data
is a table that contains values for each 15 minutes. The table with its data is stored in influxDB 3.0 that uses the FDAP stack
No response
I think the wider image of this issue is the fact that the hour at which the daylight savings occur will not be able to get transformed to the desired timezone. For example:
select '2019-03-31T02:00:00Z'::timestamp at time zone 'Europe/Brussels';
This would also fail for the same reason
I remember the daylight savings time is quite tricky because it has ambiguous possible time.
See https://github.com/apache/datafusion/issues/8899#issuecomment-1900598499
I am testing some things out with the at time zone
operator to understand it better. I think the UTC timestamp casting should be explictly specified first. Some context:
Daylight savings in 2019 is in March 31 at 2AM in local time.
This is 2019-03-31T00:00:00
in UTC I think.
The query select '2019-03-31T02:00:00Z'::timestamp at time zone 'Europe/Brussels';
seems to be ignoring the Z
at the end, and it considers the timestamp local time.
Some experiments:
select '2019-03-31T02:00:00Z'::timestamp at time zone 'Europe/Brussels';
result> Arrow error: Cast error: Cannot cast timezone to different timezone
select '2019-03-31T00:00:00'::timestamp at time zone 'UTC' at time zone 'Europe/Brussels'
result> 2019-03-31T01:00:00+01:00
select '2019-03-31T01:00:00'::timestamp at time zone 'UTC' at time zone 'Europe/Brussels'
result> 2019-03-31T03:00:00+02:00
It seems to me that, regardless of the Z
provided in the query, the timestamp will considered as a naive date time, and the additional timezone information that is provided in the query will be considered the source of truth.