datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Using `date_bin` with a time zone in a time range that contains daylight savings does not work

Open Abdullahsab3 opened this issue 9 months ago • 3 comments

Describe the bug

When using the date_bin function, while providing a time zone to the column used in date_bin , and using a range that contains daylight savings (i.e. part of the range would be for example with an offset of +01:00, and another part with an offset of +02:00), the querying fails with the following error:

rpc error: code = InvalidArgument desc = External error: Arrow error: Cast error: Cannot cast timezone to different timezone

The query that I used is:

select 
    date_bin(interval '1 hour', time at time zone 'Europe/Brussels') 
from 
    raw_data 
where 
    time >= '2021-03-27T22:00:00.000Z' 
and 
    time <= '2021-03-29T00:00:00.000Z'
group by date_bin(interval '1 hour', time at time zone 'Europe/Brussels') 

Notice that the interval used [2021-03-27T22:00:00.000Z, 2021-03-29T00:00:00.000Z] contains daylight savings within the range, meaning that in this case part of the range will be with an offset of +01:00 (before the daylight savings), and another part will be with an offset of +02:00 (after the daylight savings)

~~I suspect the issue is that Arrow is attempting to case the times with a 2 hours offset to the times of an 1 hour offset~~

Update: The issue is the daylight savings hour. This query works fine:

select 
    date_bin(interval '1 hour', time at time zone 'Europe/Brussels') 
from 
    raw_data 
where 
    time >= '2021-03-27T22:00:00.000Z' 
and 
    time <= '2021-03-29T00:00:00.000Z'
-- exclude the daylight savings hour
and time not between '2021-03-28T02:00:00Z' and '2021-03-28T03:00:00Z'
group by date_bin(interval '1 hour', time at time zone 'Europe/Brussels') 

In the query above, we exclude the daylight savings hour of that year for that timezone.

To Reproduce

Execute the following query

select 
    date_bin(interval '1 hour', time at time zone 'Europe/Brussels') 
from 
    raw_data 
where 
    time >= '2021-03-27T22:00:00.000Z' 
and 
    time <= '2021-03-29T00:00:00.000Z'
group by date_bin(interval '1 hour', time at time zone 'Europe/Brussels') 

Expected behavior

I should get the data aggregated by time. Part of the interval will be with a 2 hours offset, another part will be with a 1 hour offset

Additional context

raw_data is a table that contains values for each 15 minutes. The table with its data is stored in influxDB 3.0 that uses the FDAP stack

No response

Abdullahsab3 avatar Apr 30 '24 08:04 Abdullahsab3

I think the wider image of this issue is the fact that the hour at which the daylight savings occur will not be able to get transformed to the desired timezone. For example:

select '2019-03-31T02:00:00Z'::timestamp at time zone 'Europe/Brussels';

This would also fail for the same reason

Abdullahsab3 avatar Apr 30 '24 12:04 Abdullahsab3

I remember the daylight savings time is quite tricky because it has ambiguous possible time.

See https://github.com/apache/datafusion/issues/8899#issuecomment-1900598499

jayzhan211 avatar May 02 '24 09:05 jayzhan211

I am testing some things out with the at time zone operator to understand it better. I think the UTC timestamp casting should be explictly specified first. Some context:

Daylight savings in 2019 is in March 31 at 2AM in local time. This is 2019-03-31T00:00:00 in UTC I think.

The query select '2019-03-31T02:00:00Z'::timestamp at time zone 'Europe/Brussels'; seems to be ignoring the Z at the end, and it considers the timestamp local time.

Some experiments:

select '2019-03-31T02:00:00Z'::timestamp at time zone 'Europe/Brussels';

result> Arrow error: Cast error: Cannot cast timezone to different timezone
select '2019-03-31T00:00:00'::timestamp at time zone 'UTC' at time zone 'Europe/Brussels'
result> 2019-03-31T01:00:00+01:00
select '2019-03-31T01:00:00'::timestamp at time zone 'UTC' at time zone 'Europe/Brussels'
result> 2019-03-31T03:00:00+02:00

It seems to me that, regardless of the Z provided in the query, the timestamp will considered as a naive date time, and the additional timezone information that is provided in the query will be considered the source of truth.

Abdullahsab3 avatar May 02 '24 09:05 Abdullahsab3