datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

[EPIC] A collection of Date/Time related open issues

Open waitingkuo opened this issue 3 years ago • 7 comments

This is followed on the Proposal for Date/Time enhancement #3100 Please edit the list or comment bellow if you find something not listed here.

Date/Time/Timestamp Types & Casting

  • [x] #2883 - This is the initial work for TIME literal (i.e. make SELECT TIME '00:00:00'; work)
  • [ ] apache/arrow-rs#2074 - This is to improve the performance of casting timestamp between different unit
  • [ ] #193 - This depends on #2883
  • [x] #3082

Timestamp With Time Zone Date Types & Casting

  • [ ] apache/arrow-rs#597
  • [ ] apache/arrow-rs#599
  • [ ] apache/arrow-rs#168
  • [ ] #153
  • [ ] #3080
  • [ ] #959
  • [ ] #3149
  • [x] sqlparser-rs/sqlparser-rs#588
  • [x] sqlparser-rs/sqlparser-rs#464
  • [ ]

Arithmetic (including intervals)

  • [x] #200
  • [ ] #194
  • [x] #3103
  • [ ] apache/arrow-rs#593
  • [ ] apache/arrow-rs#527
  • [x] apache/arrow-rs#187
  • [ ] apache/arrow-rs#2457 - arrow-rs
  • [ ] #3157 - datafusion
  • [x] #3166
  • [x] #3180
  • [ ] apache/arrow-rs#1065
  • [ ] #3204

Functions

  • [ ] #3096
  • [x] #3057
  • [ ] #2979
  • [ ] #1992
  • [ ] #765
  • [ ] #686
  • [ ] apache/arrow-rs#1380
  • [ ] #765

parquet/csv/json

  • [x] apache/arrow-rs#982
  • [x] apache/arrow-rs#1191
  • [ ] apache/arrow-rs#1060
  • [ ] apache/arrow-rs#455
  • [ ] apache/arrow-rs#38
  • [ ] #2044
  • [ ] #3176

others

  • [ ] apache/arrow-rs#902
  • [x] #3255
  • [x] sqlparser-rs/sqlparser-rs#303

waitingkuo avatar Aug 15 '22 15:08 waitingkuo

@alamb @avantgardnerio i created a separated ticket for tracking timestamp related open issues (from both datafusion and arrow-rs)

waitingkuo avatar Aug 15 '22 15:08 waitingkuo

This is amazing @waitingkuo -- thank you

alamb avatar Aug 15 '22 16:08 alamb

@waitingkuo you can @ me to review your pr about the times. I care about this part

liukun4515 avatar Aug 17 '22 11:08 liukun4515

Proposal:

We attack this in multiple steps parts:

  1. Add timestamp but not timestamp with timezone
  2. Hard code the server "local" time zone to always be UTC, such that show time zone always returns UTC and set time zone 'UTC' is the only set command that will not produce an error
  3. Add timestamp with timezone, but it will always be UTC due to the above
  4. Add support for set time zone 'UTC+8'
  5. Add support for set timezone 'US/Mountain

I think the above plan will allow us to tackle the problem in stages, without ever technically breaking ANSI compliance and reaching full compliance at the end state. Caveat: I'm not sure if we will ever reach the end state, and I'm okay with that personally.

avantgardnerio avatar Aug 24 '22 18:08 avantgardnerio

Thank you @avantgardnerio and @waitingkuo for driving this issue. It is awesome to see

alamb avatar Aug 31 '22 13:08 alamb

@alamb @avantgardnerio @liukun4515 i just finished the first 2 from @avantgardnerio's proposal

  1. Add timestamp but not timestamp with timezone
  2. Hard code the server "local" time zone to always be UTC, such that show time zone always returns UTC and set time zone 'UTC' is the only set command that will not produce an error

now we have

DataFusion CLI v12.0.0
❯ show time zone;
+--------------------------------+---------+
| name                           | setting |
+--------------------------------+---------+
| datafusion.execution.time_zone | UTC     |
+--------------------------------+---------+
1 row in set. Query took 0.059 seconds.

i'm just starting the work for timestamp with timezone, things become trickier now

postgresql has this https://www.postgresql.org/docs/current/functions-formatting.html

to_timestamp ( text, text ) → timestamp with time zone

Converts string to time stamp according to the given format. (See also to_timestamp(double precision) in Table 9.32.)

to_timestamp('05 Dec 2000', 'DD Mon YYYY') → 2000-12-05 00:00:00-05

and this https://www.postgresql.org/docs/current/functions-datetime.html

to_timestamp ( double precision ) → timestamp with time zone

Convert Unix epoch (seconds since 1970-01-01 00:00:00+00) to timestamp with time zone to_timestamp(1284352323) → 2010-09-13 04:32:03+00

Looks like its to_timestamp always return timestamp with time zone

In general, do you think we should align these function signatures with postgresql? There will be more things to change once we have timestamp with non-UTC timezone 💔

BTW Datafusion's now() output timestamp with time zone which is same as what postgresql has 👍

waitingkuo avatar Sep 14 '22 20:09 waitingkuo

In general, do you think we should align these function signatures with postgresql? There will be more things to change once we have timestamp with non-UTC timezone 💔

I think it would be ok to align the functions with postgres as long as there is some clear way to get the current behavior as well (e.g. the query needs to be tweaked or something)

alamb avatar Sep 15 '22 17:09 alamb

I filed https://github.com/apache/arrow-datafusion/issues/3980 and added it to this ticket -- I plan to add more date/time open issues here to help track them

alamb avatar Oct 27 '22 13:10 alamb

I think #2785 belongs in this list as well!

mjvankampen avatar Jan 10 '23 20:01 mjvankampen

added, thank you @mjvankampen

waitingkuo avatar Jan 11 '23 12:01 waitingkuo

Filed See https://github.com/apache/arrow-datafusion/issues/5753 to track work for intervals

alamb avatar Mar 27 '23 11:03 alamb

We're ramping up our usage of delta-rs (and with it arrow-datafusion) and would like to see resolved much of the weirdness with dates/times (ignoring the absolute pain Spark brings upstream). Don't want to duplicate effort, so would help with #959 (and #6876) be welcomed?

watfordkcf avatar Jul 10 '23 12:07 watfordkcf

Hi @watfordkcf -- thanks!

I believe @tustvold is in the middle of cleaning and making consistent the various timestamp arithmetic / kernels / intervals

The largest outstanding gap that I know of is correct timezone handling (basically anything other than UTC timestamps are like to hit corner cases / bugs) -- this is related to https://github.com/apache/arrow-datafusion/issues/959

In terms of https://github.com/apache/arrow-datafusion/issues/6876, that would be lovely I think.

alamb avatar Jul 10 '23 12:07 alamb

Do we have any method to convert or cast the integer type/integer expr with the timeunit to the interval?

@waitingkuo @alamb

I want to implement a app or function like this datetime + integer with timeunit, and the timeunit may be year,month,day.

I can't find the any method to convert the integer with timeunit to the interval, but in the PG which has a function make_interval() https://www.postgresql.org/docs/current/functions-datetime.html can do this.

I just can find a way to resolve the issue by using the concat method.

For example: I have a table like below, and want to convert the b with dayunit to interval

❯ \d test
+---------------+--------------+------------+-------------+-----------+-------------+
| table_catalog | table_schema | table_name | column_name | data_type | is_nullable |
+---------------+--------------+------------+-------------+-----------+-------------+
| datafusion    | public       | test       | a           | Date32    | NO          |
| datafusion    | public       | test       | b           | Int64     | NO          |
+---------------+--------------+------------+-------------+-----------+-------------+
2 rows in set. Query took 0.013 seconds.

Using the concat to get the utf8 expr which can be casted to interval.

❯   select cast(concat(b,' day') as interval), b from test;
+-------------------------------------------------------+---+
| concat(test.b,Utf8(" day"))                           | b |
+-------------------------------------------------------+---+
| 0 years 0 mons 2 days 0 hours 0 mins 0.000000000 secs | 2 |
+-------------------------------------------------------+---+

liukun4515 avatar Jul 12 '23 03:07 liukun4515

@liukun4515 there is the interval constant syntax:

DataFusion CLI v27.0.0
❯ select interval '5 days';
+-------------------------------------------------------+
| IntervalMonthDayNano("92233720368547758080")          |
+-------------------------------------------------------+
| 0 years 0 mons 5 days 0 hours 0 mins 0.000000000 secs |
+-------------------------------------------------------+
1 row in set. Query took 0.025 seconds.

Also casting strings to intervals handle units


❯ select '5 days'::interval;
+-------------------------------------------------------+
| Utf8("5 days")                                        |
+-------------------------------------------------------+
| 0 years 0 mons 5 days 0 hours 0 mins 0.000000000 secs |
+-------------------------------------------------------+
1 row in set. Query took 0.003 seconds.

Or creating a string via || (which is the same as the concat you mention above)

❯ select (5 || ' days')::interval;
+-------------------------------------------------------+
| Int64(5) || Utf8(" days")                             |
+-------------------------------------------------------+
| 0 years 0 mons 5 days 0 hours 0 mins 0.000000000 secs |
+-------------------------------------------------------+
1 row in set. Query took 0.004 seconds.

alamb avatar Jul 12 '23 10:07 alamb

@liukun4515 there is the interval constant syntax:

DataFusion CLI v27.0.0
❯ select interval '5 days';
+-------------------------------------------------------+
| IntervalMonthDayNano("92233720368547758080")          |
+-------------------------------------------------------+
| 0 years 0 mons 5 days 0 hours 0 mins 0.000000000 secs |
+-------------------------------------------------------+
1 row in set. Query took 0.025 seconds.

Also casting strings to intervals handle units


❯ select '5 days'::interval;
+-------------------------------------------------------+
| Utf8("5 days")                                        |
+-------------------------------------------------------+
| 0 years 0 mons 5 days 0 hours 0 mins 0.000000000 secs |
+-------------------------------------------------------+
1 row in set. Query took 0.003 seconds.

Or creating a string via || (which is the same as the concat you mention above)

❯ select (5 || ' days')::interval;
+-------------------------------------------------------+
| Int64(5) || Utf8(" days")                             |
+-------------------------------------------------------+
| 0 years 0 mons 5 days 0 hours 0 mins 0.000000000 secs |
+-------------------------------------------------------+
1 row in set. Query took 0.004 seconds.

5|| ' DAY' is same with concat( 5, ' day') ?

I use this method concat to complete my requirements, but the performance may be not better.

https://github.com/apache/arrow-datafusion/issues/6876#issuecomment-1631086729 also provide the thoughts about support the make_interval function

cc @watfordkcf

liukun4515 avatar Jul 13 '23 10:07 liukun4515

Wow the list of tickets fixed is quite impressive: image

alamb avatar Sep 19 '23 01:09 alamb

I have moved all the incomplete items and moved them to https://github.com/apache/arrow-datafusion/issues/8282 to help track what is remaining. Thanks everyone for the epic work here

alamb avatar Nov 20 '23 14:11 alamb