datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

[EPIC] Proposal for Date/Time enhancement

Open waitingkuo opened this issue 2 years ago • 1 comments

!!! Please correct me if i'm wrong !!!

Intro

Design Principle

  1. align with postgresql in most cases
  2. use nanoseconds for the default timestamp resolution (unlike postgresql and pyspark's microseconds)
  3. use utc+0 as default timezone for timestamp with time zone

Let's Begin with Postgresql's Date/Time

image

Let's Start to Compare


Timestamp

Postgresql

  • uses 8 bytes for both timestamp and timestamp with time zone. (note that time zone is included in these 8 bytes)
  • uses microsecond as resolution, which is the number of microseconds from 1970-01-01T00:00:00
  • has 7 kinds of floating point precision, from timestamp(0) to timestamp(6)
    • timestamp(0) rounds to seconds
    • timestamp(3) rounds to milliseconds
    • timestamp(6) rounds to microseconds
  • timestamp 'xxx' output timestamp
  1. if xxx does't contain time zone info, it just works as what you think
willy=# select timestamp '2000-01-01T00:00:00';
      timestamp      
---------------------
 2000-01-01 00:00:00
(1 row)
  1. if xxx contains time zone info, time zone is just ignored. (i believe that this is a surprise for some people) e.g.
willy=# select timestamp '2000-01-01T00:00:00+08:00';
      timestamp      
---------------------
 2000-01-01 00:00:00
(1 row)
  • timestamp with time zone 'xxx' output timestamp with time 1 if xxx contains no time zone, it assume it's local time
willy=# select timestamp with time zone '2000-01-01T00:00:00';
      timestamptz       
------------------------
 2000-01-01 00:00:00+08
(1 row)

2 if xxx contains time zone, it'll be converted to your local time zone

willy=# select timestamp with time zone '2000-01-01T00:00:00+02:00';
      timestamptz       
------------------------
 2000-01-01 06:00:00+08
(1 row)

Datafusion

  • Timestamp(TimeUnit, Option<String>)
    • we have
      • TimeUnit::Second
      • TimeUnit::MilliSecond
      • TImeUnit::MicroSecond
      • TimeUnit::NanoSecond
    • which store number of seconds/millis/micros/nanos from 1970-01-01T00:00:00
    • most of the timestamp related functions output Timestamp(TimeUnit::NanoSecond, None)
  • We only have timestamp literal but no timestamp with time zone
  • timestamp xxx outputs Timestamp(TimeUnit::NanoSecond, None)
  1. if xxx contains no time zone, it automatically applies local time, parse it, convert it to utc time zone, and then drop the time zone #3080
❯ select cast('2000-01-01T00:00:00' as timestamp);
+------------------------------------------------------------------+
| CAST(Utf8("2000-01-01T00:00:00") AS Timestamp(Nanosecond, None)) |
+------------------------------------------------------------------+
| 1999-12-31 16:00:00                                              |
+------------------------------------------------------------------+
1 row in set. Query took 0.004 seconds.
  1. if xxx contains time zone, it's parsed correctly, then converted to utc time zone, and then drop the time zone
❯ select timestamp '2000-01-01T00:00:00+02:00';
+------------------------------------------------------------------------+
| CAST(Utf8("2000-01-01T00:00:00+02:00") AS Timestamp(Nanosecond, None)) |
+------------------------------------------------------------------------+
| 1999-12-31 22:00:00                                                    |
+------------------------------------------------------------------------+
1 row in set. Query took 0.000 seconds.

Proposal

  • make timestamp xxx work like postgresql does
  • add timestamp with time zone, i believe there're lots of works and discussions to do: apache/arrow-rs#1936 apache/arrow-rs#1380 apache/arrow-rs#597
  • make the local time zone as UTC by default (timestamp is hard, i think this could prevent some ambiguities)
  • add set time zone to xxx to change the local time zone

Date

postgresql image

Posgresql

  • supported these formats image

Datafusion

  • among all the format above, we only support the first 1999-01-08

Proposal

  • I don't think there're any issues for Date
  • We could consider add another 2 ISO 8601 formats (i.e. 19990108 and 990108) Chrono strictly follows ISO 8601. I think supporting all 8601 date formats makes sense.

Time

Postgresql

  • time xxx output time that requires 8 bytes
  • time xxx with time zone that requires 12 bytes, I have no idea why we need 4 more bytes here since timestamp with time zone only requires 8 bytes

Datafusion

  • We don't have time literal for now, let's wait for #3010

Proposal

  • I personally never used time with time zone. I have no clue when we need it. arrow-rs's time datatype contains no timezone. Perhaps we need not to implement this.
  • let's wait for #3010

Interval

Postgresql

  • requires 16 bytes
  • resolution as microseconds
  • the outputs for different operators & inputs image image

Datafusion

reference: https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs#L237

  • Interval(IntervalUnit)
  • we have following units
    • IntervalUnit::YearMonth
      • number of months
      • stored as 32-bit integer
    • IntervalUnit::DayTime
      • stored as 2 contiguous 32-bit integers (days, millisseconds), 8 bytes in total
    • IntervalUnit::MonthDayNano
    • a triple of the number of (months, days, nanoseconds)
      • month is stored as 32-bit integers
      • day is stored as 32-bit integers
      • nanosecond is stored as 64 bit integers
      • 16 bytes in total
  • interval xxx output Interval(DayTime)
❯ select interval '1 hour 1 second';
+------------------------------------------------+
| IntervalDayTime("3601000")                     |
+------------------------------------------------+
| 0 years 0 mons 0 days 1 hours 0 mins 1.00 secs |
+------------------------------------------------+
1 row in set. Query took 0.000 seconds.
  • interval xxx support floating number seconds
❯ select interval '0.1 second';
+-------------------------------------------------+
| IntervalDayTime("100")                          |
+-------------------------------------------------+
| 0 years 0 mons 0 days 0 hours 0 mins 0.100 secs |
+-------------------------------------------------+
1 row in set. Query took 0.000 seconds.
  • if it's less than microsecond, it'll truncated
❯ select interval '0.0001 second';
+------------------------------------------------+
| IntervalDayTime("0")                           |
+------------------------------------------------+
| 0 years 0 mons 0 days 0 hours 0 mins 0.00 secs |
+------------------------------------------------+
1 row in set. Query took 0.000 seconds.
  • we cannot add interval(DayTime) to Timestamp(NanoSecond, None), perhaps the reason here is the difference of resolution
❯ select timestamp '2000-01-01Z' + interval '1 day';
Plan("'Timestamp(Nanosecond, None) + Interval(DayTime)' can't be evaluated because there isn't a common type to coerce the types to")
  • we can add interval(DayTime) to Date
❯ select DATE '2000-01-01' + INTERVAL '1 day';
+--------------------------------------------------------------------+
| CAST(Utf8("2000-01-01") AS Date32) + IntervalDayTime("4294967296") |
+--------------------------------------------------------------------+
| 2000-01-01                                                         |
+--------------------------------------------------------------------+
1 row in set. Query took 0.000 seconds.
  • it breaks while we have hour (or other smaller units) interval #3093
❯ select DATE '2000-01-01' + INTERVAL '1 hour';
+-----------------------------------------------------------------+
| CAST(Utf8("2000-01-01") AS Date32) + IntervalDayTime("3600000") |
+-----------------------------------------------------------------+
| +11856-06-19                                                    |
+-----------------------------------------------------------------+
1 row in set. Query took 0.000 seconds.
  • We don't have Time now, let's wait for #3010

Proposal

  • Consider make INTERVAL xxx outputs Interval(MonthDayNano) instead of Interval(DayTime) as it's easier to align with our Timestamp(NanoSecond, None)
  • Carefully design the outputs for operators like what postgresql has
  • we could think about whether we really need timestamp with time zone - timestamp ... this is what postgresql has
willy=# select timestamp with time zone '2000-01-01T00:00:00Z' - timestamp '2000-01-01T00:00:00';
 ?column? 
----------
 08:00:00
(1 row)

waitingkuo avatar Aug 10 '22 21:08 waitingkuo

Thanks @waitingkuo -- I plan to review this proposal more carefully later today or tomorrow. cc @avantgardnerio and @andygrove and @ovr who I think have been thinking about /working on timestamp related things as well

alamb avatar Aug 11 '22 10:08 alamb

(I will respond individually to the parts of the proposal)

Timestamp Proposal

  1. make timestamp xxx work like postgresql does
  2. add timestamp with time zone, i believe there're lots of works and discussions to do:...
  3. make the local time zone as UTC by default
  4. add set time zone to xxx to change the local time zone

I think this sounds like a great idea 👍

alamb avatar Aug 11 '22 21:08 alamb

Date Proposal

  1. I don't think there're any issues for Date
  2. We could consider add another 2 ISO 8601 formats (i.e. 19990108 and 990108) Chrono strictly follows ISO 8601. I think supporting all 8601 date formats makes sense.

I agree

alamb avatar Aug 11 '22 21:08 alamb

Time Proposal

  1. I personally never used time with time zone. I have no clue when we need it. arrow-rs's time datatype contains no timezone. Perhaps we need not to implement this.
  2. let's wait for https://github.com/apache/arrow-datafusion/pull/3010

I agree -- until there is a compelling reason for time (rather than timestamp) it seems reasonable to postpone additional work on this

alamb avatar Aug 11 '22 21:08 alamb

Interval Proposal

  1. Consider make INTERVAL xxx outputs Interval(MonthDayNano) instead of Interval(DayTime) as it's easier to align with our Timestamp(NanoSecond, None)

Here is the spec in case that is interesting: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L354-L372

I think this is a good idea -- Interval(MonthDayNano) is a relatively new addition to the arrow standard, so I think DataFusion might default to Interval(DayTime) is because that was all that was available at the time of initial implementation.

  1. Carefully design the outputs for operators like what postgresql has

Agree

  1. we could think about whether we really need timestamp with time zone - timestamp ... this is what postgresql has

I don't quite understand the question about "really need timestamp with time zone - timestamp" -- as I thought your proposal for timezones was to fill out the feature set for timestamp with timezone 🤔

alamb avatar Aug 11 '22 21:08 alamb

  1. we could think about whether we really need timestamp with time zone - timestamp ... this is what postgresql has

I don't quite understand the question about "really need timestamp with time zone - timestamp" -- as I thought your proposal for timezones was to fill out the feature set for timestamp with timezone 🤔

i originally thought that comparing timestamp with time zone and timestamp without time zone is ambiguous. it turns out that postgresql simply drop the time zone and keep the native timestamp. this is quite straightforward for our implementation as well (i.e. Drop the time zone from Timestamp(TimeUnit, Optinal(Timezone)). i'll update my proposal for this

waitingkuo avatar Aug 12 '22 02:08 waitingkuo

@waitingkuo what do you think about making a list on this ticket of all the various timestamp / time related open items? I think there are a non trivial number of them, such as https://github.com/apache/arrow-datafusion/issues/194, https://github.com/apache/arrow-datafusion/issues/3103 and several others? Alternately, we can make another ticket that collects the subtasks

alamb avatar Aug 12 '22 10:08 alamb

perhaps the reason here is the difference of resolution

There is no reason, it just isn't implemented. I implemented add for Date32 & Date64, but not Timestamp yet.

avantgardnerio avatar Aug 12 '22 14:08 avantgardnerio

it breaks while we have hour

I believe a Date has the resolution of 24 hours, so adding 1 or 23 hours should have no effect?

avantgardnerio avatar Aug 12 '22 14:08 avantgardnerio

INTERVAL xxx outputs Interval(MonthDayNano)

I'm wary of this because it changes Interval from being a true Duration (fixed unit of wall clock time, e.g. milliseconds elapsed) into a Gregorian calendar unit (due to supporting Month).

avantgardnerio avatar Aug 12 '22 15:08 avantgardnerio

2. add another 2 ISO 8601 formats

I personally lean towards only supporting one, especially if Chrono only supports one. https://xkcd.com/1179/ https://xkcd.com/927/

avantgardnerio avatar Aug 12 '22 15:08 avantgardnerio

@waitingkuo I think 95% of what you propose is very sensible. The other 5% I don't think is incorrect, but does raise cause for consideration.

Some random thoughts:

  1. I generally think we should try to stick to parity with postgres, as it is ubiquitous & mature, makes integration testing easier, and arbitrates disputes about desired behavior
  2. counterpoint to # 1: arrow is the foundation for datafusion, and as such I think we should go with it over postgres when pragmatic / sensible to do so
  3. when possible I think we should also delegate date / time things to chrono, as these things are very difficult, and it would be nice to not have to maintain a date/time library as well as a parser, query engine, etc
  4. as we push towards higher performance, I think # 3 will become increasingly less possible (unless we want to merge GPU/SIMD support into chrono)
  5. We should be clear that there are two types of timezone: Pacific Time (a legal entity) and PST/PDT (UTC+8/7 hours). It appears postgres only supports the later? Which is kind of strange because it limits the purpose of timezones IMO - having a meeting at a recurring time throughout the year requires application logic to switch between PST & PDT when appropriate.
  6. The above legal manifestations add a lot of complexity. On linux chrono pulls its database from tzinfo on linux, and I'm not sure if there's an equivalent on windows. Supporting those means having an always-updated database outside of our codebase. (so I hope to either defer to chrono, or not do it).
  7. Its important to remember that a timezone (of any kind) is not related to time, but rather location. It's where something is happening, not when.
  8. Due to all of the above I've found it useful in OLTP systems to keep all timestamps as UTC and convert to the user locale when displaying. That being said, datafusion is designed for OLAP, so I can see uses where the user has a SQL console open and there is no app layer to convert, which makes a good case for supporting timezones in the query engine, but it comes with a high cost of implementation, so we should be very considerate before adding each layer of complexity (UTC offsets, calender operations, timezones).

avantgardnerio avatar Aug 12 '22 15:08 avantgardnerio

@waitingkuo what do you think about making a list on this ticket of all the various timestamp / time related open items? I think there are a non trivial number of them, such as #194, #3103 and several others? Alternately, we can make another ticket that collects the subtasks

@alamb sure! i'll do it

waitingkuo avatar Aug 12 '22 18:08 waitingkuo

it breaks while we have hour

I believe a Date has the resolution of 24 hours, so adding 1 or 23 hours should have no effect?

this issue is fixed, it works now

waitingkuo avatar Aug 12 '22 18:08 waitingkuo

  1. add another 2 ISO 8601 formats

I personally lean towards only supporting one, especially if Chrono only supports one. https://xkcd.com/1179/ https://xkcd.com/927/

agree, the current one is the most widely used

waitingkuo avatar Aug 12 '22 18:08 waitingkuo

Due to all of the above I've found it useful in OLTP systems to keep all timestamps as UTC and convert to the user locale when displaying.

I think this is best practice for all new OLTP and OLAP systems (always store data as UTC and then render into user locale as desired).

I think what makes it tougher in OLAP system is that the data being processed is often created outside of Datafusion (aka in parquet or CSV files from some other system) which can and unforuntaely do write dates / times / timestamps other than UTC

In general, I would be a big fan of trying if possible of having datafusion normlize all data on input to UTC prior to processing but I worry this might not be a good idea for performance reasons.

alamb avatar Aug 15 '22 12:08 alamb

In general, I would be a big fan of trying if possible of having datafusion normlize all data on input to UTC prior to processing but I worry this might not be a good idea for performance reasons.

Postgrseql deal with this in the similar way https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-TIMEZONES

All timezone-aware dates and times are stored internally in UTC. They are converted to local time in the zone specified by the TimeZone configuration parameter before being displayed to the client.

e.g.

willy=# create table test02 (c1 timestamptz);
CREATE TABLE
willy=# insert into test02 (c1) values ('2000-01-01T00:00:00+08:00');
INSERT 0 1
willy=# insert into test02 (c1) values ('2000-01-01T00:00:00+07:00');
INSERT 0 1
willy=# insert into test02 (c1) values ('2000-01-01T23:00:00+07:00');
INSERT 0 1
willy=# select * from test02;
           c1           
------------------------
 2000-01-01 00:00:00+08
 2000-01-01 01:00:00+08
 2000-01-02 00:00:00+08
(3 rows)

But it does use local time while we try to extract the day e.g.

willy=# select extract(day from c1), COUNT(*) from test02 group by extract(day from c1);
 extract | count 
---------+-------
       1 |     2
       2 |     1
(2 rows)

waitingkuo avatar Aug 15 '22 14:08 waitingkuo

When we are done with the discussion, perhaps we can close this issue and use https://github.com/apache/arrow-datafusion/issues/3148 as the coordination point for development

alamb avatar Aug 15 '22 18:08 alamb

When we are done with the discussion, perhaps we can close this issue and use #3148 as the coordination point for development

i think we could close it for now and move the discussion to #3148 and submit new ticket to discuss some big topics (e.g. timezone)

waitingkuo avatar Aug 15 '22 20:08 waitingkuo