datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

to_timestamp timeunit to be consistent with postgresql's

Open waitingkuo opened this issue 3 years ago • 9 comments

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Current to_timestamp assume the input number is nanoseconds from 1970

❯ select to_timestamp(10);
+-------------------------------+
| totimestamp(Int64(10))        |
+-------------------------------+
| 1970-01-01 00:00:00.000000010 |
+-------------------------------+
1 row in set. Query took 0.000 seconds.

while postgresql's is to assume the number is seonds from 1970

willy=# select to_timestamp(10) at time zone 'utc';
      timezone       
---------------------
 1970-01-01 00:00:10
(1 row)

Should we align with postgrseql's?

Describe the solution you'd like

  1. rename the original to_timestamp as to_timestamp_nanoseconds
  2. we do have a to_timestamp_seconds for now. we could either rename it to to_timestamp or make a new to_timestamp as to_timestamp_seconds's alias

waitingkuo avatar Jul 28 '22 10:07 waitingkuo

Hi @waitingkuo -- I think given the desire to make datafusion "A Declarative SQL query interface compatible with PostgreSQL" as described on https://arrow.apache.org/datafusion/specification/roadmap.html#datafusion compatible changing the current to_timestamp to be to_timestamp_nanos() and changing to_timestamp() to be the same as the current to_timestamp_seconds would be reasonable

Here are the docs, in case anyone is interested

https://github.com/apache/arrow-datafusion/blob/2598893ac7737e92b3be7b74f606280b32264e0e/docs/source/user-guide/sql/datafusion-functions.md#to_timestamp

alamb avatar Jul 30 '22 13:07 alamb

i'm interested in this, please assign me, thanks~

waitingkuo avatar Jul 30 '22 15:07 waitingkuo

@alamb after investigating more on postgresql's, i found that we have following differences:

Differences

  1. postgresql is using microsecond for the resolution of timestamp, datafusion is using nanoseconds

This is postgresql's image

  1. postgresql to_timestamp have 2 kinds of signatures:
  • 2.1 to_timestamp(double precision) -> timestamp with time zone

https://www.postgresql.org/docs/14/functions-datetime.html

image

the input for to_timestamp in posgresql has the unit second. i.e. to_timestamp(1) means 1 second after 1970 but not 1 microconds. Current datafusion supports Int64, TimeUnit(Second), TimeUnit(MicroSecond), TimeUnit(MilliSecond), TimeUnit(NanoSeconds), and UTF-8. Our current to_timestamp(int64) deems the input as nanoseconds instead of seconds which is different from postgresql's. Postresql doesnt support to_timestamp(TimeUnit). I suggest to align to_timestamp(i64) to posgresql's; and add another to_timestamp(float64). And leave the to_timestamp(TimeUnit) behave the same as what we have now (conversion between units). to_timestamp(UTF-8) is in 2.2

image

datafusion currently supports to_timestamp(utf-8) without the 2nd argument datafusion uses chrono ) to parse the time string without specify the format (while posgresql's need us to specify the format)

  1. postgresql's to_timestamp output timestamp with timezone, while datafusion outputs TimeUnit(UNIT, None). The second argument for TimeUnit in arrow in implements timezone, just we don't use it in to_timestamp yet

Summary & My proposal

  1. keep using nanosecond as default while there's no specific timeunit set
  2. make follwings:
  • to_timestamp(i64) recognize input as second, output Timestamp(NanoSecond, None)
    • (e.g. to_timestamp(1) -> 1970-01-01:00:00:01)
  • to_timestamp_nanos(i64) recognize input as nanoseconds, output Timestamp(NanoSecond, None)
    • (e.g. to_timestamp_nanos(1) -> 1970-01-01:00:00:00.000000001)
  • to_timestamp_millis(I64) recognize input as milliseconds, output Timestamp(MilliSecond, None)
    • (e.g. to_timestamp_millis(1) -> 1970-01-01:00:00:01.000001)
  • to_timestamp_micros(I64) recognize input as microsecond, output Timestamp(MicroSecond, None)
    • (e.g. to_timestamp_micros(1) -> 1970-01-01:00:00:00.001)
  • to_timestamp_seconds(I64) recognize input as second, output Timestamp(Second, None)
    • (e.g. to_timestamp_seconds(1) -> 1970-01-01:00:00:01)
  1. support to_timestamp(float64) and `to_timestamp_xx(float64)
  2. keep using chorno's time string format for now
  3. start another thread for dicussing timezone implementation, I feel like this might break many things. And we also need to implement something like SET TIME ZONE AS xxx like what posgresql has

Please leave your comments. I'll submit another issue for tracking all of these.

waitingkuo avatar Aug 08 '22 14:08 waitingkuo

Summary & My proposal

Makes sense to me. 👍

cc @ovr @andygrove @seddonm1 @stuartcarnie (not sure if anyone else is interested in this discussion too)

start another thread for dicussing timezone implementation, I feel like this might break many things. And we also need to implement something like SET TIME ZONE AS xxx like what posgresql has

There are a variety of open issues known about timestamp handling -- for example https://github.com/apache/arrow-rs/issues/1936 https://github.com/apache/arrow-rs/issues/1380 https://github.com/apache/arrow-rs/issues/597 and others

It would be great to figure out how best to support these and start fixing it up.

alamb avatar Aug 09 '22 13:08 alamb

@alamb should we consider to_timestamp() to output timestamp with time zone?

in postgresql select timestamp '2020-01-01' output timestamp without time zone but to_timestamp() output timestamp with time zone.

waitingkuo avatar Aug 09 '22 14:08 waitingkuo

@alamb should we consider to_timestamp() to output timestamp with time zone?

I think that can be a long term goal -- I suspect making that change will be hard / impossible without improving the timezone support in general across DataFusion

alamb avatar Aug 09 '22 15:08 alamb

@alamb i could image that might breaks lots of thing there're some time zone issues in datafusion for now as well e.g. #3080

waitingkuo avatar Aug 09 '22 16:08 waitingkuo

@alamb after spend time figuring out how postgresql is working, i submitted a proposal here #3100

probably we need to deal with datatype date / time / timestamp / time zone first and then move to this timestamp related functions

waitingkuo avatar Aug 10 '22 21:08 waitingkuo

probably we need to deal with datatype date / time / timestamp / time zone first and then move to this timestamp related functions

I agree that sounds wise

alamb avatar Aug 11 '22 10:08 alamb

@alamb it seems like now is a great time to align time-related functions with PostgreSQL, as most of the fundamental work has already been completed.

waitingkuo avatar Oct 19 '23 14:10 waitingkuo

@alamb it seems like now is a great time to align time-related functions with PostgreSQL, as most of the fundamental work has already been completed.

I agree -- it would be a good time.

@waitingkuo would you be willing to make a new "date/time" epic and move all the unfinished stuff from https://github.com/apache/arrow-datafusion/issues/3148 to that new epic as a way to figure out what else is needed?

alamb avatar Oct 20 '23 17:10 alamb

@alamb Sure, I am currently on vacation, and I will do it a little later.

waitingkuo avatar Oct 21 '23 03:10 waitingkuo

Thanks @waitingkuo -- no rush!

alamb avatar Oct 21 '23 20:10 alamb