datafusion
datafusion copied to clipboard
to_timestamp timeunit to be consistent with postgresql's
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Current to_timestamp assume the input number is nanoseconds from 1970
❯ select to_timestamp(10);
+-------------------------------+
| totimestamp(Int64(10)) |
+-------------------------------+
| 1970-01-01 00:00:00.000000010 |
+-------------------------------+
1 row in set. Query took 0.000 seconds.
while postgresql's is to assume the number is seonds from 1970
willy=# select to_timestamp(10) at time zone 'utc';
timezone
---------------------
1970-01-01 00:00:10
(1 row)
Should we align with postgrseql's?
Describe the solution you'd like
- rename the original
to_timestampas to_timestamp_nanoseconds - we do have a to_timestamp_seconds for now. we could either rename it to
to_timestampor make a newto_timestampasto_timestamp_seconds's alias
Hi @waitingkuo -- I think given the desire to make datafusion "A Declarative SQL query interface compatible with PostgreSQL" as described on https://arrow.apache.org/datafusion/specification/roadmap.html#datafusion compatible changing the current to_timestamp to be to_timestamp_nanos() and changing to_timestamp() to be the same as the current to_timestamp_seconds would be reasonable
Here are the docs, in case anyone is interested
https://github.com/apache/arrow-datafusion/blob/2598893ac7737e92b3be7b74f606280b32264e0e/docs/source/user-guide/sql/datafusion-functions.md#to_timestamp
i'm interested in this, please assign me, thanks~
@alamb after investigating more on postgresql's, i found that we have following differences:
Differences
- postgresql is using microsecond for the resolution of timestamp,
datafusionis using nanoseconds
This is postgresql's

- postgresql
to_timestamphave 2 kinds of signatures:
- 2.1
to_timestamp(double precision) -> timestamp with time zone
https://www.postgresql.org/docs/14/functions-datetime.html

the input for to_timestamp in posgresql has the unit second. i.e. to_timestamp(1) means 1 second after 1970 but not 1 microconds.
Current datafusion supports Int64, TimeUnit(Second), TimeUnit(MicroSecond), TimeUnit(MilliSecond), TimeUnit(NanoSeconds), and UTF-8. Our current to_timestamp(int64) deems the input as nanoseconds instead of seconds which is different from postgresql's. Postresql doesnt support to_timestamp(TimeUnit). I suggest to align to_timestamp(i64) to posgresql's; and add another to_timestamp(float64). And leave the to_timestamp(TimeUnit) behave the same as what we have now (conversion between units). to_timestamp(UTF-8) is in 2.2
- 2.2
to_timestamphttps://www.postgresql.org/docs/14/functions-formatting.html

datafusion currently supports to_timestamp(utf-8) without the 2nd argument datafusion uses chrono
) to parse the time string without specify the format (while posgresql's need us to specify the format)
- postgresql's
to_timestampoutput timestamp with timezone, whiledatafusionoutputsTimeUnit(UNIT, None). The second argument forTimeUnitinarrowin implements timezone, just we don't use it into_timestampyet
Summary & My proposal
- keep using nanosecond as default while there's no specific timeunit set
- make follwings:
to_timestamp(i64)recognize input as second, outputTimestamp(NanoSecond, None)- (e.g.
to_timestamp(1) -> 1970-01-01:00:00:01)
- (e.g.
to_timestamp_nanos(i64)recognize input as nanoseconds, outputTimestamp(NanoSecond, None)- (e.g.
to_timestamp_nanos(1) -> 1970-01-01:00:00:00.000000001)
- (e.g.
to_timestamp_millis(I64)recognize input as milliseconds, outputTimestamp(MilliSecond, None)- (e.g.
to_timestamp_millis(1) -> 1970-01-01:00:00:01.000001)
- (e.g.
to_timestamp_micros(I64)recognize input as microsecond, outputTimestamp(MicroSecond, None)- (e.g.
to_timestamp_micros(1) -> 1970-01-01:00:00:00.001)
- (e.g.
to_timestamp_seconds(I64)recognize input as second, outputTimestamp(Second, None)- (e.g.
to_timestamp_seconds(1) -> 1970-01-01:00:00:01)
- (e.g.
- support
to_timestamp(float64)and `to_timestamp_xx(float64) - keep using chorno's time string format for now
- start another thread for dicussing timezone implementation, I feel like this might break many things. And we also need to implement something like
SET TIME ZONE AS xxxlike what posgresql has
Please leave your comments. I'll submit another issue for tracking all of these.
Summary & My proposal
Makes sense to me. 👍
cc @ovr @andygrove @seddonm1 @stuartcarnie (not sure if anyone else is interested in this discussion too)
start another thread for dicussing timezone implementation, I feel like this might break many things. And we also need to implement something like SET TIME ZONE AS xxx like what posgresql has
There are a variety of open issues known about timestamp handling -- for example https://github.com/apache/arrow-rs/issues/1936 https://github.com/apache/arrow-rs/issues/1380 https://github.com/apache/arrow-rs/issues/597 and others
It would be great to figure out how best to support these and start fixing it up.
@alamb should we consider to_timestamp() to output timestamp with time zone?
in postgresql select timestamp '2020-01-01' output timestamp without time zone but to_timestamp() output timestamp with time zone.
@alamb should we consider to_timestamp() to output timestamp with time zone?
I think that can be a long term goal -- I suspect making that change will be hard / impossible without improving the timezone support in general across DataFusion
@alamb i could image that might breaks lots of thing there're some time zone issues in datafusion for now as well e.g. #3080
@alamb after spend time figuring out how postgresql is working, i submitted a proposal here #3100
probably we need to deal with datatype date / time / timestamp / time zone first and then move to this timestamp related functions
probably we need to deal with datatype date / time / timestamp / time zone first and then move to this timestamp related functions
I agree that sounds wise
@alamb it seems like now is a great time to align time-related functions with PostgreSQL, as most of the fundamental work has already been completed.
@alamb it seems like now is a great time to align time-related functions with PostgreSQL, as most of the fundamental work has already been completed.
I agree -- it would be a good time.
@waitingkuo would you be willing to make a new "date/time" epic and move all the unfinished stuff from https://github.com/apache/arrow-datafusion/issues/3148 to that new epic as a way to figure out what else is needed?
@alamb Sure, I am currently on vacation, and I will do it a little later.
Thanks @waitingkuo -- no rush!