datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Fix issue with "to_date" failing to process dates later than year 2262

Open MartinKolbAtWork opened this issue 1 year ago • 0 comments

Which issue does this PR close?

Closes https://github.com/apache/datafusion/issues/12226

Rationale for this change

The "to_date" function (https://github.com/apache/datafusion/blob/main/datafusion/functions/src/datetime/to_date.rs) fails to process dates that are later than year 2262. This is caused by the implementation detail that the conversion process uses nano-seconds based on epoch. The Arrow datatype for Date32 and Date64 support much larger values. The statements in some areas in the code state that the usage of nanoseconds is imposed by the Date types of Arrow. This is simply wrong. The Date32 type stores the number of days since epoch. The Date64 type stores the milliseconds (NOT nanoseconds) since epoch. Both Date32 and Date64 can therefore massively exceed the year 2262. See: https://arrow.apache.org/docs/cpp/api/datatype.html

NOTE: Processing dates later than 2262 is not a theoretical issue. In widely used business software systems, unbounded dates (e.g. an "expiry_date" that is set to never expire) are set to the 31st of December of the year 9999 (i.e. "9999-12-31"). Processing such data with datafusion will fail if the current "to_date" implementation touches this data.

What changes are included in this PR?

  • Changing the implementation of "to_date" to use milli-seconds instead of nano-seconds in order to achieve compliance with the values supported on Arrow's Date32 and Date64
  • Unit tests to validate the correct behavior

Are these changes tested?

Unit tests are provided in https://github.com/apache/datafusion/blob/main/datafusion/functions/src/datetime/to_date.rs

Are there any user-facing changes?

n/a

MartinKolbAtWork avatar Aug 29 '24 06:08 MartinKolbAtWork