
[ENH] Add handling for differently formatted datetimes


Is your feature request related to a problem? Please describe.
Currently, our support for scalar datetimes assumes that they are formatted as YYYY-MM-DD; when that isn't the case, as in the following example:

import numpy as np
import pandas as pd

from dask_sql import Context

c = Context()

# daily datetime column spanning March-April 2001
c.create_table("df", pd.DataFrame({"d": np.arange("2001-03", "2001-05", dtype="datetime64[D]")}))

# note the non-zero-padded month in the datetime literal
c.sql("select * from df where d > '2001-4-01'")

we run into optimizer failures:

WARNING:datafusion_optimizer.optimizer:Skipping optimizer rule 'simplify_expressions' due to unexpected error: Arrow error: Parser error: Error parsing timestamp from '2001-4-01': timestamp must contain at least 10 characters

before ultimately failing to create a numpy datetime from the string:

File ~/dev/dask-sql/fix-gpuci-channel-priority/dask_sql/mappings.py:194, in sql_to_python_value(sql_type, literal_value)
    188 elif (
    189     sql_type == SqlTypeName.TIMESTAMP
    190     or sql_type == SqlTypeName.TIME
    191     or sql_type == SqlTypeName.DATE
    192 ):
    193     if isinstance(literal_value, str):
--> 194         literal_value = np.datetime64(literal_value)
    195     elif str(literal_value) == "None":
    196         # NULL time
    197         return pd.NaT  # pragma: no cover

ValueError: Error parsing datetime string "2001-4-01" at position 5
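
For what it's worth, the failure comes down to how strict numpy's parser is compared to pandas'; a minimal illustration, separate from the reproducer above:

import numpy as np
import pandas as pd

try:
    # np.datetime64 only accepts zero-padded ISO 8601 strings
    np.datetime64("2001-4-01")
except ValueError as e:
    print(e)  # Error parsing datetime string "2001-4-01" at position 5

# pandas' parser is more lenient and handles the same string fine
print(pd.Timestamp("2001-4-01"))  # 2001-04-01 00:00:00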

Describe the solution you'd like
It would be nice to have handling for irregularly formatted datetime scalars so that queries like this pass, with a secondary goal of getting the optimizations working as well.
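
As a rough sketch of one possible approach (the helper name below is made up, not an existing function), sql_to_python_value could fall back to pandas' more lenient parser when np.datetime64 rejects the string:

import numpy as np
import pandas as pd

def parse_datetime_literal(literal_value: str) -> np.datetime64:
    # hypothetical helper; fast path for well-formed YYYY-MM-DD strings
    try:
        return np.datetime64(literal_value)
    except ValueError:
        # pandas/dateutil accept a much wider range of formats,
        # e.g. '2001-4-01' or '4/1/2001'
        return pd.Timestamp(literal_value).to_datetime64()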

Describe alternatives you've considered
The obvious alternative is to reformat the query to use regularly formatted datetime scalars; however, edge cases like this have come up in our query testing.
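
For example, zero-padding the month in the reproducer above is enough to get the query through:

# same context/table as the example above
c.sql("select * from df where d > '2001-04-01'")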

Additional context
It looks like #1146 should address this issue on the Python end; I'd be interested in whether there's anything we could do on the Rust end to catch these scalars and normalize their formatting before running them through optimization, though that might make more sense as an upstream change in datafusion[-python].
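
To make that idea concrete, this is roughly the kind of normalization I have in mind, shown in Python purely for illustration (the regex and function are hypothetical; the real change would live on the Rust/DataFusion side):

import re
import pandas as pd

# purely illustrative: rewrite quoted date-like literals into
# zero-padded ISO form before the query hits the optimizer
_DATE_LITERAL = re.compile(r"'(\d{4}-\d{1,2}-\d{1,2})'")

def normalize_date_literals(sql: str) -> str:
    return _DATE_LITERAL.sub(
        lambda m: f"'{pd.Timestamp(m.group(1)).date().isoformat()}'", sql
    )

print(normalize_date_literals("select * from df where d > '2001-4-01'"))
# select * from df where d > '2001-04-01'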

charlesbluca · May 18, 2023