[ENH] Add handling for differently formatted datetimes
Is your feature request related to a problem? Please describe.
Currently, our support for scalar datetimes assumes that they are formatted YYYY-MM-DD; when this isn't the case, such as in the following example:
import numpy as np
import pandas as pd
from dask_sql import Context
c = Context()
c.create_table("df", pd.DataFrame({"d": np.arange("2001-03", "2001-05", dtype="datetime64[D]")}))
c.sql("select * from df where d > '2001-4-01'")
We run into optimizer failures:
WARNING:datafusion_optimizer.optimizer:Skipping optimizer rule 'simplify_expressions' due to unexpected error: Arrow error: Parser error: Error parsing timestamp from '2001-4-01': timestamp must contain at least 10 characters
Before ultimately failing to create a numpy datetime from the string:
File ~/dev/dask-sql/fix-gpuci-channel-priority/dask_sql/mappings.py:194, in sql_to_python_value(sql_type, literal_value)
188 elif (
189 sql_type == SqlTypeName.TIMESTAMP
190 or sql_type == SqlTypeName.TIME
191 or sql_type == SqlTypeName.DATE
192 ):
193 if isinstance(literal_value, str):
--> 194 literal_value = np.datetime64(literal_value)
195 elif str(literal_value) == "None":
196 # NULL time
197 return pd.NaT # pragma: no cover
ValueError: Error parsing datetime string "2001-4-01" at position 5
Describe the solution you'd like It would be nice if we had handling for irregularly formatted datetime scalars, so that queries like this would pass, with a nice secondary goal of getting optimizations working as well.
Describe alternatives you've considered The obvious alternative here would be reformatting the query to use regularly formatted datetime scalars; however, edge cases like this have come up in our query testing.
Additional context It looks like #1146 should address this issue on the Python end, interested in if there's anything we could do on the Rust end to catch these scalars and normalize their formatting before running through optimizations? Though this might make more sense as a change upstream to datafusion[-python].