[QST] NotImplementedError: The python type string is not implemented (yet)
What is your question?
I keep getting this error when querying a table created from a Dask dataframe that reads a CSV file. A couple of columns in the CSV file are strings. I've tried multiple ways to convert the pyarrow string type, but none of them worked and the type remained unchanged. How should I proceed?
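import dask.dataframe as dd
from dask_sql import Context

# Read the CSV; a couple of its columns are strings
df = dd.read_csv("../sales.csv")
print(df.dtypes)

# Register the dataframe as a SQL table and query it
c = Context()
c.create_table("sales", df)
result = c.sql("SELECT * FROM sales").compute()
print(result)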
/ArrowFlightService/lib/python3.9/site-packages/dask_sql/mappings.py", line 120, in python_to_sql_type
    raise NotImplementedError(
NotImplementedError: The python type string is not implemented (yet)
Thanks for raising the issue @luzhengyang. Could you also share the dask and dask-sql versions you're using in this example?
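In case it helps with gathering that information, here is a minimal sketch that prints the installed versions using only the standard library. The helper name `installed_versions` is hypothetical, and the list of packages is an assumption about what is relevant here:

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in this environment
            found[name] = None
    return found


# Package list is an assumption; adjust it to your environment
report = installed_versions(["dask", "dask-sql", "pandas", "pyarrow"])
print("python", sys.version.split()[0])
for name, ver in report.items():
    print(name, ver or "not installed")
```

Pasting the printed lines into the issue should be enough to compare environments.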
My assumption here is that we're getting bitten by Dask's eager conversion of object columns to pyarrow strings, which we haven't been able to fully support yet (working on this in #1220). Are you able to disable this eager conversion with dask.config.set({"dataframe.convert-string": False})? I'd be interested to hear whether that unblocks things for you.
As discussed in Discourse, the basic documentation example reproduces this error, but disabling eager conversion fixes it.
import dask.datasets
df = dask.datasets.timeseries()
from dask_sql import Context
c = Context()
c.create_table("timeseries", df, persist=True)
result = c.sql("""
SELECT
name, SUM(x) AS "sum"
FROM timeseries
WHERE x > 0.5
GROUP BY name
""")
result.compute()
For now I've disabled eager string conversion in #1260 so that users aren't hit by this breakage by default.
It works for me with Python 3.8.19. I encounter the above issue when using Python 3.9 with dask 2023.5.0 and dask_sql 2023.11.0.