[ENH] Add support for tables with categorical columns
Is your feature request related to a problem? Please describe. It would be nice if Dask-SQL supported tables with categorical columns; currently, when the attempting to query a table with categorical columns, we get the following stack:
import pandas as pd
from dask_sql import Context
c = Context()
df = pd.DataFrame({"cat": pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])})
c.create_table("cat", df)
c.sql("SELECT * FROM cat").compute()
NotImplementedError Traceback (most recent call last)
Input In [1], in <module>
6 df = pd.DataFrame({"cat": pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])})
8 c.create_table("cat", df)
---> 10 c.sql("SELECT * FROM cat").compute()
File ~/dask-sql/dask_sql/context.py:465, in Context.sql(self, sql, return_futures, dataframes, gpu, config_options)
462 for df_name, df in dataframes.items():
463 self.create_table(df_name, df, gpu=gpu)
--> 465 rel, select_names, _ = self._get_ral(sql)
467 dc = RelConverter.convert(rel, context=self)
469 if dc is None:
File ~/dask-sql/dask_sql/context.py:784, in Context._get_ral(self, sql)
782 """Helper function to turn the sql query into a relational algebra and resulting column names"""
783 # get the schema of what we currently have registered
--> 784 schemas = self._prepare_schemas()
786 RelationalAlgebraGeneratorBuilder = (
787 com.dask.sql.application.RelationalAlgebraGeneratorBuilder
788 )
790 # True if the SQL query should be case sensitive and False otherwise
File ~/dask-sql/dask_sql/context.py:741, in Context._prepare_schemas(self)
739 for column in df.columns:
740 data_type = df[column].dtype
--> 741 sql_data_type = python_to_sql_type(data_type)
743 table.addColumn(column, sql_data_type)
745 java_schema.addTable(table)
File ~/dask-sql/dask_sql/mappings.py:100, in python_to_sql_type(python_type)
98 return _PYTHON_TO_SQL[python_type]
99 except KeyError: # pragma: no cover
--> 100 raise NotImplementedError(
101 f"The python type {python_type} is not implemented (yet)"
102 )
NotImplementedError: The python type category is not implemented (yet)
Describe the solution you'd like In an ideal world, mapping the categorical dtype to some existing categorical dtype in SQL, so that queries would return a categorical column. If this isn't possible, it might still be desirable to get the underlying type of the categorical values and map to that SQL type (maybe presenting a warning that by doing this we're losing the categories).
Describe alternatives you've considered I gave this a try in https://github.com/dask-contrib/dask-sql/pull/402, but got blocked because Pandas and cuDF categorical dtypes aren't consistent in behavior:
import cudf
gdf = cudf.from_pandas(df)
gdf["cat"].dtype.type # numpy.object_
df["cat"].dtype.type # pandas.core.dtypes.dtypes.CategoricalDtypeType
Additional context Encountered this issue while chatting with @DaceT about how it wasn't possible to perform queries on the sample timeseries in cuDF:
import cudf
from dask_sql import Context
c = Context()
gdf = cudf.datasets.timeseries()
c.create_table("table", gdf)
c.sql("SELECT *FROM table").compute()