dask-sql icon indicating copy to clipboard operation
dask-sql copied to clipboard

[BUG] Case sensitive=True config doesn't work as intended

Open ayushdg opened this issue 2 years ago • 1 comments

What happened: When setting sql.identifier.case_sensitive=True dask-sql still ends up converting identifiers to lowercase during the planning stage.

What you expected to happen: Case sensitivity being honored when set to True which is the default. This needs https://github.com/apache/arrow-datafusion/issues/5626 to go in after which the planner can be instantiated with the proper enable_ident_normalization value. https://github.com/dask-contrib/dask-sql/blob/11fda749d4dc4c729350a1a1e87c0ea631f11bce/dask_planner/src/sql.rs#L548 Minimal Complete Verifiable Example:

df = pd.DataFrame({"id": [0, 1], "VAL": [1, 2]})
c.create_table("test", df)
c.sql("select ID from test") # works even though ID should be treated differently to `id`
c.sql("select val from test") # fails since the schema has `VAL` but the planner is looking for `val`.

ayushdg avatar Mar 17 '23 21:03 ayushdg

Another instance of this reported by @goodwanghan in https://github.com/dask-contrib/dask-sql/issues/1108#issue-1653305345

import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

df = pd.read_parquet("https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-01.parquet")

ctx = Context()
res = ctx.sql(
    "SELECT PULocationID, COUNT(*) AS ct FROM df GROUP BY PULocationID ORDER BY ct DESC LIMIT 5",
    dataframes={"df":dd.from_pandas(df,npartitions=2)},
    config_options={"sql.identifier.case_sensitive":True}
)

ayushdg avatar Apr 05 '23 16:04 ayushdg