dask-sql
[BUG] dask-sql creates all tables as lower case.
The dask-sql `create_table` method registers tables with lowercased names, which means that any queries executed against the context must use the lowercase name.
What happened:
All tables are registered as lowercase, so I am unable to run any queries with uppercase table names, which requires me to call `lower()` on all of my queries.
What you expected to happen:
I would expect the case to be preserved when adding to schemas, or at least, as in the case of #177, for there to be both an original and a lowercase version.
Minimal Complete Verifiable Example:
See the following code snippet, which creates a dask dataframe and then creates two tables, one named `lower_case` and another named `UPPER_CASE`.
```python
import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

c = Context()
ddf = dd.from_pandas(pd.DataFrame({"key": ["value"]}), npartitions=1)

# create table with lower case
c.create_table("lower_case", ddf)
print(c.schema["root"].tables)
c.sql("SELECT * FROM lower_case")

# create table with upper case
c.create_table("UPPER_CASE", ddf)
print(c.schema["root"].tables)
c.sql("SELECT * FROM UPPER_CASE")
```
The second call to print the tables lists both as lowercase:

```python
{'lower_case': <dask_sql.datacontainer.DataContainer object at 0x2d8432970>, 'upper_case': <dask_sql.datacontainer.DataContainer object at 0x2d49d9670>}
```

And the second select fails to find the uppercase table:

```
ParsingException: Can not parse the given SQL: From line 1, column 15 to line 1, column 24: Object 'UPPER_CASE' not found within 'root'; did you mean 'upper_case'?
```
Anything else we need to know?:
Perhaps there is a reason why everything was made `lower()`, but I can't seem to find it in the docs.
Environment:
- dask-sql version: 2022.4.1
- Python version: 3.9
- Operating System: mac-osx arm64
- Install method (conda, pip, source): conda
> Perhaps there is a reason why everything was made `lower()`
Though it wasn't my decision, I would imagine this is to be in line with Postgres (the SQL engine which we primarily test against)? Spinning up a DB here, I see that we aren't able to create two distinct tables named `new_table` and `NEW_TABLE`, as both of these map to the all-lowercase `new_table` within the schema.

Perhaps the better change here is to reconsider the default value of `True` we've set for `sql.identifier.case_sensitive` - it seemed like table names weren't case sensitive on the DB I spun up, though by default they are for Dask-SQL. For your specific case, the following works:
```python
import dask.config

with dask.config.set({"sql.identifier.case_sensitive": False}):
    c.sql("SELECT * FROM UPPER_CASE")
```
But maybe we should consider making that the default behavior (cc @ayushdg in case you have thoughts around this)
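As a side note, if wrapping individual queries in a context manager is too fiddly, the option can also be set once for the whole session; this is just a sketch assuming the `c` context and `UPPER_CASE` table from the example above:

```python
import dask

# Setting the option globally (outside a context manager) applies it to
# every subsequent dask-sql call in this session.
dask.config.set({"sql.identifier.case_sensitive": False})

# With case sensitivity disabled, the uppercase spelling should now
# resolve to the lowercased table registered by create_table.
c.sql("SELECT * FROM UPPER_CASE")
```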
Thanks, making dask-sql case insensitive solves the issue from a query standpoint.
FWIW, I think the Calcite default dialect is `Casing.TO_UPPER` for most products, with PostgreSQL being the exception with `Casing.TO_LOWER` for unquoted casing. However, any quoted identifier should be preserved, so in that case it would be useful to allow registering a table name as it was given.
> However, any quoted identifier should be preserved, so in that case it would be useful to allow registering a table name as it was given.
I generally agree with this sentiment if this is the expected behavior for Postgres, though I think more consideration needs to be made on if / how we should shift our case handling to this style, as this is a large change from our current handling (which essentially ignores quotes when parsing table names and uses `sql.identifier.case_sensitive` to decide whether casing should be respected). I will get back to this issue and #482 after some internal sync around what the best path forward here is.
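To make that current handling a bit more concrete, here is a rough sketch of what I mean (reusing the `c` context and `UPPER_CASE` table from the original example, and assuming double-quoted identifiers are accepted by the parser):

```python
import dask.config

with dask.config.set({"sql.identifier.case_sensitive": False}):
    # Quotes around the identifier are essentially ignored when the table
    # name is parsed, so both spellings resolve to the registered
    # (lowercased) "upper_case" table once case sensitivity is disabled.
    c.sql("SELECT * FROM UPPER_CASE")
    c.sql('SELECT * FROM "UPPER_CASE"')
```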
Also, I will note that we are currently exploring an overhaul of our SQL parsing machinery from Calcite to DataFusion (check out https://github.com/dask-contrib/dask-sql/tree/datafusion-sql-planner and some of the recently opened datafusion issues for additional context). Thus, while we can discuss here what an ideal casing behavior might be for dask-sql, I wouldn't anticipate any solution being long term until we have fully fleshed out and merged the changes developing in that branch.
Thanks for this insight @charlesbluca. I did investigate using `sql.identifier.case_sensitive` and had some success validating that it allowed me to work around this issue. However, in the last day I upgraded my dependencies and found that the option was no longer being respected. Do you know of any issues that might interfere with dask-sql's case-sensitivity handling? (I upgraded `llvm` and `numba`, but I'm not sure these are dependencies that would affect this project.)
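For reference, a quick sanity check along these lines might be (just a sketch, nothing official): confirm the installed dask-sql version and the effective value of the option as dask sees it in the new environment.

```python
import dask
import dask_sql

# Report the installed dask-sql version and the value of the
# case-sensitivity option currently in effect.
print(dask_sql.__version__)
print(dask.config.get("sql.identifier.case_sensitive"))
```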
Thanks for sharing the work on DataFusion, this does look interesting. However, I will be sticking with the Calcite parser in my stack, so hopefully support for the parser syntax will remain consistent if you switch over to this solution. Not having a Java dependency and the `jpype1` interface would be a plus.
> However, in the last day I upgraded my dependencies and found that the option was no longer being respected.
To follow up: does this mean that in a newer environment the initial example posted still has issues even when `case_sensitive` is set to `False`?
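A minimal version of that check might look like the following (a sketch reusing the example from the top of this issue):

```python
import pandas as pd
import dask.config
import dask.dataframe as dd
from dask_sql import Context

c = Context()
ddf = dd.from_pandas(pd.DataFrame({"key": ["value"]}), npartitions=1)
c.create_table("UPPER_CASE", ddf)

# If this still raises a parsing error in the newer environment even with
# case sensitivity disabled, that would confirm the regression.
with dask.config.set({"sql.identifier.case_sensitive": False}):
    print(c.sql("SELECT * FROM UPPER_CASE").compute())
```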
> However, I will be sticking with the Calcite parser in my stack, so hopefully support for the parser syntax will remain consistent if you switch over to this solution.
Are you able to elaborate a bit further on this? Is the expectation here for dask-sql to be able to execute SQL queries that the Calcite parser understands, or is it to be able to directly accept a Calcite logical plan object?