dask-sql
[BUG] dask-sql creates all tables as lower case.
The dask-sql `create_table` method registers tables with lowercased names, which means that any queries executed against the context must use the lowercase name.
What happened:
All tables are registered as lowercase, so I am unable to run any queries with uppercase table names, which requires me to call `lower()` on all of my queries.
What you expected to happen:
I would expect the case to be preserved when adding to schemas, or at least, as in the case of #177, for there to be both an original and a lowercase version.
Minimal Complete Verifiable Example:
See the following code snippet, which creates a dask dataframe and then creates two tables, one named `lower_case` and another named `UPPER_CASE`.
```python
import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

c = Context()
ddf = dd.from_pandas(pd.DataFrame({"key": ["value"]}), npartitions=1)

# create table with lower case
c.create_table("lower_case", ddf)
print(c.schema["root"].tables)
c.sql("SELECT * FROM lower_case")

# create table with upper case
c.create_table("UPPER_CASE", ddf)
print(c.schema["root"].tables)
c.sql("SELECT * FROM UPPER_CASE")
```
The second call to print the tables lists both as lowercase:

```python
{'lower_case': <dask_sql.datacontainer.DataContainer object at 0x2d8432970>, 'upper_case': <dask_sql.datacontainer.DataContainer object at 0x2d49d9670>}
```

And the second select fails to find the uppercase table:

```
ParsingException: Can not parse the given SQL: From line 1, column 15 to line 1, column 24: Object 'UPPER_CASE' not found within 'root'; did you mean 'upper_case'?
```
Anything else we need to know?:
Perhaps there is a reason why everything was made `lower()`, but I can't seem to find it in the docs.
Environment:
- dask-sql version: 2022.4.1
- Python version: 3.9
- Operating System: mac-osx arm64
- Install method (conda, pip, source): conda
> Perhaps there is a reason why everything was made `lower()`
Though it wasn't my decision, I would imagine this is to be in line with Postgres (the SQL engine which we primarily test against)? Spinning up a DB here, I see that we aren't able to create two distinct tables named `new_table` and `NEW_TABLE`, as both of these map to the all-lowercase `new_table` within the schema.

Perhaps the better change here is to reconsider the default value of `True` we've set for `sql.identifier.case_sensitive` - it seemed like table names weren't case sensitive on the DB I spun up, though by default they are for Dask-SQL. For your specific case, the following works:
```python
import dask.config

with dask.config.set({"sql.identifier.case_sensitive": False}):
    c.sql("SELECT * FROM UPPER_CASE")
```
But maybe we should consider making that the default behavior (cc @ayushdg in case you have thoughts around this)
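As a side note, if wrapping individual queries in a context manager is too fiddly, the option can also be set once for the whole session; this is just a sketch assuming the `c` context and `UPPER_CASE` table from the example above:

```python
import dask

# Setting the option globally (outside a context manager) applies it to
# every subsequent dask-sql call in this session.
dask.config.set({"sql.identifier.case_sensitive": False})

# With case sensitivity disabled, the uppercase spelling should now
# resolve to the lowercased table registered by create_table.
c.sql("SELECT * FROM UPPER_CASE")
```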
Thanks, making dask-sql case insensitive solves the issue from a query standpoint.
FWIW, I think the Calcite default dialect is `Casing.TO_UPPER` for most products, with PostgreSQL being the exception with `Casing.TO_LOWER` for unquoted casing. However, any quoted identifier should be preserved, so in that case it would be useful to allow registering a table name as it was given.
> However, any quoted identifier should be preserved, so in that case it would be useful to allow registering a table name as it was given.
I generally agree with this sentiment if this is the expected behavior for Postgres, though I think more consideration needs to be made on if / how we should shift our case handling to this style, as this is a large change from our current handling (which essentially ignores quotes when parsing table names and uses `sql.identifier.case_sensitive` to decide whether casing should be respected). I will get back to this issue and #482 after some internal sync around what the best path forward here is.
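To make that current handling a bit more concrete, here is a rough sketch of what I mean (reusing the `c` context and `UPPER_CASE` table from the original example, and assuming double-quoted identifiers are accepted by the parser):

```python
import dask.config

with dask.config.set({"sql.identifier.case_sensitive": False}):
    # Quotes around the identifier are essentially ignored when the table
    # name is parsed, so both spellings resolve to the registered
    # (lowercased) "upper_case" table once case sensitivity is disabled.
    c.sql("SELECT * FROM UPPER_CASE")
    c.sql('SELECT * FROM "UPPER_CASE"')
```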
Also, I will note that we are currently exploring an overhaul of our SQL parsing machinery from Calcite to DataFusion (check out https://github.com/dask-contrib/dask-sql/tree/datafusion-sql-planner and some of the recently opened datafusion issues for additional context). Thus, while we can discuss here what an ideal casing behavior might be for dask-sql, I wouldn't anticipate any solution being long term until we have fully fleshed out and merged the changes developing in that branch.
Thanks for this insight @charlesbluca. I did investigate using `sql.identifier.case_sensitive` and had some success validating that it allowed me to work around this issue. However, in the last day I upgraded my dependencies and found that the option was no longer being respected. Do you know of any issues that might interfere with dask-sql's case-sensitivity handling? (I upgraded `llvm` and `numba`, but I'm not sure these are dependencies that would affect this project.)
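For reference, a quick sanity check along these lines might be (just a sketch, nothing official): confirm the installed dask-sql version and the effective value of the option as dask sees it in the new environment.

```python
import dask
import dask_sql

# Report the installed dask-sql version and the value of the
# case-sensitivity option currently in effect.
print(dask_sql.__version__)
print(dask.config.get("sql.identifier.case_sensitive"))
```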
Thanks for sharing the work on DataFusion, this does look interesting. However, I will be sticking with the Calcite parser in my stack, so hopefully support for the parser syntax will remain consistent if you switch over to this solution. Not having a Java dependency and the `jpype1` interface would be a plus.
> However, in the last day I upgraded my dependencies and found that the option was no longer being respected.
To follow up: does this mean that in a newer environment the initial example posted still has issues even when `case_sensitive` is set to `False`?
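A minimal version of that check might look like the following (a sketch reusing the example from the top of this issue):

```python
import pandas as pd
import dask.config
import dask.dataframe as dd
from dask_sql import Context

c = Context()
ddf = dd.from_pandas(pd.DataFrame({"key": ["value"]}), npartitions=1)
c.create_table("UPPER_CASE", ddf)

# If this still raises a parsing error in the newer environment even with
# case sensitivity disabled, that would confirm the regression.
with dask.config.set({"sql.identifier.case_sensitive": False}):
    print(c.sql("SELECT * FROM UPPER_CASE").compute())
```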
> However, I will be sticking with the Calcite parser in my stack, so hopefully support for the parser syntax will remain consistent if you switch over to this solution.
Are you able to elaborate a bit further on this? Is the expectation here for dask-sql to be able to execute SQL queries that the Calcite parser understands, or is it to be able to directly accept a Calcite logical plan object?