[BUG] Projections map `cast` operations to original column name

Open charlesbluca opened this issue 3 years ago • 0 comments

What happened: Projecting a column that has been cast to a different dtype can behave unexpectedly, because DataFusion seems to map cast operations to the name of the original column (e.g. the key for cast(df.a to date) would be df.a).

In particular, this causes significant issues when projecting both a cast column and the original: the two collide in our named projects, so we end up using the same column for both.
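To make the collision concrete, here is a minimal sketch in plain Python. It is not dask-sql's or DataFusion's actual code: `naive_name`, `distinct_name`, `build_lookup`, and the dict-based expressions are all hypothetical stand-ins. It only illustrates how keying projections by the original column name lets the cast overwrite the plain column, and how a cast-specific name (like the `CAST(df.a AS Date32)` shown in the error message) would avoid that.

```python
# Hypothetical model of a name-keyed projection lookup; not dask-sql's real code.

def naive_name(expr):
    """Name every expression after its input column (the reported behaviour)."""
    return expr["column"]


def distinct_name(expr):
    """Give casts their own name so they cannot shadow the original column."""
    if expr["kind"] == "cast":
        return f"CAST({expr['column']} AS {expr['to']})"
    return expr["column"]


def build_lookup(exprs, namer):
    """Build a name -> expression mapping; later entries overwrite earlier ones."""
    lookup = {}
    for expr in exprs:
        lookup[namer(expr)] = expr
    return lookup


# select a, cast(a as date) from df
projection = [
    {"kind": "column", "column": "df.a"},
    {"kind": "cast", "column": "df.a", "to": "Date32"},
]

naive = build_lookup(projection, naive_name)
distinct = build_lookup(projection, distinct_name)

print(len(naive))     # only one entry survives: the cast replaced the column
print(len(distinct))  # both expressions keep their own entry
```

In the naive mapping, looking up `df.a` returns the cast expression, which matches the symptom of one column being used for both outputs.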

What you expected to happen: I would've expected cast operations to be mapped to some alias that would distinguish them from the original column, such that collisions wouldn't occur here.

Minimal Complete Verifiable Example:

We get a parsing error when trying to project the cast and original column without an alias:

import pandas as pd
from dask_sql import Context

df = pd.DataFrame({"a": ["1999-06-21"]})

c = Context()
c.create_table("df", df)

c.sql("""
    select
        a,
        cast(a as date)
    from df
""")

# ParsingException: Plan("Projections require unique expression names but the expression \"df.a\" at position 0 and \"CAST(df.a AS Date32)\" at position 1 have the same name. Consider aliasing (\"AS\") one of them.")

When using an alias, the query parses, but one column is used for both projects. Note that `a` is reported as datetime64[ns] even though it was created from strings:

c.sql("""
    select
        a,
        cast(a as date) as b
    from df
""")

# Dask DataFrame Structure:
#                             a               b
# npartitions=1                                
# 0              datetime64[ns]  datetime64[ns]
# 0                         ...             ...
# Dask Name: rename, 15 graph layers
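For comparison, the result I would expect can be sketched with pandas alone. This is only an illustration of the intended dtypes, not how dask-sql computes the query; `pd.to_datetime` stands in for the SQL cast, and the `expected` frame is a hypothetical name.

```python
import pandas as pd

df = pd.DataFrame({"a": ["1999-06-21"]})

# Expected shape of the result: `a` keeps its original string values,
# while `b` holds the parsed dates.
expected = df.assign(b=pd.to_datetime(df["a"]))

a_is_datetime = pd.api.types.is_datetime64_any_dtype(expected["a"])
b_is_datetime = pd.api.types.is_datetime64_any_dtype(expected["b"])

print(expected.dtypes)
```

Here only `b` is a datetime column, whereas the dask-sql output above reports datetime64[ns] for both `a` and `b`.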

Anything else we need to know?: I'm fairly sure this is the underlying issue behind failures we were seeing in q21 and q40 before merging in #924, as the failures seemed to indicate that a cast column wasn't the expected dtype (cc @ayushdg).

Environment:

  • dask-sql version: latest
  • Python version: 3.9
  • Operating System: ubuntu
  • Install method (conda, pip, source): source

charlesbluca, Dec 01 '22 20:12