polars icon indicating copy to clipboard operation
polars copied to clipboard

Correctness issue: Having clause not honored in SQL statement

Open timvw opened this issue 1 year ago • 4 comments
trafficstars

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

if __name__ == "__main__":

    df = pl.DataFrame(
        {'region': [1, 1, 2]},
    )
    #print(df)

    ctx = pl.SQLContext(register_globals=False, eager_execution=True)
    ctx.register("df", df)
    print(ctx.execute("select region, count(*) from df group by region having count(*) > 1"))

To my surprise this prints out both regions (1 and 2.. Even though region 2 has only 1 element)

shape: (2, 2)
┌────────┬─────┐
│ region ┆ len │
│ ---    ┆ --- │
│ i64    ┆ u32 │
╞════════╪═════╡
│ 1      ┆ 2   │
│ 2      ┆ 1   │
└────────┴─────┘

Log output

No response

Issue description

The having clause is not honored.

Expected behavior

Only region 1 is printed

Installed versions

--------Version info---------
Polars:               0.20.22
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.9.15 (main, Apr 20 2024, 22:00:10) 
[Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              <not installed>
pydantic:             2.7.0
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.29
xlsx2csv:             <not installed>
xlsxwriter:           3.2.0

timvw avatar Apr 29 '24 09:04 timvw

Ahh... we require the having clause to reference named select columns here as we apply the constraint as a post-aggregation step (this is in line with MySQL's take on the "having" clause, though we prefer targeting PostgreSQL behaviour wherever possible).

This formulation will work correctly:

select region, count(*) as n from df group by region having n > 1
# ┌────────┬─────┐
# │ region ┆ n   │
# │ ---    ┆ --- │
# │ i64    ┆ u32 │
# ╞════════╪═════╡
# │ 1      ┆ 2   │
# └────────┴─────┘

I'll have to see if we can improve this (either by raising a suitable error, or automatically substituting to achieve the same effect) 🤔

alexander-beedie avatar Apr 29 '24 14:04 alexander-beedie

I assume that adding this to the documentation would be helpful as well

timvw avatar Apr 29 '24 17:04 timvw

I assume that adding this to the documentation would be helpful as well

Yup; we also need a total overhaul of our SQL documentation :))

alexander-beedie avatar Apr 30 '24 06:04 alexander-beedie

Just ran the following on a mysql (mysql/8.3.0_1) instance:

with data as (
  select 1 as region
  union all
  select 1 as region
  union all 
  select 2 as region
)
select region, count(*) as region_count
from data
group by region
having count(*) > 1;
+--------+--------------+
| region | region_count |
+--------+--------------+
|      1 |            2 |
+--------+--------------+
1 row in set (0.00 sec)

timvw avatar Apr 30 '24 07:04 timvw