polars
polars copied to clipboard
Correctness issue: Having clause not honored in SQL statement
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
if __name__ == "__main__":
df = pl.DataFrame(
{'region': [1, 1, 2]},
)
#print(df)
ctx = pl.SQLContext(register_globals=False, eager_execution=True)
ctx.register("df", df)
print(ctx.execute("select region, count(*) from df group by region having count(*) > 1"))
To my surprise this prints out both regions (1 and 2.. Even though region 2 has only 1 element)
shape: (2, 2)
┌────────┬─────┐
│ region ┆ len │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞════════╪═════╡
│ 1 ┆ 2 │
│ 2 ┆ 1 │
└────────┴─────┘
Log output
No response
Issue description
The having clause is not honored.
Expected behavior
Only region 1 is printed
Installed versions
--------Version info---------
Polars: 0.20.22
Index type: UInt32
Platform: macOS-14.4.1-arm64-arm-64bit
Python: 3.9.15 (main, Apr 20 2024, 22:00:10)
[Clang 15.0.0 (clang-1500.3.9.4)]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fastexcel: <not installed>
fsspec: <not installed>
gevent: <not installed>
hvplot: <not installed>
matplotlib: <not installed>
nest_asyncio: <not installed>
numpy: 1.26.4
openpyxl: <not installed>
pandas: 2.2.2
pyarrow: <not installed>
pydantic: 2.7.0
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: 2.0.29
xlsx2csv: <not installed>
xlsxwriter: 3.2.0
Ahh... we require the having clause to reference named select columns here as we apply the constraint as a post-aggregation step (this is in line with MySQL's take on the "having" clause, though we prefer targeting PostgreSQL behaviour wherever possible).
This formulation will work correctly:
select region, count(*) as n from df group by region having n > 1
# ┌────────┬─────┐
# │ region ┆ n │
# │ --- ┆ --- │
# │ i64 ┆ u32 │
# ╞════════╪═════╡
# │ 1 ┆ 2 │
# └────────┴─────┘
I'll have to see if we can improve this (either by raising a suitable error, or automatically substituting to achieve the same effect) 🤔
I assume that adding this to the documentation would be helpful as well
I assume that adding this to the documentation would be helpful as well
Yup; we also need a total overhaul of our SQL documentation :))
Just ran the following on a mysql (mysql/8.3.0_1) instance:
with data as (
select 1 as region
union all
select 1 as region
union all
select 2 as region
)
select region, count(*) as region_count
from data
group by region
having count(*) > 1;
+--------+--------------+
| region | region_count |
+--------+--------------+
| 1 | 2 |
+--------+--------------+
1 row in set (0.00 sec)