prql
prql copied to clipboard
Bug in how we handle joins after aggregates
I get an error when running the example at https://pypi.org/project/prql-python/
Code
import prql_python as prql
prql_query = """
from employees
join salaries [emp_id]
group [emp_id, gender] (
aggregate [
avg_salary = average salary
]
)
join departments [dept_id]
"""
sql = prql.to_sql(prql_query)
Error
In [7]: prql.to_sql(prql_query)
Traceback (most recent call last):
File "/Users/ajfriend/work/env/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3444, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-7-288067862fa2>", line 1, in <module>
prql.to_sql(prql_query)
File "<string>", line unknown
SyntaxError: Error { span: Some(Span { start: 166, end: 173 }), reason: Simple("join using a column name must belong to both tables"), help: None }
However, JSON seems to work
On the other hand, things seem to work properly when I run prql.to_json(prql_query)
Thanks @ajfriend .
That does look like a bug; I'm not sure whether it's a regression or it never worked.
I'm going to remove the final line so we don't have an incorrect example, and change this into a compiler bug
Python example removed by https://github.com/prql/prql/pull/787
I don't think this error that's thrown in this case (on the final line of the PRQL) is related to joining at all – it's that any columns other than emp_id, gender and avg_salary get dropped with the grouping + aggregation.
Using dept_id in any way after that operation is an invalid operation.
But, what's buggy here is that the error doesn't go away once we add dept_id to the group clause like this:
-group [emp_id, gender] (
+group [emp_id, dept_id, gender] (
but, re-defining dept_id using a derive prior to the first join does work:
from employees
+derive dept_id = dept_id
join salaries [emp_id]
+group [emp_id, dept_id, gender] (
-group [emp_id, gender] (
aggregate [
avg_salary = average salary
]
)
join departments [dept_id]
as far as PRQL is concerned, adding dept_id to group isn't actually required, BUT without that, the query would actually fail because the grouping swallows it:
Query 1 ERROR: ERROR: column "dept_id" specified in USING clause does not exist in left table
(just to note, the example that @mklopets gave above is still an error after #908 )
Here's a minimal example:
from employees
join salaries [emp_id] # Works without this line
group [dept_id] (
aggregate [avg_salary = average salary]
)
select [dept_id]
Error:
╭─[:6:9]
│
6 │ select [dept_id]
· ───┬───
· ╰───── Unknown variable `dept_id`
───╯
This now works:
from employees
join salaries [~emp_id] # Works without this line
group [dept_id] (
aggregate [avg_salary = average salary]
)
select [dept_id]
SELECT
dept_id
FROM
employees
JOIN salaries ON employees.emp_id = salaries.emp_id
GROUP BY
dept_id
Will be released with 0.3