prql icon indicating copy to clipboard operation
prql copied to clipboard

Bug in how we handle joins after aggregates

Open ajfriend opened this issue 3 years ago • 4 comments
trafficstars

I get an error when running the example at https://pypi.org/project/prql-python/

Code

import prql_python as prql

prql_query = """
    from employees
    join salaries [emp_id]
    group [emp_id, gender] (
      aggregate [
        avg_salary = average salary
      ]
    )
    join departments [dept_id]
"""

sql = prql.to_sql(prql_query)

Error

In [7]: prql.to_sql(prql_query)
Traceback (most recent call last):

  File "/Users/ajfriend/work/env/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3444, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-7-288067862fa2>", line 1, in <module>
    prql.to_sql(prql_query)

  File "<string>", line unknown
SyntaxError: Error { span: Some(Span { start: 166, end: 173 }), reason: Simple("join using a column name must belong to both tables"), help: None }

However, JSON seems to work

On the other hand, things seem to work properly when I run prql.to_json(prql_query)

ajfriend avatar Jul 09 '22 21:07 ajfriend

Thanks @ajfriend .

That does look like a bug; I'm not sure whether it's a regression or it never worked.

I'm going to remove the final line so we don't have an incorrect example, and change this into a compiler bug

max-sixty avatar Jul 09 '22 21:07 max-sixty

Python example removed by https://github.com/prql/prql/pull/787

max-sixty avatar Jul 09 '22 21:07 max-sixty

I don't think this error that's thrown in this case (on the final line of the PRQL) is related to joining at all – it's that any columns other than emp_id, gender and avg_salary get dropped with the grouping + aggregation.

Using dept_id in any way after that operation is an invalid operation.

But, what's buggy here is that the error doesn't go away once we add dept_id to the group clause like this:

-group [emp_id, gender] (
+group [emp_id, dept_id, gender] (

but, re-defining dept_id using a derive prior to the first join does work:

from employees
+derive dept_id = dept_id
join salaries [emp_id]
+group [emp_id, dept_id, gender] (
-group [emp_id, gender] (
  aggregate [
    avg_salary = average salary
  ]
)
join departments [dept_id]

as far as PRQL is concerned, adding dept_id to group isn't actually required, BUT without that, the query would actually fail because the grouping swallows it:

Query 1 ERROR: ERROR:  column "dept_id" specified in USING clause does not exist in left table

mklopets avatar Jul 18 '22 22:07 mklopets

(just to note, the example that @mklopets gave above is still an error after #908 )

max-sixty avatar Aug 06 '22 07:08 max-sixty

Here's a minimal example:

from employees
join salaries [emp_id]  # Works without this line
group [dept_id] (
  aggregate [avg_salary = average salary]
)
select [dept_id]
Error: 
   ╭─[:6:9]
   │
 6 │ select [dept_id]
   ·         ───┬───  
   ·            ╰───── Unknown variable `dept_id`
───╯

max-sixty avatar Oct 13 '22 23:10 max-sixty

This now works:

from employees
join salaries [~emp_id]  # Works without this line
group [dept_id] (
  aggregate [avg_salary = average salary]
)
select [dept_id]
SELECT
  dept_id
FROM
  employees
  JOIN salaries ON employees.emp_id = salaries.emp_id
GROUP BY
  dept_id

Will be released with 0.3

aljazerzen avatar Nov 28 '22 10:11 aljazerzen