prql
prql copied to clipboard
Undefined behaviour with ambiguous column names
In some cases, PRQL does nicely warn us whenever we're referencing a field name that could have multiple meanings:
Ambiguous reference. Could be from either of `option_a`, `option_b`
However, in others (and I haven't quite understood the difference yet), it doesn't. In this example, common_fieldname
exists both in CTEs x
and y
. At the end of the query, we select it without prefixing it with the table name, and it gets compiled to y.common_fieldname
. This runs perfectly fine in the DB, but can cause unintended behaviour – we had a query like this return data from the wrong table, as we didn't realise there was a common column name between them.
table x = (
from orig_x
select [x_id, common_fieldname]
)
table y = (
from orig_y
select [y_id, x_id, common_fieldname]
)
from x
join y [x_id]
# common_fieldname could be coming from either table now
# it just randomly happens to come from `y`
select [x_id, common_fieldname] # same issue when filtering
the compiled (partial) output is
SELECT
x,
y.common_fieldname
...but it could just as well be x.common_fieldname
. IMO, this should instead be a case where the compiler complains and asks you to be more specific.
I believe that this is caused by the same bug as #820 - common_fieldname
is deemed to be the same column in both table
definitions.
Part of the codebase that causes this will be redone in ~a week, so I won't be fixing this now.
Now with #908 — a stopgap while we redo some of the Semantic part of the compiler — this now compiles to:
WITH x AS (
SELECT
x_id,
common_fieldname
FROM
orig_x
),
y AS (
SELECT
y_id,
x_id,
common_fieldname
FROM
orig_y
)
SELECT
x.*,
y.*,
x_id
FROM
x
JOIN y USING(x_id)
...which is not as bad I think — likely the SQL engine is going to complain that there are conflicting columns.
It's not great — we can & should warn ourselves — but I hope it's OK until the Semantic rewrite @mklopets ?
This was fixed with the semantic rewrite / 0.3.0 🎉