prql icon indicating copy to clipboard operation
prql copied to clipboard

Undefined behaviour with ambiguous column names

Open mklopets opened this issue 2 years ago • 2 comments

In some cases, PRQL does nicely warn us whenever we're referencing a field name that could have multiple meanings:

Ambiguous reference. Could be from either of `option_a`, `option_b`

However, in others (and I haven't quite understood the difference yet), it doesn't. In this example, common_fieldname exists both in CTEs x and y. At the end of the query, we select it without prefixing it with the table name, and it gets compiled to y.common_fieldname. This runs perfectly fine in the DB, but can cause unintended behaviour – we had a query like this return data from the wrong table, as we didn't realise there was a common column name between them.

table x = (
  from orig_x
  select [x_id, common_fieldname]
)

table y = (
  from orig_y
  select [y_id, x_id, common_fieldname]
)

from x
join y [x_id]

# common_fieldname could be coming from either table now
# it just randomly happens to come from `y`
select [x_id, common_fieldname] # same issue when filtering

the compiled (partial) output is

SELECT
  x,
  y.common_fieldname

...but it could just as well be x.common_fieldname. IMO, this should instead be a case where the compiler complains and asks you to be more specific.

mklopets avatar Jul 31 '22 13:07 mklopets

I believe that this is caused by the same bug as #820 - common_fieldname is deemed to be the same column in both table definitions.

Part of the codebase that causes this will be redone in ~a week, so I won't be fixing this now.

aljazerzen avatar Aug 04 '22 07:08 aljazerzen

Now with #908 — a stopgap while we redo some of the Semantic part of the compiler — this now compiles to:

        WITH x AS (
          SELECT
            x_id,
            common_fieldname
          FROM
            orig_x
        ),
        y AS (
          SELECT
            y_id,
            x_id,
            common_fieldname
          FROM
            orig_y
        )
        SELECT
          x.*,
          y.*,
          x_id
        FROM
          x
          JOIN y USING(x_id)

...which is not as bad I think — likely the SQL engine is going to complain that there are conflicting columns.

It's not great — we can & should warn ourselves — but I hope it's OK until the Semantic rewrite @mklopets ?

max-sixty avatar Aug 06 '22 03:08 max-sixty

This was fixed with the semantic rewrite / 0.3.0 🎉

mklopets avatar Dec 04 '22 00:12 mklopets