sqlfluff icon indicating copy to clipboard operation
sqlfluff copied to clipboard

BigQuery: Accessing `STRUCT` elements evades triggering L027

Open dmohns opened this issue 3 years ago • 5 comments
trafficstars

Search before asking

  • [X] I searched the issues and found no similar issues.

What Happened

Accessing unreferenced STRUCT elements using BigQuery dot notation in a multi table query does not trigger L027.

Expected Behaviour

L027 gets triggered.

Observed Behaviour

L027 does not get triggered.

How to reproduce

SELECT
    t1.col1,
    t2.col2,
    events.id
FROM t_table1 AS t1
LEFT JOIN t_table2 AS t2
    ON TRUE

Dialect

BigQUery

Version

0.11.2 using online.sqlfluff.com

Configuration

N/A

Are you willing to work on and submit a PR to address the issue?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

dmohns avatar Apr 04 '22 13:04 dmohns

This is tricky.

Basicaly L026 works to make sure qualified columns only use tables in the from clause. This doesn’t really work for STRUCTs as impossible to know if it’s a qualified column or a STRUCT, so is off by default for languages that support them - like BigQuery.

L027 works to make sure columns are qualified for multi-table joins (i.e. have at least one dot). But it doesn’t check the qualifiers are valid - that’s L026’s job, which as I say is off by default for BigQuery.

tunetheweb avatar Apr 04 '22 19:04 tunetheweb

This is tricky.

Basicaly L026 works to make sure qualified columns only use tables in the from clause. This doesn’t really work for STRUCTs as impossible to know if it’s a qualified column or a STRUCT, so is off by default for languages that support them - like BigQuery.

L027 works to make sure columns are qualified for multi-table joins (i.e. have at least one dot). But it doesn’t check the qualifiers are valid - that’s L026’s job, which as I say is off by default for BigQuery.

But my understanding of why L026 is turned off for languages with structs is because of false positives. I.e. for

SELECT
    col1,
    a.col2

it is not clear a-priori whether a. is table alias (in which case L026 should trigger because of mixed references) or a is in fact a struct and L026 should not trigger.

However, in L027 we are in a slightly different scenario, as we do not need to detect whether we are in a mixed scenario or not. We already know, that we want every column to have a qualified reference. So, the task (as far as L027 is concerned) becomes checking whether each reference is a) qualified and b) valid (exists in the list of table aliases).

I have been tinkering around with this a little bit, I'll send a PR where we can discuss this further in a moment.

dmohns avatar Apr 04 '22 20:04 dmohns

You are correct. It was just that, other than for BigQuery, it was not really necessary to test for b) (valid) since the combination of L026 and L027 would effectively test for that.

It does feel like a little bit of duplication to do this test in both L026 and L027.

It also means any invalid references will flag in both L026 and L027 for the same error.

tunetheweb avatar Apr 04 '22 20:04 tunetheweb

You are correct. It was just that, other than for BigQuery, it was not really necessary to test for b) (valid) since the combination of L026 and L027 would effectively test for that.

It does feel like a little bit of duplication to do this test in both L026 and L027.

It also means any invalid references will flag in both L026 and L027 for the same error.

I was actually talking about L028 not L026 🤦 but the argument is the same.

I agree that duplicated errors would be very odd for non-BigQuery users. What do you think about making the check for valid references a configurable option to L027 which is disabled by default?

This way BigQuery users could benefit from the additional check while other do not get disturbed. Code-wise it doesn't seem too much of duplication, as we'd be reusing most of L020's work (although I think it only covers SELECT)

dmohns avatar Apr 04 '22 21:04 dmohns

So here's some SQL:

SELECT
   t3.col1,
   col2
FROM
  table1 t1
JOIN
  table2 t2
ON t1.col1 = t2.col1

This would potentially flag:

  • L026 - for t3.col1 (but not for bigquery by default)
  • L027 - for t3.col1 (but only because there are two or more tables otherwise it wouldn't flag here)
  • L027 for col2
  • L028 for t3.col1 and col2, offering to fix col2

Maybe that's OK though since they often won't flag together? And luckily only L028 is fix compatible so at least L026 and L027 won't both try to fix at same time.

Will give it some more thought...

tunetheweb avatar Apr 04 '22 23:04 tunetheweb