dbt-dry-run icon indicating copy to clipboard operation
dbt-dry-run copied to clipboard

Dry run fails on models whose fields explicitly reference dependency models

Open alvfermar-mp opened this issue 2 years ago • 3 comments

Take the model

SELECT 
    field1
    , table1.field2
    , field3
FROM table1

Since the reference to table1 is being replaced by (SELECT "some string" as field2, 42 as field2 and True as field3), if we don't alias that select literal with the name of the reference (ie. AS table1), the query will fail to execute. This could be solved by simple adding that bit to the select literal

alvfermar-mp avatar Jul 25 '22 12:07 alvfermar-mp

Following on from the discussions in PR #13

There are several cases we need to be able to cover:

  1. No aliasing, implicit selection of columns
SELECT
    first_name
FROM `at-data-platform-preprod`.`training`.`employee`;

-- Converts to

SELECT
    first_name
FROM (SELECT 'some-string' as first_name)
  1. Aliasing with AS
SELECT
    emp.first_name
FROM `at-data-platform-preprod`.`training`.`employee` as emp;

-- Converts to

SELECT
    emp.first_name
FROM (SELECT 'some-string' as first_name) as emp;
  1. Aliasing without AS:
SELECT
    emp.first_name
FROM `at-data-platform-preprod`.`training`.`employee` emp;

-- Converts to

SELECT
    emp.first_name
FROM (SELECT 'some-string' as first_name) emp;
  1. Using the table name explicitly
SELECT
    employee.first_name
FROM `at-data-platform-preprod`.`training`.`employee`;

-- Converts to

SELECT
    employee.first_name
FROM (SELECT 'some-string' as first_name) employee; -- Problem: Currently we don't include this 'employee' alias so the dry run fails

The current implementation works for cases 1 - 3 but fails for case 4. We need to detect for each model if after the table reference there is an alias. If there is it should be preserved if there is not it should put the name of the table as the alias. So for case 1 we now get:

SELECT
    first_name
FROM `at-data-platform-preprod`.`training`.`employee`;

-- Converts to

SELECT
    first_name
FROM (SELECT 'some-string' as first_name) employee -- This is added but doesn't change anything

I think modifying the regex that the select literal matches could help: https://regexr.com/6qllk

((?:from|join)(?:\s*--.*)?[\r\n\s]*)(`my-project`.`my-dataset`.`my-table`)(?:;?\s*--.*)?[\r\n\s]*((?:(?:AS)?(?:\s*--.*)?[\r\n\s]*.+))?

There are 3 examples here. If capturing group 3 is empty then we need to add my-table as an alias if capturing group 3 is not empty then do nothing and keep the old behaviour. I am happy to pick this up if you like as I can imagine that regex is quite opaque.

The first part ((?:from|join)(?:\s*--.*)?[\r\n\s]*)(my-project.my-dataset.my-table) is saying:

  • Look for 'from' or 'join' and then optionally some new lines or SQL comments (-- followed by some words)
  • Then look for the table reference

The second part (?:;?\s*--.*)?[\r\n\s]*((?:(?:AS)?(?:\s*--.*)?[\r\n\s]*.+))? can be broken up into:

  • Optionally look for either the end of a SQL statement, whitespace or a comment
  • Then look for optionally the word 'AS'
  • Then optionally followed by more whitespace/new line/comment
  • Followed by some words (The table alias)

But it would take a lot of convincing/testing to show that this works in all cases within valid SQL syntax. Maybe moving away from a pure regex based implementation and trying to parse the SQL a bit more will help solve this issue

ghost avatar Jul 27 '22 16:07 ghost

Hey @connor-charles, I think you are right regarding the cases and best solution - we need to detect when the upstream model is not aliased and alias it.

Regarding the regex, thanks for the detailed and clear explanation but as you mentioned in my case it would need more convincing / testing than in yours (as you have already made it).

If you can pick this up would be great, but a comment regarding your previous suggestion If capturing group 3 is empty then we need to add my-table as an alias if capturing group 3 is not empty then do nothing and keep the old behaviour: looks like in the regex example you sent before the third case would need to be aliased but capturing group 3 is not empty. Maybe that could be sorted amending the still a bit more or maybe as you mentioned is a better idea to parse the SQL a bit more

alvfermar-mp avatar Jul 28 '22 07:07 alvfermar-mp

I have pushed a commit to improve the test coverage for the literal insertion as there are lot of cases to cover in the SQL parsing logic. There is a library called sqlparse which I have tried to use before with the dry runner but I found it a bit overkill for the small amount of parsing the dry runner needs to do.

I think a decent algorithm for this is to iterate backwards from the table reference you are trying to replace to determine if it is a correct candidate for replacing it with a select literal. If you find a FROM or JOIN before then it is a real table reference. If you find a quote then its probably just a string of the table reference inside the query. Need to disregard comments as well. If we take the case in the test test_replace_upstream_sql_does_not_replace_alias_in_string:

    SELECT bar.foo,
           '{node.to_table_ref_literal()}' as the_table
    FROM {node.to_table_ref_literal()} as bar

If we split the query by whitespace characters the array should look like this:

['SELECT', 'bar.foo,', "'`my_db`.`my_schema`.`A`'", 'as', 'the_table', 'FROM', '`my_db`.`my_schema`.`A`', 'as', 'bar']

Then if we search for `my_db`.`my_schema`.`A` in these tokens we find element 6. Working backwards we hit FROM immediately so we can replace element 6 with the select literal. Element 2 is in quotes so we won't replace it.

To handle comments its a bit more complicated this test we previously had to xfail because the regex couldn't match it properly:

    SELECT foo
    FROM -- test
         -- test2
        `my_db`.`my_schema`.`A`

The split SQL query is:

['SELECT', 'foo', 'FROM', '--', 'test', '--', 'test2', '`my_db`.`my_schema`.`A`']

We find the table reference in the last element. Working backwards I think it's possible to detect that the elements between then and the FROM are comments and then we can replace the table reference. Will try and implement this soon and see how it goes, would be good to get rid of the regex.

ghost avatar Aug 12 '22 08:08 ghost