union_relations guesses data types incorrectly so database fails
Describe the bug
I am using dbt_utils.union_relations to union two tables that have the same column names, but not same data types. So the macro compiles the code by doing a SELECT cast(col1 as {data_type}) as col1.... for each table it has in the UNION on the compiled code. However, I am trying to union two models A and B, both have already materialized as tables. And Both have a column called keyword_id as numeric but of different size. Table A has it defined as numeric(1,0) while table B as numeric(18,0) and it's fine because at table A the values are single digits while at B they are very large.
The issue is, when I do dbt_utils.union_relations(relations=[ref('A'),ref('B')] it fails, while if I just invert the order and do dbt_utils.union_relations(relations=[ref('B'),ref('A')] it works
I checked the compiled code to see why. And its because when A comes before B, it guesses the data type of col keyword_id as number(1,0) (in other words, it took its decision based on table A apparently). While in the scenario where we have table B and then A it guesses the data type as number(18,0).
In the compiled code of scenario 1 (A and then B) it does
SELECT ...., cast(keyword_id as number(1,0)) as keyword_id,...<more_columns> FROM A
UNION
SELECT ...., cast(keyword_id as number(1,0)) as keyword_id,...<more_columns> FROM B
While in scenario 2 (B then A):
SELECT ...., cast(keyword_id as number(18,0)) as keyword_id,...<more_columns> FROM B
UNION
SELECT ...., cast(keyword_id as number(18,0)) as keyword_id,...<more_columns> FROM A
So the order in which we add the relations in the macro has a an impact on how it guesses the data types. And in a certain combination of order of tables and their data types it might choose data types that will error out.
Steps to reproduce
Have two models which have the same column names, they can be identical, just in one model add a column keyword_id as always a single digit number; and in the other model make it a lot of digits (18 for example). Both models need to materialize as tables (in my case it was in snowflake) and have to be built and populated. Then create a third model and use the union macro to see the compile code in the scenario where you put the order of relations as model1 and then model2, vs the reverse scenario. Under the compile code check the way it casts the keyword_id column and how it changes from one scenario to another
Expected results
Ideally, the order on how we pass the references to the macro should not impact in the final result on how it guesses the data type. Or in another possible solution, the union macro should accept some data types parameter or some parameter to define a strategy on how to guess data types
System information
The contents of your packages.yml file:
packages:
- package: dbt-labs/dbt_utils version: 1.3.0
- package: calogica/dbt_expectations version: [">=0.10.0", "<0.11.0"]
Which database are you using dbt with?
- [ ] postgres
- [ ] redshift
- [ ] bigquery
- [x] snowflake
- [ ] other (specify: ____________)
The output of dbt --version:
packages:
- package: dbt-labs/dbt_utils
version: 1.3.0
- package: calogica/dbt_expectations
version: [">=0.10.0", "<0.11.0"]
Are you interested in contributing the fix?
I would like to contribute if I had some guidance, but I have never navigated the repo before and I am not an expert on the source code
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
We are running into this issue as well. Agreed that the order should not matter, and the default behavior should be to use the data type that will lead to a successful model run (number(18,0) in the case above).
I have exactly the same issue