dbt-oracle
dbt-oracle copied to clipboard
[Bug] hash_collissions in dbt snapshot
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
snapshot is implemented with the ora_hash() function. This gives a lot of collissions in a large data set and duplicate ids, causing errors in the merge statement for data volumes of millions of rows. dbt snapshot will fail once a hash collission appears, and the snapshot tables becomes impossible to update.
Expected Behavior
No collissions,
Steps To Reproduce
run dbt snapshot command on data with a few million updates. First dbt snapshot command succeeds as there are only inserts, but duplicates in dbt_scd_id exists causing trouble on subsequest runs.
Relevant log output using --debug flag enabled
06:08:52 Database Error in snapshot table_name (snapshots\table_name.sql)
ORA-30926: unable to get a stable set of rows in the source tables
Help: https://docs.oracle.com/error-help/db/ora-30926/
compiled Code at target\run\dbt_project\snapshots\table_name.sql
Environment
- OS: Windows
- Python: 3.11
- dbt: 1.8.1
What Oracle database version are you using dbt with?
19c
Additional Context
No response
Hi @espenhoh
I understand the issue because we have seen these in the past.
#52 #102
And yes, we decided not to update ORA_HASH function because it would break existing snapshots, Users who are starting fresh snapshots should include the following macro in their dbt project. This uses a much stronger hash function - SHA256
{% macro oracle__snapshot_hash_arguments(args) -%}
STANDARD_HASH({%- for arg in args -%}
coalesce(cast({{ arg }} as varchar(4000) ), '')
{% if not loop.last %} || '|' || {% endif %}
{%- endfor -%}, 'SHA256')
{%- endmacro %}
For existing snapshots, to migrate, there is an option.
dbt snapshot uses MERGE INTO statement.
MERGE INTO target t USING staging s ON (s.dbt_scd_id = d.dbt_scd_id)
So, before using the new hash function you need to UPDATE the column dbt_scd_id in the snapshot table using the STANDARD_HASH function.
UPDATE <SNAPSHOT_TABLE>
SET dbt_scd_id = STANDARD_HASH(<args>, 'SHA256')
@espenhoh
dbt-oracle==1.9.1rc1 release candidate is out where we have changed the hash algorithm to SHA256
For anybody who is trying to migrate existing dbt-oracle snapshots with the old ora_hash() to be 1.9.1 compatible:
With dbt-oracle 1.9.1 in place run the following operation on your snapshot table with PROMOTION_COSTS_SNAPSHOT being your snapshot u want to migrate and promo_id being your unique Identifier of the snapshot.
dbt --debug run-operation update_legacy_dbt_scd_id \
--args '{snapshot_table: PROMOTION_COSTS_SNAPSHOT, cols: ["promo_id", "dbt_updated_at"]}'
This way is found in the code comments here: https://github.com/oracle/dbt-oracle/blob/607f3321268a6380e3e88b3e679604eec83a12dd/dbt/include/oracle/macros/update_legacy_snapshots.sql#L33
The Oracle error indicating that you have to migrate is an ORA-01790 like:
oracle adapter: Oracle error: ORA-01790: expression must have same datatype as corresponding expression
With the new Hash function the dbt_scd_id of the snapshot had to change from NUMBER to RAW(32) causing the datatype collision.
Hope that helps anybody with the same Problem.
dbt-oracle==1.9.1 switched to SHA256 hashing algorithm for snapshots. This should address the hash collisions problem.