dbt-oracle icon indicating copy to clipboard operation
dbt-oracle copied to clipboard

[Bug] hash_collissions in dbt snapshot

Open espenhoh opened this issue 1 year ago • 1 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

snapshot is implemented with the ora_hash() function. This gives a lot of collissions in a large data set and duplicate ids, causing errors in the merge statement for data volumes of millions of rows. dbt snapshot will fail once a hash collission appears, and the snapshot tables becomes impossible to update.

Expected Behavior

No collissions,

Steps To Reproduce

run dbt snapshot command on data with a few million updates. First dbt snapshot command succeeds as there are only inserts, but duplicates in dbt_scd_id exists causing trouble on subsequest runs.

Relevant log output using --debug flag enabled

06:08:52    Database Error in snapshot table_name (snapshots\table_name.sql)
  ORA-30926: unable to get a stable set of rows in the source tables
  Help: https://docs.oracle.com/error-help/db/ora-30926/
  compiled Code at target\run\dbt_project\snapshots\table_name.sql

Environment

- OS: Windows
- Python: 3.11
- dbt: 1.8.1

What Oracle database version are you using dbt with?

19c

Additional Context

No response

espenhoh avatar Sep 03 '24 11:09 espenhoh

Hi @espenhoh

I understand the issue because we have seen these in the past.

#52 #102

And yes, we decided not to update ORA_HASH function because it would break existing snapshots, Users who are starting fresh snapshots should include the following macro in their dbt project. This uses a much stronger hash function - SHA256

{% macro oracle__snapshot_hash_arguments(args) -%}
    STANDARD_HASH({%- for arg in args -%}
        coalesce(cast({{ arg }} as varchar(4000) ), '')
        {% if not loop.last %} || '|' || {% endif %}
    {%- endfor -%}, 'SHA256')
{%- endmacro %}

For existing snapshots, to migrate, there is an option.

dbt snapshot uses MERGE INTO statement.

MERGE INTO target t USING staging s ON (s.dbt_scd_id = d.dbt_scd_id)

So, before using the new hash function you need to UPDATE the column dbt_scd_id in the snapshot table using the STANDARD_HASH function.

UPDATE <SNAPSHOT_TABLE> 
SET dbt_scd_id = STANDARD_HASH(<args>, 'SHA256')

aosingh avatar Sep 04 '24 18:09 aosingh

@espenhoh

dbt-oracle==1.9.1rc1 release candidate is out where we have changed the hash algorithm to SHA256

aosingh avatar Mar 31 '25 16:03 aosingh

For anybody who is trying to migrate existing dbt-oracle snapshots with the old ora_hash() to be 1.9.1 compatible:

With dbt-oracle 1.9.1 in place run the following operation on your snapshot table with PROMOTION_COSTS_SNAPSHOT being your snapshot u want to migrate and promo_id being your unique Identifier of the snapshot.

dbt --debug run-operation update_legacy_dbt_scd_id \
    --args '{snapshot_table: PROMOTION_COSTS_SNAPSHOT, cols: ["promo_id", "dbt_updated_at"]}'

This way is found in the code comments here: https://github.com/oracle/dbt-oracle/blob/607f3321268a6380e3e88b3e679604eec83a12dd/dbt/include/oracle/macros/update_legacy_snapshots.sql#L33

The Oracle error indicating that you have to migrate is an ORA-01790 like: oracle adapter: Oracle error: ORA-01790: expression must have same datatype as corresponding expression With the new Hash function the dbt_scd_id of the snapshot had to change from NUMBER to RAW(32) causing the datatype collision.

Hope that helps anybody with the same Problem.

itgkarl avatar Apr 23 '25 07:04 itgkarl

dbt-oracle==1.9.1 switched to SHA256 hashing algorithm for snapshots. This should address the hash collisions problem.

aosingh avatar Jun 16 '25 16:06 aosingh