
Effectivity handled with duplicate unique keys

Open cjpotgieter opened this issue 1 year ago • 9 comments

Hi, I have a question about how the logic handles your proposed approach to effectivity dates that come from the source. You specify that these attributes should just be handled like normal descriptive attributes. The Satellite macro takes a parameter for the unique key, which is typically only the hash key and the load date. Is this still the case here? Does dbt create a primary key on the table for this, or is it only used for incremental load checks? Should one include the effectivity dates in there as well?

We did a small test on the sql branch (which we got while it was still available), and it actually seemed to work: because there is no primary key on the actual tables, the duplicate "unique keys" didn't cause an error. However, we didn't test subsequent loads.

cjpotgieter avatar Jul 26 '23 07:07 cjpotgieter

Hi @cjpotgieter and thanks for reaching out!

> Does dbt create a primary key on the table for this, or is it only used for incremental load checks?

dbt does not create a primary key based on this! The information in the unique_key config only influences the loading mechanism of the "incremental" materialisation. It is not required when using our package, since we handle all incremental logic ourselves.
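For illustration, here is a minimal sketch of a plain dbt incremental model (nothing datavault4dbt-specific; the model and column names are made up). The unique_key config only steers which existing rows the incremental strategy matches on; no constraint is created:

```jinja
-- plain dbt "incremental" materialisation: unique_key drives the
-- merge / delete+insert matching only; it never creates a PK constraint
{{ config(
    materialized='incremental',
    unique_key=['hk_customer_h', 'ldts']
) }}

select * from {{ ref('stg_customer') }}
```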

> Should one include the effectivity dates in there as well?

Since it does not affect any Primary Key constraint, you don't need to include them there. Could you describe your data a bit more? Are you getting multiple effectivity ranges for one HK with the same load_date?

Best regards, Tim

tkirschke avatar Jul 27 '23 06:07 tkirschke

The data: yes, exactly. With each load there is a new batch of data with from and to dates, so the hash key (from the business key) and the load date will be repeated multiple times, but each of the (let's say 20) records will have a different from and to date.
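To make that concrete, an illustrative batch for a single business key could look like this (values are made up):

| hash key | load date | valid_from | valid_to |
| --- | --- | --- | --- |
| a1b2c3 | 2023-07-27 | 2020-01-01 | 2020-06-30 |
| a1b2c3 | 2023-07-27 | 2020-07-01 | 2021-12-31 |
| a1b2c3 | 2023-07-27 | 2022-01-01 | NULL |

The hash key and load date repeat; only the validity range differs per record.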

cjpotgieter avatar Jul 27 '23 06:07 cjpotgieter

I have a master data source where historical data gets updated with new 'valid from' and 'valid to' periods. In each batch I receive the full history, including valid from/to periods, per business key. If I understand correctly, your case sounds similar. You could use a multi-active satellite with an ma key such as ROW_NUMBER() OVER (PARTITION BY hk, ldts ORDER BY valid_from, ISNULL(valid_to, '8888-12-31')) and include all the columns as payload; see the sketch below.
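As a concrete sketch (hk_customer_h, ldts and the payload columns are assumed names), the staged rows would then get a deterministic ma key per hk/ldts group:

```sql
-- illustrative staging query; model and column names are assumptions
select
    hk_customer_h,
    ldts,
    valid_from,
    valid_to,
    phonenumber,
    address,
    row_number() over (
        partition by hk_customer_h, ldts
        order by valid_from, isnull(valid_to, '8888-12-31')
    ) as ma_attribute
from stg_customer
```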

cmettler avatar Jul 28 '23 14:07 cmettler

It does sound similar, yes. Can I check against your MA satellite wiki example: should src_ma_key be set to the ROW_NUMBER expression you mentioned, and NOT to a value from the source?

```jinja
{{ config(materialized='incremental') }}

{%- set yaml_metadata -%}
source_model: 'stg_customer'
parent_hashkey: 'hk_customer_h'
src_hashdiff: 'hd_customer_s'
src_ma_key: 'ma_attribute'
src_payload:
    - phonenumber
    - address
{%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.ma_sat_v0(source_model=metadata_dict['source_model'],
                           parent_hashkey=metadata_dict['parent_hashkey'],
                           src_hashdiff=metadata_dict['src_hashdiff'],
                           src_ma_key=metadata_dict['src_ma_key'],
                           src_payload=metadata_dict['src_payload']) }}
```

cjpotgieter avatar Jul 31 '23 07:07 cjpotgieter

@cmettler @tkirschke Any answer on this?

cjpotgieter avatar Aug 14 '23 10:08 cjpotgieter

Hi @cjpotgieter, yes, you also need to change the stage to multi-active by adding this:

```yaml
derived_columns:
    ma_attribute:
        value: 'row_number() over (partition by customer_key, ldts order by validfrom, validto)'
        datatype: int

include_source_columns: true

multi_active_config:
    multi_active_key:
        - ma_attribute
    main_hashkey_column: 'hk_customer_h'
```
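Put together, the stage model would look roughly like this. This is a sketch based on the datavault4dbt stage macro; the exact parameter set may differ between package versions, and the source model and column names are assumptions:

```jinja
{{ config(materialized='view') }}

{%- set yaml_metadata -%}
source_model: 'source_customer'
ldts: 'edwLoadDate'
rsrc: 'edwRecordSource'
hashed_columns:
    hk_customer_h:
        - customer_key
    hd_customer_s:
        is_hashdiff: true
        columns:
            - phonenumber
            - address
derived_columns:
    ma_attribute:
        value: 'row_number() over (partition by customer_key, ldts order by validfrom, validto)'
        datatype: int
multi_active_config:
    multi_active_key:
        - ma_attribute
    main_hashkey_column: 'hk_customer_h'
{%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.stage(include_source_columns=true,
                       source_model=metadata_dict['source_model'],
                       ldts=metadata_dict['ldts'],
                       rsrc=metadata_dict['rsrc'],
                       hashed_columns=metadata_dict['hashed_columns'],
                       derived_columns=metadata_dict['derived_columns'],
                       multi_active_config=metadata_dict['multi_active_config']) }}
```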

Please note that this is my personal view and not that of Scalefree.

cmettler avatar Aug 15 '23 10:08 cmettler

Ok, thanks @cmettler. What is the Scalefree opinion?

cjpotgieter avatar Aug 15 '23 10:08 cjpotgieter

@cjpotgieter Hi, thanks for the question. In your case I believe I would also use a multi-active satellite, but instead of a window function I would rather use the valid_from and valid_to columns you already have as the multi-active attributes. So for each business key coming from the source, you would have a different set of valid_from and valid_to dates per load_datetime (see the sketch below). Try it and see if it fits your use case :) Also remember that the hashdiff is calculated per group. Hope this helps!
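In stage metadata terms, that would just mean replacing the derived ROW_NUMBER column with the source validity columns as the multi-active key (a sketch, reusing the column names assumed above):

```yaml
# sketch: use the source validity columns directly as the multi-active key;
# no derived ROW_NUMBER column is needed in this variant
multi_active_config:
    multi_active_key:
        - valid_from
        - valid_to
    main_hashkey_column: 'hk_customer_h'
```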

bschlottfeldt avatar Aug 15 '23 14:08 bschlottfeldt

Thanks @bschlottfeldt. I have also heard that for this type of load, Scalefree prefers to add nanoseconds to the load date and then not create a multi-active satellite at all.

  1. Do you agree with this approach?
  2. Is there a way to do the nanosecond approach with your dbt package? (Roughly what I mean is sketched below.)
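For clarity, what I mean by the nanosecond approach is roughly the following. This is only my sketch of the idea, not something I've seen in the package, and the DATEADD syntax is SQL-Server-flavoured:

```sql
-- sketch of the "nanosecond" idea: make (hk, ldts) unique again by
-- spreading each batch's rows across artificial load timestamps
select
    hk_customer_h,
    dateadd(ns,
            row_number() over (partition by hk_customer_h, ldts
                               order by valid_from) - 1,
            ldts) as ldts,
    valid_from,
    valid_to,
    phonenumber,
    address
from stg_customer
```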

cjpotgieter avatar Aug 15 '23 14:08 cjpotgieter