automate-dv icon indicating copy to clipboard operation
automate-dv copied to clipboard

[FEATURE] Built-in support for configuring NULL business key handling

Open DVAlexHiggs opened this issue 1 year ago • 1 comments

This is a long-awaited and heavily requested feature.

We have a plan to implement this as follows:

Add global variables for configuring the substitution values

vars:
  hash: MD5
  concat_string: '||'
  ...
  null_key_required: -1 # Default -1, allows user to configure
  null_key_optional: -2 # Default -2, allows user to configure

Add a new section to the stage configurations

source_model: raw_source
hashed_columns:
  CUSTOMER_HK: CUSTOMER_ID
  CUST_CUSTOMER_HASHDIFF:
    is_hashdiff: true
    columns:
      - CUSTOMER_DOB
      - CUSTOMER_ID
      - CUSTOMER_NAME
      - "!9999-12-31"
  CUSTOMER_HASHDIFF:
    is_hashdiff: true
    columns:
      - CUSTOMER_ID
      - NATIONALITY
      - PHONE
derived_columns:
  RECORD_SOURCE: "!STG_BOOKING"
  EFFECTIVE_FROM: BOOKING_DATE
ranked_columns:
  DBTVAULT_RANK:
    partition_by: CUSTOMER_ID
    order_by: BOOKING_DATE
null_columns: # HERE
  required:
    - CUSTOMER_ID
  optional:
    - ORDER_ID

This would:

  • Replace the NULLs found in CUSTOMER_ID with -1 (by default)
  • Replace the NULLs found in ORDER_ID with -2 by default
  • Ensure CUSTOMER_HK contains the hash of -1 (md5: 6BB61E3B7BCE0931DA574D19D1D82C88)
  • Ensure ORDER_HK contains the hash of -2 (md5: 5D7B9ADCBE1C629EC722529DD12E5129)

What about non-configured NULLs?

These would be left as NULL values and ignored by table macros as appropriate (this is the existing functionality)

What about audit?

The standard we will follow here is to create a new column holding the original value, as below.

CUSTOMER_ID prior to NULL handling:

CUSTOMER_ID
NULL
1001

CUSTOMER_ID after NULL handling:

CUSTOMER_ID CUSTOMER_ID_ORIGINAL
-1 NULL
1001 1001

_ORIGINAL will not stay as the suffix for this, we're working on something better, suggestions welcome!

Note: As you can also see above, if a value is not NULL then we will not replace it.

Why allow users to configure the substitution value, isn't there a standard?

Standards do not fit every use case. Some data may actually use -1 as a business key for whatever reason, configuration allows this kind of scenario to be overcome more easily.

When will this be released?

We're aiming to add this in the next release

I have more questions

Please comment below 😄

DVAlexHiggs avatar Jul 06 '22 09:07 DVAlexHiggs