automate-dv
automate-dv copied to clipboard
[FEATURE] Built-in support for configuring NULL business key handling
This is a long-awaited and heavily requested feature.
We have a plan to implement this as follows:
Add global variables for configuring the substitution values
vars:
hash: MD5
concat_string: '||'
...
null_key_required: -1 # Default -1, allows user to configure
null_key_optional: -2 # Default -2, allows user to configure
Add a new section to the stage configurations
source_model: raw_source
hashed_columns:
CUSTOMER_HK: CUSTOMER_ID
CUST_CUSTOMER_HASHDIFF:
is_hashdiff: true
columns:
- CUSTOMER_DOB
- CUSTOMER_ID
- CUSTOMER_NAME
- "!9999-12-31"
CUSTOMER_HASHDIFF:
is_hashdiff: true
columns:
- CUSTOMER_ID
- NATIONALITY
- PHONE
derived_columns:
RECORD_SOURCE: "!STG_BOOKING"
EFFECTIVE_FROM: BOOKING_DATE
ranked_columns:
DBTVAULT_RANK:
partition_by: CUSTOMER_ID
order_by: BOOKING_DATE
null_columns: # HERE
required:
- CUSTOMER_ID
optional:
- ORDER_ID
This would:
- Replace the NULLs found in
CUSTOMER_ID
with-1
(by default) - Replace the NULLs found in
ORDER_ID
with-2
by default - Ensure
CUSTOMER_HK
contains the hash of-1
(md5:6BB61E3B7BCE0931DA574D19D1D82C88
) - Ensure
ORDER_HK
contains the hash of-2
(md5:5D7B9ADCBE1C629EC722529DD12E5129
)
What about non-configured NULLs?
These would be left as NULL values and ignored by table macros as appropriate (this is the existing functionality)
What about audit?
The standard we will follow here is to create a new column holding the original value, as below.
CUSTOMER_ID
prior to NULL handling:
CUSTOMER_ID |
---|
NULL |
1001 |
CUSTOMER_ID
after NULL handling:
CUSTOMER_ID | CUSTOMER_ID_ORIGINAL |
---|---|
-1 | NULL |
1001 | 1001 |
_ORIGINAL
will not stay as the suffix for this, we're working on something better, suggestions welcome!
Note: As you can also see above, if a value is not NULL then we will not replace it.
Why allow users to configure the substitution value, isn't there a standard?
Standards do not fit every use case. Some data may actually use -1 as a business key for whatever reason, configuration allows this kind of scenario to be overcome more easily.
When will this be released?
We're aiming to add this in the next release
I have more questions
Please comment below 😄