
Silver Data Quality Not Working

Open JunHongPP opened this issue 10 months ago • 2 comments

The data quality check is not working for Silver. I specified an expect_or_drop expectation of product_id IS NULL, but records with a product_id are still being inserted.

Steps to reproduce:

Direct insert into the flow spec:

```sql
INSERT INTO nonprod.meta.silver_dataflowspec_table
SELECT
  '1031' AS dataFlowId,
  'A1' AS dataFlowGroup,
  'delta' AS sourceFormat,
  MAP("database", "bronze", "table", "t_brz_products") AS sourceDetails,
  MAP() AS readerConfigOptions,
  'delta' AS targetFormat,
  MAP("database", "silver", "table", "t_slv_products_dq_log") AS targetDetails,
  MAP() AS tableProperties,
  Array("*") AS selectExp,
  Array("") AS whereClause,
  Array("") AS partitionColumns,
  null AS cdcApplyChanges,
  '{"expect_or_drop": {"null_product_id": "product_id IS NULL"}}' AS dataQualityExpectations,
  null AS appendFlows,
  MAP() AS appendFlowsSchemas,
  'v1' AS version,
  current_timestamp AS createDate,
  'xxxxxxx' AS createdBy,
  current_timestamp AS updateDate,
  'xxxxxx' AS updatedBy;
```

Data Quality Rule: {"expect_or_drop": {"null_product_id": "product_id IS NULL"}}

Bronze data: (screenshot)

DLT result: (screenshot)

Silver table result: the whole set of records is inserted. (screenshot)

JunHongPP avatar Feb 17 '25 05:02 JunHongPP

@JunHongPP If you use apply_changes in the silver layer, expectations should work!

In the medallion architecture, the bronze layer is append-only, while the silver layer supports Type 1 and Type 2 merges. To ensure data quality, we explicitly support quarantine and expectations in the bronze layer. However, when using apply_changes in silver, we leverage CREATE STREAMING TABLE, which includes arguments for expectations.

Try running your test with apply_changes using a Type 1 merge in silver; it should work as expected! Additionally, you can define your silver layer as a dlt-meta bronze pipeline to apply expectations in append mode within your pipeline.
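Building on the suggestion above, here is a minimal sketch of how the reproduction snippet could be adjusted so that silver goes through apply_changes with a Type 1 merge. This is not a verified fix, just an illustration: the cdcApplyChanges fields shown (keys, sequence_by, scd_type) follow the dlt-meta onboarding format, and the key and sequence columns (product_id, updated_at) are assumptions for this example.

```sql
-- Illustrative only: replace the `null AS cdcApplyChanges` column in the
-- INSERT above with a Type 1 merge spec, so dlt-meta uses apply_changes
-- (CREATE STREAMING TABLE) in silver, where expectations are honored.
-- `product_id` and `updated_at` are assumed column names.
'{"keys": ["product_id"], "sequence_by": "updated_at", "scd_type": "1"}' AS cdcApplyChanges,
-- the expectation itself stays the same:
'{"expect_or_drop": {"null_product_id": "product_id IS NULL"}}' AS dataQualityExpectations,
```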

ravi-databricks avatar Feb 20 '25 22:02 ravi-databricks

Thanks. From what I understand, the library requires CDC (apply_changes) in silver for data quality expectations to work.

Aside from that, I understand that deduplication is handled really well by DLT's CDC feature. But is it possible to load duplicate records into the silver table without performing deduplication? Before inserting new records, the process would first delete any existing records that have the same key (as per the defined primary or unique key constraints). This would allow duplicates to be loaded while maintaining key uniqueness at the time of insertion.

JunHongPP avatar Feb 25 '25 10:02 JunHongPP