custom handling of "bad data" in data contracts
Background

We want our users to be able to handle the "bad data" themselves, instead of being limited to what our data and schema contracts in #135 allow. Two ideas:
1. define a custom sink for the data items that violate the contract
2. provide a built-in sink that will send such data items to a table
ad 1. the fallback could be a function that takes a data item (or a list of them) and some context, and returns a (possibly modified) data item, or None if it should be discarded. We could implement our current behaviors as default sinks (freeze, discard row, discard values).
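To make the shape of such a sink concrete, here is a minimal sketch. The signature and the `violating_columns` context key are assumptions for illustration, not an existing dlt API:

```python
from typing import Any, Dict, Optional

# Hypothetical sink signature: (item, context) -> item | None.
# The context dict and its "violating_columns" key are illustrative only.

def discard_row_sink(item: Dict[str, Any], context: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Default 'discard' behavior: drop any item that violates the contract."""
    return None

def discard_values_sink(item: Dict[str, Any], context: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Default 'discard values' behavior: strip only the offending columns."""
    bad_columns = set(context.get("violating_columns", []))
    return {k: v for k, v in item.items() if k not in bad_columns}

item = {"id": 1, "name": "a", "unexpected": "x"}
ctx = {"table": "users", "violating_columns": ["unexpected"]}
print(discard_values_sink(item, ctx))  # {'id': 1, 'name': 'a'}
```

Returning the item unchanged (or raising) would then correspond to the current "freeze" behavior.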
ad 2. if marked so, the data item will be sent to a bad data table. We could store the data item in it as a string blob, plus the context of the contract (schema, table, column(s), etc.).
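A sketch of what one row of such a bad data table could look like; the field names are assumptions, not a settled schema:

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List

def to_bad_data_row(item: Any, schema_name: str, table_name: str, columns: List[str]) -> Dict[str, str]:
    """Serialize a violating item as a string blob plus its contract context."""
    return {
        "data": json.dumps(item),            # the item itself, as a string blob
        "schema": schema_name,               # contract context: schema name
        "table": table_name,                 # contract context: target table
        "columns": json.dumps(columns),      # contract context: violating column(s)
        "loaded_at": datetime.now(timezone.utc).isoformat(),
    }
```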
Discussion
- if the sink is a callback function we'll have problems in the normalizer, which is multiprocess. Possibly a sink must be a module that can be imported on demand?
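The "importable on demand" idea could look like the sketch below: instead of pickling a callback across process boundaries, each worker receives a dotted reference string and resolves it locally. The helper name is hypothetical:

```python
import importlib
from typing import Callable

def resolve_sink(sink_ref: str) -> Callable:
    """Resolve a dotted reference like 'my_project.sinks.discard_values'
    into a callable, inside the worker process that will use it."""
    module_name, _, attr = sink_ref.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)

# each normalizer worker would do something like:
#   sink = resolve_sink("my_project.sinks.discard_values")
```

Only the string crosses the process boundary, so this sidesteps the pickling problem entirely.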
- pydantic validators must be fully integrated. We may need to switch to pydantic v2, which alone allows collecting additional information during validation.
- in the extract phase, marking items with a table name is already implemented (`with_table_name`). This seems to be a good interface for the users; we can interpret the same meta in the normalizer.
- for arrow tables / pandas frames, do we probably want to store them as json in the bad data table?
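To illustrate the meta-marking idea only (this is a conceptual stand-in, not how `with_table_name` is actually implemented in dlt): the extract phase attaches routing meta to an item, and the normalizer reads the same meta to route it, e.g. into a bad data table:

```python
from typing import Any, Dict, Tuple

def with_table_name(item: Any, table_name: str) -> Dict[str, Any]:
    """Wrap an item with routing meta carrying its target table name."""
    return {"item": item, "meta": {"table_name": table_name}}

def route(marked: Dict[str, Any], default_table: str) -> Tuple[str, Any]:
    """Normalizer side: read the same meta to decide the destination table."""
    meta = marked.get("meta", {})
    return meta.get("table_name", default_table), marked["item"]

marked = with_table_name({"id": 1}, "_bad_data")
print(route(marked, "users"))  # ('_bad_data', {'id': 1})
```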
@sh-rp @codingcyclist any comments are welcome :)