mack

Brainstorm data quality features

Open robertkossendey opened this issue 2 years ago • 6 comments

Constraints are great data quality features that allow users to define filters / rules that identify invalid records. But they only allow a write to fail on invalid records, and I think we could do better.

Some ideas:

  • Ability to automatically drop invalid rows
  • Ability to automatically mark rows as invalid in target table
  • ~~Ability to write invalid rows to "Quarantine" table~~

WDYT @MrPowers

robertkossendey, Dec 30 '22 14:12
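To make the ideas above concrete, here is a minimal sketch of the three behaviors (drop, mark, quarantine). It is not mack's API: the function names, the `is_valid` flag column, and the use of plain Python dictionaries in place of DataFrame rows are all assumptions made so the example runs without Spark. On a Spark DataFrame, dropping would roughly correspond to `df.filter(condition)` and marking to `df.withColumn("is_valid", condition)`.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Rule = Callable[[Row], bool]  # hypothetical rule type: returns True for a valid row


def drop_invalid(rows: List[Row], rule: Rule) -> List[Row]:
    """Keep only the rows that satisfy the rule."""
    return [row for row in rows if rule(row)]


def mark_invalid(rows: List[Row], rule: Rule, flag: str = "is_valid") -> List[Row]:
    """Keep every row and add a boolean flag column instead of dropping."""
    return [{**row, flag: rule(row)} for row in rows]


def quarantine(rows: List[Row], rule: Rule) -> Tuple[List[Row], List[Row]]:
    """Split rows into (valid, invalid) so invalid rows can go to a separate table."""
    valid: List[Row] = []
    invalid: List[Row] = []
    for row in rows:
        (valid if rule(row) else invalid).append(row)
    return valid, invalid


def age_is_valid(row: Row) -> bool:
    """Example rule: age must be present and non-negative."""
    age = row["age"]
    return age is not None and age >= 0


rows = [{"id": 1, "age": 34}, {"id": 2, "age": -5}, {"id": 3, "age": None}]

kept = drop_invalid(rows, age_is_valid)
marked = mark_invalid(rows, age_is_valid)
valid_rows, quarantined_rows = quarantine(rows, age_is_valid)
```

The three functions differ only in what happens to a failing row: it disappears, it stays with a flag, or it is routed elsewhere. Marking is the least destructive option, since downstream readers can still filter on the flag.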

@robertkossendey - these sound like good suggestions. I'm guessing some external libs would help with this type of functionality (Great Expectations or PyDeequ perhaps), but I don't want to add any dependencies to this lib. Let's keep this open as a "meta-issue". When you have ideas for individual functions, feel free to open up a separate issue and we can chat in detail before you put in the work. Thanks!

MrPowers, Dec 30 '22 16:12

@MrPowers I wouldn't like to use any other framework tbh. If you're okay with it, I would create a PoC PR that allows you to specify a condition, and if that condition is not fulfilled, the write would fail.

robertkossendey, Dec 30 '22 16:12
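The PoC described above (specify a condition, fail the write if it is not fulfilled) could look roughly like the following. This is only a sketch under stated assumptions: `write_if_valid`, `ConstraintViolation`, and the `writer` callback are hypothetical names, and a plain list stands in for the target table so the example runs without Spark or Delta Lake.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class ConstraintViolation(Exception):
    """Hypothetical error raised when at least one row fails the condition."""


def write_if_valid(
    rows: List[Row],
    condition: Callable[[Row], bool],
    writer: Callable[[List[Row]], None],
    max_examples: int = 5,
) -> None:
    """Call writer(rows) only if every row satisfies the condition."""
    invalid = [row for row in rows if not condition(row)]
    if invalid:
        # Fail before anything is written, and report a few offending rows.
        raise ConstraintViolation(
            f"{len(invalid)} of {len(rows)} rows violate the condition; "
            f"examples: {invalid[:max_examples]}"
        )
    writer(rows)


# A list stands in for the target table; list.extend plays the role of the write.
target: List[Row] = []
write_if_valid([{"id": 1, "amount": 10}], lambda r: r["amount"] > 0, target.extend)
```

Checking the whole batch before calling the writer keeps the write all-or-nothing: a single bad row leaves the target untouched.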

@robertkossendey - yep, PoC PR sounds like a great next step!

MrPowers, Dec 30 '22 16:12

@robertkossendey @MrPowers

Hey guys, I actually built a library to mock the DLT behaviors outside of Databricks: dlt-with-debug

I think I can take out the expectation mock APIs and add them here in mack.

souvik-databricks, Dec 30 '22 17:12

@souvik-databricks very cool! Maybe you can open up a PR and we can collaborate on that then :)

robertkossendey, Dec 30 '22 18:12

I will raise the PR on this @robertkossendey

souvik-databricks, Dec 30 '22 18:12