Data quality next plans POC
OK so this is a pure proof of concept. Not necessarily the right way to do things, and not tested. That said, I wanted to prove the following:
- That we could build a two-step data quality pass (e.g. a profiler followed by a validator). Without this, the whylogs integration would quickly be blocked.
- That we can use config to enable/disable checks at run/compile time.
- That we can add an `applies_to` keyword to narrow the focus of a data quality check.

(1) is useful for integrations with complex tooling -- e.g. an expensive profiling step followed by lots of validations.
(2) is useful for disabling checks -- this will probably be the first feature we release.
(3) is useful for `extract_columns` -- it makes clear which output a check applies to.
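To make point (1) concrete, here is a minimal sketch of the two-step pattern: an expensive profiling step runs once, and many cheap validators then check the resulting profile. All names here (`Profile`, `profile`, `validate`) are illustrative assumptions, not the actual hamilton API:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Profile:
    """Summary statistics produced by the (expensive) profiling step."""
    count: int
    mean: float
    null_fraction: float


def profile(values: List) -> Profile:
    """Step 1: profile the data once (stand-in for e.g. whylogs profiling)."""
    non_null = [v for v in values if v is not None]
    return Profile(
        count=len(values),
        mean=sum(non_null) / len(non_null) if non_null else 0.0,
        null_fraction=1 - len(non_null) / len(values) if values else 0.0,
    )


def validate(p: Profile, checks: List[Callable[[Profile], bool]]) -> List[bool]:
    """Step 2: run many cheap validators against the single profile."""
    return [check(p) for check in checks]


results = validate(
    profile([1, 2, 3, None]),
    checks=[
        lambda p: p.null_fraction <= 0.5,  # at most half the values missing
        lambda p: 0 <= p.mean <= 10,       # mean within the expected range
    ],
)
```

The point of the split is that the profile is computed once, however many validations hang off of it.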
While some of this code still has placeholders and isn't tested, it demonstrates feasible solutions, and de-risks the release of data quality enough to make me comfortable.
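Point (2), config-driven enable/disable, could be sketched as a decorator that becomes a no-op when checks are switched off -- the wrapper is never even attached, so disabling happens at decoration ("compile") time rather than per call. `check_output` and `CONFIG` here are hypothetical stand-ins, not hamilton's real API:

```python
import functools

CONFIG = {"data_quality_enabled": True}  # would come from real driver config


def check_output(validator):
    """Wrap a function with a validator, unless checks are disabled."""
    def decorator(fn):
        if not CONFIG["data_quality_enabled"]:
            return fn  # compile-time disable: no wrapper at all

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            if not validator(result):
                raise ValueError(f"{fn.__name__} failed data quality check")
            return result

        return wrapper
    return decorator


@check_output(lambda x: x >= 0)
def total(values):
    return sum(values)
```

With the flag off, `total` is the bare function and validation costs nothing at runtime.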
Look through commits for more explanations.
## Changes

## Testing

## Notes

## Checklist
- [ ] PR has an informative and human-readable title (this will be pulled into the release notes)
- [ ] Changes are limited to a single goal (no scope creep)
- [ ] Code can be automatically merged (no conflicts)
- [ ] Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
- [ ] Passes all existing automated tests
- [ ] Any change in functionality is tested
- [ ] New functions are documented (with a description, list of inputs, and expected output)
- [ ] Placeholder code is flagged / future TODOs are captured in comments
- [ ] Project documentation has been updated if adding/changing functionality.
- [ ] Reviewers requested with the Reviewers tool :arrow_right:
## Testing checklist

### Python - local testing
- [ ] python 3.6
- [ ] python 3.7