Data quality next plans POC
OK so this is a pure proof of concept. Not necessarily the right way to do things, and not tested. That said, I wanted to prove the following:
- That we could build a two-step data quality pass (e.g. a profiler followed by a validator). Without this, the whylogs integration would quickly be blocked.
- That we can use config to enable/disable checks at run/compile time.
- That we can add an `applies_to` keyword to narrow the focus of a data quality check.

(1) is useful for integrations with complex tooling -- e.g. an expensive profiling step followed by lots of validations.
(2) is useful for disabling checks -- this will probably be the first feature we release.
(3) is useful for `extract_columns` -- it makes clear which output a check applies to.
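To make point (1) concrete, here is a minimal sketch of the two-step pattern: an expensive profiling step runs once, and many cheap validators then check the resulting profile. All names here (`Profile`, `profile`, `validate`) are illustrative assumptions, not the actual hamilton API:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Profile:
    """Summary statistics produced by the (expensive) profiling step."""
    count: int
    mean: float
    null_fraction: float


def profile(values: List) -> Profile:
    """Step 1: profile the data once (stand-in for e.g. whylogs profiling)."""
    non_null = [v for v in values if v is not None]
    return Profile(
        count=len(values),
        mean=sum(non_null) / len(non_null) if non_null else 0.0,
        null_fraction=1 - len(non_null) / len(values) if values else 0.0,
    )


def validate(p: Profile, checks: List[Callable[[Profile], bool]]) -> List[bool]:
    """Step 2: run many cheap validators against the single profile."""
    return [check(p) for check in checks]


results = validate(
    profile([1, 2, 3, None]),
    checks=[
        lambda p: p.null_fraction <= 0.5,  # at most half the values missing
        lambda p: 0 <= p.mean <= 10,       # mean within the expected range
    ],
)
```

The point of the split is that the profile is computed once, however many validations hang off of it.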
While some of this code still has placeholders and isn't tested, it demonstrates feasible solutions, and de-risks the release of data quality enough to make me comfortable.
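Point (2), config-driven enable/disable, could be sketched as a decorator that becomes a no-op when checks are switched off -- the wrapper is never even attached, so disabling happens at decoration ("compile") time rather than per call. `check_output` and `CONFIG` here are hypothetical stand-ins, not hamilton's real API:

```python
import functools

CONFIG = {"data_quality_enabled": True}  # would come from real driver config


def check_output(validator):
    """Wrap a function with a validator, unless checks are disabled."""
    def decorator(fn):
        if not CONFIG["data_quality_enabled"]:
            return fn  # compile-time disable: no wrapper at all

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            if not validator(result):
                raise ValueError(f"{fn.__name__} failed data quality check")
            return result

        return wrapper
    return decorator


@check_output(lambda x: x >= 0)
def total(values):
    return sum(values)
```

With the flag off, `total` is the bare function and validation costs nothing at runtime.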
Look through commits for more explanations.
## Changes

## Testing

## Notes

## Checklist
- [ ] PR has an informative and human-readable title (this will be pulled into the release notes)
- [ ] Changes are limited to a single goal (no scope creep)
- [ ] Code can be automatically merged (no conflicts)
- [ ] Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
- [ ] Passes all existing automated tests
- [ ] Any change in functionality is tested
- [ ] New functions are documented (with a description, list of inputs, and expected output)
- [ ] Placeholder code is flagged / future TODOs are captured in comments
- [ ] Project documentation has been updated if adding/changing functionality.
- [ ] Reviewers requested with the Reviewers tool :arrow_right:
## Testing checklist

### Python - local testing
- [ ] python 3.6
- [ ] python 3.7