
Data quality next plans POC

Open elijahbenizzy opened this issue 2 years ago • 0 comments

This is a pure proof of concept: it is not necessarily the right way to do things, and it is not tested. That said, I wanted to prove the following:

  1. That we can build a two-step data quality pass (e.g. with a profiler and a validator). This will quickly become a blocker for whylogs.
  2. That we can use config to enable/disable items at run/compile time.
  3. That we can add an applies_to keyword to narrow the focus of a data quality check.

(1) is useful for integrations with complex tooling, e.g. an expensive profiling step followed by many validations. (2) is useful for disabling checks; it will probably be the first of these we release. (3) is useful for extract_columns, since it makes clear which columns a check applies to.
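To make the three ideas concrete, here is a minimal, library-free sketch. It is not the code in this PR and not Hamilton's API: `Validator`, `profile_column`, `run_data_quality`, `config_key`, and the `enable_mean_check` flag are hypothetical names chosen for illustration, and columns are assumed to be non-empty lists of floats.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Columns = Dict[str, List[float]]  # column name -> values (assumed non-empty)
Profile = Dict[str, float]        # summary statistics for one column


def profile_column(values: List[float]) -> Profile:
    """Step 1: the (potentially expensive) profiling pass, run once per column."""
    return {
        "count": float(len(values)),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


@dataclass
class Validator:
    """Step 2: a cheap check that only looks at an already-computed profile."""
    name: str
    check: Callable[[Profile], bool]
    applies_to: Optional[List[str]] = None  # None means "every column"
    config_key: Optional[str] = None        # hypothetical on/off flag name

    def targets(self, columns: Columns) -> List[str]:
        # Narrow the check to the named columns, if any were given.
        return list(columns) if self.applies_to is None else self.applies_to

    def enabled(self, config: Dict[str, bool]) -> bool:
        # Validators without a flag are always on; flagged ones default to on.
        return self.config_key is None or config.get(self.config_key, True)


def run_data_quality(
    columns: Columns, validators: List[Validator], config: Dict[str, bool]
) -> Dict[str, bool]:
    # Drop disabled validators before any profiling happens.
    active = [v for v in validators if v.enabled(config)]
    # Profile each needed column exactly once, however many validators use it.
    needed = {col for v in active for col in v.targets(columns)}
    profiles = {col: profile_column(columns[col]) for col in sorted(needed)}
    return {
        f"{v.name}[{col}]": v.check(profiles[col])
        for v in active
        for col in v.targets(columns)
    }


columns = {"age": [21.0, 35.0, 48.0], "income": [-5.0, 40.0, 90.0]}
validators = [
    Validator("non_negative", lambda p: p["min"] >= 0),
    Validator("max_under_120", lambda p: p["max"] < 120, applies_to=["age"]),
    Validator("mean_positive", lambda p: p["mean"] > 0, config_key="enable_mean_check"),
]
results = run_data_quality(columns, validators, {"enable_mean_check": False})
print(results)
```

With `enable_mean_check` set to `False`, the `mean_positive` validator is skipped entirely, `max_under_120` runs only on `age`, and each column is profiled once even though several validators read its profile. Separating the profile from the checks is what would let an expensive profiler feed many inexpensive validations.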

While some of this code still has placeholders and isn't tested, it demonstrates feasible solutions, and de-risks the release of data quality enough to make me comfortable.

Look through commits for more explanations.

Changes

Testing

Notes

Checklist

  • [ ] PR has an informative and human-readable title (this will be pulled into the release notes)
  • [ ] Changes are limited to a single goal (no scope creep)
  • [ ] Code can be automatically merged (no conflicts)
  • [ ] Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • [ ] Passes all existing automated tests
  • [ ] Any change in functionality is tested
  • [ ] New functions are documented (with a description, list of inputs, and expected output)
  • [ ] Placeholder code is flagged / future TODOs are captured in comments
  • [ ] Project documentation has been updated if adding/changing functionality.
  • [ ] Reviewers requested with the Reviewers tool :arrow_right:

Testing checklist

Python - local testing

  • [ ] python 3.6
  • [ ] python 3.7

elijahbenizzy, Jul 04 '22 22:07