
Analyse the features available in Great Expectations and review the need for Astro custom checks

Open tatiana opened this issue 2 years ago • 1 comments

Context

At the moment, Astro offers a few table checks (stats, boolean, aggregate). These may overlap with the Great Expectations package: https://github.com/great-expectations/airflow-provider-great-expectations

We may also want to check: https://github.com/astronomer/internal_data_quality/pull/7/files

Acceptance criteria

  • Study what Great Expectations offers
  • Identify if there are any intersections with Astro Python SDK checks
  • Recommend how we want to move forward (e.g. deprecate the Astro Python SDK checks, have the Astro Python SDK offer a layer on top of the Great Expectations provider, or continue independent paths)

tatiana avatar Apr 07 '22 06:04 tatiana

After doing some research with @denimalpaca, here's what we found so far:

  1. Great Expectations does have a SQLAlchemy batch generator, so it would be possible to run the queries against the SQL database. One thing to note is that this generator uses f-strings to build queries, so it's not OPTIMAL on security, but it's also not our library so maybe we can pass the buck there(?)
  2. The current version of the astro great-expectations library would not be good for our use case because it's not really an "airflow native" experience. The user would need to set up a separate Great Expectations project and manage both projects. They would also need system configs in a file to use the current operator.
  3. I think that @denimalpaca and I could build a decorator or an operator (e.g. aql.great_expectations). This operator could automate the configuration and the choice of execution engine (e.g. if given a DataFrame use Pandas, if given a Table use SQLAlchemy, etc.). By making this a decorator we could create a simple interface where users can write their checks as Python functions.
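
To make the f-string concern in point 1 concrete, here is a minimal sketch of the general risk using only the standard-library `sqlite3` module. It does not reproduce Great Expectations code; the table, the values and the variable names are made up for illustration.

```python
import sqlite3

# In-memory database with a tiny, made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

# A value crafted to change the meaning of the query.
malicious = "x' OR '1'='1"

# f-string: the value becomes part of the SQL text, so the WHERE clause
# turns into  name = 'x' OR '1'='1'  and matches every row.
unsafe_sql = f"SELECT COUNT(*) FROM users WHERE name = '{malicious}'"
unsafe_count = conn.execute(unsafe_sql).fetchone()[0]

# Bound parameter: the value is passed separately and compared literally.
safe_count = conn.execute(
    "SELECT COUNT(*) FROM users WHERE name = ?", (malicious,)
).fetchone()[0]

print(unsafe_count, safe_count)
```

How much this matters in practice depends on whether untrusted input can reach the generated queries, which we have not checked.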

@denimalpaca thinks that this would be a really great add-on for teams that want to use Airflow for ML-based pipelines, as Great Expectations would now feel "seamless", and they could pass in the dataframe they're processing as their DAG moves forward.
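
A rough sketch of what the decorator idea from point 3 might look like. Everything here is hypothetical: `Table`, `select_engine` and `great_expectations_check` are illustrative names, not existing Astro SDK or Great Expectations APIs, and the sketch only shows the engine dispatch, not real validation.

```python
from dataclasses import dataclass
from functools import wraps
from typing import Any, Callable


@dataclass
class Table:
    """Hypothetical stand-in for a reference to a SQL table."""
    name: str
    conn_id: str = "default"


def select_engine(data: Any) -> str:
    """Choose an execution engine label based on the type of the input."""
    if isinstance(data, Table):
        return "sqlalchemy"
    # Match on the class name so this sketch does not need pandas installed.
    if type(data).__name__ == "DataFrame":
        return "pandas"
    raise TypeError(f"Unsupported input type: {type(data).__name__}")


def great_expectations_check(func: Callable) -> Callable:
    """Hypothetical decorator: resolve the engine, then call the user's check."""
    @wraps(func)
    def wrapper(data: Any, *args: Any, **kwargs: Any) -> Any:
        engine = select_engine(data)
        return func(data, *args, engine=engine, **kwargs)
    return wrapper


@great_expectations_check
def row_count_positive(data: Any, engine: str) -> str:
    # A real check would build and run expectations here.
    return f"would validate {type(data).__name__} with the {engine} engine"


print(row_count_positive(Table("orders")))
```

The point of the design is that the user only writes the check function; the configuration and the engine choice happen inside the wrapper.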

dimberman avatar Apr 26 '22 20:04 dimberman

Since this ticket was initially logged, @utkarsharma2 has led a new work stream on data validation. We can create new tickets in the future as needed. @utkarsharma2, please add links to follow up on the current state of the validation story.

tatiana avatar Jan 17 '23 09:01 tatiana