
[DataCatalog]: Add a data schema evaluation mechanism

Open ElenaKhaustova opened this issue 8 months ago • 2 comments

Description

Users express the need for data schema evaluation to enable "fail-fast" capabilities during data loading and consistency checks before execution. They highlight the potential benefits of schema evaluation in integrating with other services, validating pipelines before execution, and running API checks.

We propose to explore the feasibility and necessity of implementing a data schema evaluation mechanism.

Relates to https://github.com/kedro-org/kedro/issues/3613

Context

Responses obtained during user research interviews:

  • Integration with Services and Pre-Run Checks: Schema evaluation is crucial for integrating with external services and for conducting pre-pipeline execution checks. This ensures that the data conforms to expected schemas, allowing for validation before processing begins, enhancing reliability and reducing errors during runtime.
  • Implementation Concerns and Flexibility: Implementing schema checks at the catalog level could complicate the system due to the need to bridge static data configurations with dynamic runtime requirements. The current method annotates Python functions directly, which links schemas more tightly with the execution logic and provides immediate feedback during development, aiding in maintaining type safety and contractual adherence.
  • Potential for Catalog-Level Implementation: While the current approach focuses on runtime validations tied to Python code, there's recognized potential in extending schema validations to the catalog level. This would allow for offline checks, enabling "fail-fast" capabilities during data loading and consistency checks before execution. This dual approach could provide comprehensive coverage, ensuring data integrity both at rest and in motion, and could align with practices seen in other data management frameworks such as dbt, which supports schema checks both at rest and during execution.
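
To make the two approaches above concrete, here is a minimal sketch in plain Python. It is not an existing Kedro API: the `schema` key in the catalog entry, `validate_rows`, `check_on_load`, and the `expects` decorator are all hypothetical names invented for illustration, and rows are modelled as plain dictionaries rather than a real dataset.

```python
from functools import wraps
from typing import Any, Callable

Schema = dict[str, type]
Rows = list[dict[str, Any]]


class SchemaValidationError(Exception):
    """Raised when data does not match its declared schema."""


def validate_rows(name: str, rows: Rows, schema: Schema) -> None:
    """Fail fast on the first row that violates the schema."""
    for index, row in enumerate(rows):
        missing = sorted(schema.keys() - row.keys())
        if missing:
            raise SchemaValidationError(
                f"{name}: row {index} is missing columns {missing}"
            )
        for column, expected in schema.items():
            if not isinstance(row[column], expected):
                raise SchemaValidationError(
                    f"{name}: row {index}, column {column!r} expected "
                    f"{expected.__name__}, got {type(row[column]).__name__}"
                )


# Hypothetical catalog entry. The "schema" key is an assumption made for
# this sketch; it is not an existing catalog option.
CATALOG: dict[str, dict[str, Any]] = {
    "companies": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/companies.csv",
        "schema": {"id": int, "name": str, "rating": float},
    },
}


def check_on_load(name: str, rows: Rows) -> Rows:
    """Catalog-level check: validate right after loading, before any node runs."""
    validate_rows(name, rows, CATALOG[name]["schema"])
    return rows


def expects(schema: Schema) -> Callable:
    """Code-level check: validate a function's first argument when it is called."""

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(rows: Rows, *args: Any, **kwargs: Any) -> Any:
            validate_rows(func.__name__, rows, schema)
            return func(rows, *args, **kwargs)

        return wrapper

    return decorator


@expects({"id": int, "rating": float})
def average_rating(rows: Rows) -> float:
    return sum(row["rating"] for row in rows) / len(rows)
```

In this sketch the catalog-level check raises as soon as data is loaded, while the decorator raises only when the annotated function runs. That difference is the trade-off the interview responses describe.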

ElenaKhaustova, Jun 06 '24 19:06