
Draft: [dagster-polars] Pandera integration for DataFrame schema validation :safety_vest:

Open danielgafni opened this issue 8 months ago • 1 comments

Summary & Motivation

This PR:

  • :sparkles: adds Pandera support to dagster-polars (resolves #20584). The goal is to be able to load and save Pandera-validated DataFrames like this:
from dagster import asset
import pandera.polars as pa  # Polars-specific Pandera API
import pandera.typing.polars
import polars as pl

class MySchema(pa.DataFrameModel):
    foo: str
    bar: int


@asset(io_manager_key="polars_parquet_io_manager")
def upstream() -> pandera.typing.polars.DataFrame[MySchema]:
    return pl.DataFrame({"foo": ["a", "b", "c"], "bar": [1, 2, 3]})


@asset(io_manager_key="polars_parquet_io_manager")
def downstream(upstream: pandera.typing.polars.LazyFrame[MySchema]):
    ...

  • :art: refactors the logic for loading Optional, Dict, eager, lazy, and Pandera-typed DataFrames into a new recursive TypeRouter helper class (needs a better name?). This makes the base IOManager logic much cleaner while supporting all combinations of these types. TypeRouter could also be reused in the BigQuery IOManager and, more generally, in other non-UPath IOManagers.
  • :boom: [TBD] drops support for storing extra metadata alongside the data in storage, as it became too hard to maintain. Extra JSON data can be stored in separate assets instead. This decision is debatable and not final.
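
To illustrate the idea behind recursive type routing, here is a minimal sketch using only the standard library. It is not the TypeRouter implementation from this PR: `EagerFrame`, `LazyFrame`, and `route` are hypothetical names, and the stand-in classes replace the real Polars types so that the example has no dependencies. It shows only how nested annotations such as Optional and Dict can be unwrapped one layer at a time.

```python
from typing import Any, Dict, Optional, Union, get_args, get_origin


class EagerFrame:
    """Hypothetical stand-in for an eager DataFrame type."""


class LazyFrame:
    """Hypothetical stand-in for a lazy DataFrame type."""


def route(annotation: Any) -> str:
    """Recursively describe how a value with this annotation would be loaded."""
    origin = get_origin(annotation)
    args = get_args(annotation)

    # Optional[X] is Union[X, None]: unwrap it and recurse on X.
    if origin is Union and len(args) == 2 and type(None) in args:
        inner = args[0] if args[1] is type(None) else args[1]
        return f"optional({route(inner)})"

    # Dict[K, X]: every value is loaded as X, so recurse on the value type.
    if origin is dict and len(args) == 2:
        return f"dict({route(args[1])})"

    # Leaf types end the recursion.
    if annotation is EagerFrame:
        return "eager"
    if annotation is LazyFrame:
        return "lazy"

    raise TypeError(f"Unsupported annotation: {annotation!r}")


print(route(Optional[Dict[str, LazyFrame]]))
```

Each wrapper type is handled by exactly one branch that delegates to `route` for its inner type, so combinations such as an optional dictionary of lazy frames need no dedicated code path. A Pandera-typed frame would presumably be one more branch of the same shape.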

How I Tested These Changes

Existing tests pass, which indicates that TypeRouter's recursive type resolution is correct for the type combinations already covered.

New tests for the Pandera integration are still to be added.

danielgafni avatar Jun 24 '24 16:06 danielgafni