Draft: [dagster-polars] Pandera integration for DataFrame schema validation :safety_vest:
Summary & Motivation
This PR:
- :sparkles: adds Pandera support to dagster-polars (resolves #20584). The goal is to be able to load and save Pandera-validated DataFrames like this:
from dagster import asset
import pandera
import pandera.typing.polars
import polars as pl


class MySchema(pandera.DataFrameModel):
    foo: str
    bar: int


@asset(io_manager_key="polars_parquet_io_manager")
def upstream() -> pandera.typing.polars.DataFrame[MySchema]:
    return pl.DataFrame({"foo": ["a", "b", "c"], "bar": [1, 2, 3]})


@asset(io_manager_key="polars_parquet_io_manager")
def downstream(upstream: pandera.typing.polars.LazyFrame[MySchema]):
    ...
- :art: refactors the logic for loading Optional, Dict, Eager, Lazy, and Pandera-typed DataFrames around a new recursive TypeRouter helper class (needs a better name?). This makes the base IOManager logic much cleaner while supporting all combinations of these types. TypeRouter could also be reused in the BigQuery IOManager and, more generally, in other non-upath IOManagers.
- :boom: [TBD] drops support for storing extra metadata in storage, as it became too hard to maintain. Extra JSON data can be stored in separate assets instead. This decision is debatable and not final.
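
To illustrate the idea behind the recursive type resolution, here is a minimal sketch using only the standard library. It is not the TypeRouter implementation from this PR: the function name describe_load, the stand-in classes EagerFrame, LazyFrame, and ValidatedFrame, and the string "load plans" are all hypothetical and exist only to show how Optional, Dict, and schema-parametrized annotations can be unwrapped one layer at a time.

```python
from typing import Any, Dict, Generic, Optional, TypeVar, Union, get_args, get_origin

S = TypeVar("S")


class EagerFrame:
    """Hypothetical stand-in for an eager frame type such as pl.DataFrame."""


class LazyFrame:
    """Hypothetical stand-in for a lazy frame type such as pl.LazyFrame."""


class ValidatedFrame(Generic[S]):
    """Hypothetical stand-in for a schema-parametrized (Pandera-style) eager frame."""


def describe_load(annotation: Any) -> str:
    """Recursively describe how a value with this annotation would be loaded."""
    origin = get_origin(annotation)
    args = get_args(annotation)

    # Optional[X] is Union[X, None]: unwrap it and recurse on X.
    if origin is Union and len(args) == 2 and type(None) in args:
        inner = next(a for a in args if a is not type(None))
        return f"optional({describe_load(inner)})"

    # Dict[str, X]: one frame per string key; recurse on the value type.
    if origin is dict:
        key_type, value_type = args
        if key_type is not str:
            raise TypeError(f"unsupported dict key type: {key_type!r}")
        return f"dict({describe_load(value_type)})"

    # Schema-parametrized frame: load eagerly, then validate against the schema.
    if origin is ValidatedFrame:
        (schema,) = args
        return f"validated[{schema.__name__}](eager)"

    # Leaf types: choose between an eager and a lazy load.
    if annotation is EagerFrame:
        return "eager"
    if annotation is LazyFrame:
        return "lazy"

    raise TypeError(f"unsupported type annotation: {annotation!r}")


class MySchema:
    """Hypothetical schema placeholder."""


print(describe_load(Optional[Dict[str, LazyFrame]]))  # optional(dict(lazy))
print(describe_load(Dict[str, ValidatedFrame[MySchema]]))  # dict(validated[MySchema](eager))
```

Because each branch handles exactly one wrapper and recurses on what is inside, any nesting of the supported wrappers resolves without enumerating the combinations by hand. The sketch only recognizes the typing.Optional spelling; handling the X | None syntax on older Python versions would need an extra check.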
How I Tested These Changes
Existing tests pass, which indicates that the TypeRouter recursive type resolution is correct for the cases they cover.
New tests for the Pandera integration are still to be added.