[docs] - Update Pandera guide to include all supported schema definition methods

Open • dagsir[bot] opened this issue 3 years ago • 1 comment

Summary

Update the Pandera guide to:

  • Describe the supported schema definition options (SchemaModel and DataFrameSchema)
  • Include info about column names containing spaces + using the DataFrameSchema approach
  • Include a link to the Pandera docs about DataFrameSchema: https://pandera.readthedocs.io/en/stable/

Issue from the Dagster Slack

This issue was generated from the slack conversation at: https://dagster.slack.com/archives/C01U954MEER/p1654667850385989?thread_ts=1654667850.385989&cid=C01U954MEER


Conversation excerpt

U03G3ND6C03: Hi all, I'm looking to use pandera to validate my SDAs. I want to validate my raw data assets, which are straight dumps of the source data. However, there are spaces in the raw data field names, and I'd like to use the dagster-pandera API, which looks like the snippet below. Is there a way to handle the spaces, preferably without changing the raw column names?

class Member_Schema(pa.SchemaModel):
  # col_name: Series[expected data type] - pa.Field()
  client number: Series[float64] = pa.Field()
  account number: Series[object] = pa.Field()

U015C9U9RLK: <@U018K0G2Y85> issue dagster-pandera doesn’t handle spaces in col names

U01GTMVMGQH: Hi Barry, dagster-pandera supports both of pandera’s formats for defining a dataframe schema: the SchemaModel approach (which is illustrated in your snippet) and the pa.DataFrameSchema approach. For columns with spaces, you should use the pa.DataFrameSchema approach:

from dagster_pandera import pandera_schema_to_dagster_type
import pandera as pa
member_schema = pa.DataFrameSchema(
    {
        "client number": pa.Column(float),
        "account number": pa.Column(object)
    }
)

df_type = pandera_schema_to_dagster_type(member_schema)

See Pandera docs for more on the DataFrameSchema object.
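
To illustrate why the dict-based form accepts these column names while the class-based snippet does not, here is a minimal standard-library sketch. It uses no pandera or dagster calls. The helper `is_valid_attribute_name` and the plain-dict stand-in `dict_schema` are hypothetical names introduced only for this illustration. The point is that a class attribute name must be a valid Python identifier, whereas a dict key can be any string.

```python
import keyword


def is_valid_attribute_name(name: str) -> bool:
    """Return True if `name` could be written as a class attribute in Python source."""
    return name.isidentifier() and not keyword.iskeyword(name)


# Raw column names from the conversation above.
raw_columns = ["client number", "account number"]

# Column names that cannot be used as attribute names in a class body.
invalid = [col for col in raw_columns if not is_valid_attribute_name(col)]

# A class body with a space in an attribute name is rejected at compile time.
class_source = "class Member_Schema:\n    client number: float = 0.0\n"
try:
    compile(class_source, "<schema>", "exec")
    class_compiles = True
except SyntaxError:
    class_compiles = False

# Dict keys are ordinary strings, so spaces are not a problem.
# (Plain Python types stand in for pa.Column objects here.)
dict_schema = {"client number": float, "account number": object}
```

Both raw names end up in `invalid`, and `class_compiles` is `False`, which is why the dict keys passed to pa.DataFrameSchema are the natural place for column names that contain spaces.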

U03G3ND6C03: Ok sweet! So am I able to pass in the member_schema to my asset like so? It should work for either format of the schema?

@asset(dagster_type=pandera_schema_to_dagster_type(Member_Schema))

Message from the maintainers:

Do you care about this too? Give it a :thumbsup:. We factor engagement into prioritization.

dagsir[bot] • Jun 08 '22 15:06

This was a question about whether dagster-pandera supports something (it does). The solution here is to improve the docs by linking to the API docs from the guide and by emphasizing that either schema-definition approach is supported.

smackesey • Jun 08 '22 18:06