dagster
[docs] - Update Pandera guide to include all supported schema definition methods
Summary
Update the Pandera guide to:
- Describe the supported schema definition options (SchemaModel and DataFrameSchema)
- Include info about column names containing spaces + using the DataFrameSchema approach
- Include a link to the Pandera docs about DataFrameSchema: https://pandera.readthedocs.io/en/stable/
Issue from the Dagster Slack
This issue was generated from the slack conversation at: https://dagster.slack.com/archives/C01U954MEER/p1654667850385989?thread_ts=1654667850.385989&cid=C01U954MEER
Conversation excerpt
U03G3ND6C03: Hi all, I'm looking to use pandera to validate my SDA's. I'm looking to validate my raw data assets, which are straight dumps of the source data. However, there are spaces in the raw data field names, and I'm looking to use the dagster-pandera API which looks like below. Is there a way to overcome the spaces, preferably without changing the raw column names?
class Member_Schema(pa.SchemaModel):
    # col_name: Series[expected data type] - pa.Field()
    client number: Series[float64] = pa.Field()
    account number: Series[object] = pa.Field()
U015C9U9RLK: <@U018K0G2Y85> issue dagster-pandera doesn’t handle spaces in col names
U01GTMVMGQH: Hi Barry, dagster-pandera supports either of pandera’s formats for defining a dataframe schema-- the SchemaModel approach (which is illustrated in your snippet) and the pa.DataFrameSchema approach. For columns with spaces, you should use the pa.DataFrameSchema approach:
from dagster_pandera import pandera_schema_to_dagster_type
import pandera as pa

member_schema = pa.DataFrameSchema(
    {
        "client number": pa.Column(float),
        "account number": pa.Column(object),
    }
)
df_type = pandera_schema_to_dagster_type(member_schema)
See Pandera docs for more on the DataFrameSchema object.
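The class-based form cannot express these names because of a Python-level rule, not a dagster-pandera limitation: a class attribute must be a valid identifier, while a dictionary key can be any string. A minimal standard-library sketch of that distinction follows. It does not use pandera, and the class name MemberSchema is only illustrative.

```python
import ast

# A class-body annotation whose name contains a space is rejected by the parser.
class_style = "class MemberSchema:\n    client number: float = 0.0\n"
try:
    ast.parse(class_style)
    class_style_parses = True
except SyntaxError:
    class_style_parses = False

# A dict-style definition keys columns by plain strings, so spaces are allowed.
dict_style = {"client number": float, "account number": object}

print(class_style_parses)               # False
print("client number".isidentifier())   # False
print(list(dict_style))                 # ['client number', 'account number']
```

This is why the dictionary passed to pa.DataFrameSchema can carry the raw column names unchanged.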
U03G3ND6C03: Ok sweet! So am I able to pass in the member_schema to my asset like so? It should work for either format of the schema?
@asset(dagster_type=pandera_schema_to_dagster_type(Member_Schema))
Message from the maintainers:
Do you care about this too? Give it a :thumbsup:. We factor engagement into prioritization.
This was a question about whether dagster-pandera supports something (it does). The solution is to improve the docs by linking the API doc from the guide and emphasizing that either schema-definition approach is supported.