pandera icon indicating copy to clipboard operation
pandera copied to clipboard

Add option to return dataframe with columns in order specified in DataFrameSchema.

Open Smartitect opened this issue 2 years ago • 17 comments

Problem

When handling writing Spark dataframes to datalake storage, the order of the columns in the dataframe is important. For example if a pipeline is appending parquet files in the lake, if the columns are not written in a consistent order this can lead to corrupt data on read. We can use Pandera to check when columns are not in the right order, but it would be really useful to also use Pandera to "coerce" the columns into the correct order during validation to avoid having to fail the pipeline altogether or add additional steps into the process to ensure the columns are in a consistent order.

Solution that I'd like

Having a further option on the DataFrameSchema ordered parameter as follows:

ordered (Union[bool, Literal[‘coerce’]]) – whether or not to validate the columns order. If set to ‘coerce’, the columns validated dataframe will be returned in the order specified in the schema.

Alternatives considered

Adding an extra step to pyspark.sql("SELECT ...") all columns in the correct order from the dataframe before writing to the lake. This can use a helper function to build the pyspark.sql("SELECT ...") statement based on the column name order as they appear in the Pandera DataFrameSchema to avoid duplication of schema definition. But it would be good to avoid this step!

Additional context

Currently writing blog about using Pandera on Azure Synapse and Microsoft Fabric. This is one of the features I spotted that would be very useful.

Smartitect avatar Aug 15 '23 18:08 Smartitect

This would be really useful. There are also some cases in which during inference one or more columns are missing. It would be great to make sure that if add_missing_columns=True and ordered="coerce", we first add the missing ones and then reorder them.

AndreaPiccione avatar Nov 29 '23 10:11 AndreaPiccione

Yeah this would be a massive QoL improvement. I'd love to delete the section of our code which does the manual column sorting and just allow pandera to enforce it

Samreay avatar Nov 29 '23 22:11 Samreay

This would be great. We always have to add this line after validation - would be nice to just have it baked into the schema definition. df = df[schema.columns.keys()]

JeremyL-01 avatar Mar 21 '24 21:03 JeremyL-01

A contribution would have my blessing! 🙏

So the strict_filter_columns parser uses the ordered=True kwarg to raise an error. Would just have to add the ordering functionality to that method. Open to ideas on how this would behave... would simply strict="<some_option>" make sense to add? Or could we just assume we want an ordered dataframe if strict=True|"filter" + ordered=True?

cosmicBboy avatar Mar 22 '24 07:03 cosmicBboy

@cosmicBboy I think the OP's idea of having ordered="coerce" sounded sensible to me. strict="filter" + ordered="coerce' effectively is df = df[schema.columns.keys()]. Both of these options imply transformations: 1) filter columns to match schema, 2) coerce order of columns to match schema.

strict=False + ordered="coerce" I imagine would just order the columns in the schema, pushing the rest of the columns to the end?

JeremyL-01 avatar Mar 24 '24 21:03 JeremyL-01

I kinda don't want to re-use the term "coerce" here. coerce=True is already used to convert datatypes. Is there another verb we can use?

cosmicBboy avatar Mar 25 '24 01:03 cosmicBboy

@cosmicBboy... makes sense.... here are some other ideas: ordered="reorder", ordered="match", ordered="arrange", ordered="sync", ordered="conform", ordered="align"

JeremyL-01 avatar Mar 25 '24 02:03 JeremyL-01

thoughts on the proposals ^^ @Smartitect @AndreaPiccione @Samreay ?

cosmicBboy avatar Mar 25 '24 03:03 cosmicBboy

Any one of those synonyms makes sense to me, though in general my preference as a user would be to limit str-like kwargs going in. Type hinting and IDE completion are easier when you don't have to look up valid string inputs, so I'd also be happy with a new kwarg reorder_columns=True just like you have add_missing_columns

This also has the benefit of not requiring any changing to the simple type hint and backward compatibility of the ordered kwarg.

Samreay avatar Mar 25 '24 04:03 Samreay

I agree with @Samreay. Having a str-like kwarg would be intuitive, but would be another string to remember when creating the schema.

AndreaPiccione avatar Mar 25 '24 09:03 AndreaPiccione

+1 to reorder_columns=True.

On a related note, I'd love if strict="filter" would be deprecated in favor of remove_extra_columns=True

cosmicBboy avatar Mar 25 '24 19:03 cosmicBboy

I need this function too, hope it will be completed soon.

shaoyucheng avatar May 06 '24 07:05 shaoyucheng

any update on this functionality? I recently came across the need for it and it would be a big help.

ahsin-s avatar Jun 26 '24 13:06 ahsin-s

I don't believe anyone's working on this. Anyone who wants to make a PR for the reorder_columns=True solution has my blessing!

cosmicBboy avatar Jun 26 '24 14:06 cosmicBboy

I wanted something similar and did this:

https://github.com/unionai-oss/pandera/pull/1774

That includes an example of my general use case where I reference an upstream schema and add/remove/update columns, then I select the columns in the order I need them to be in for my Delta Table. I use this schema when creating the delta table and use the columns.keys() to reorder my Polars Dataframe etc. (By default the delta tables collect stats on the first 32 columns in the table for data filtering so I am in the habit of keeping those potential filter columns towards the beginning of the table)

ldacey avatar Aug 03 '24 21:08 ldacey

Might be able to close this now since we will be able to use .select_columns to explicitly reorder the schema which we can then use to reorder the dataframe etc.

ldacey avatar Aug 15 '24 20:08 ldacey