Add CompareColumns subclass of Check for dataframe-level multi-column checks

Open cosmicBboy opened this issue 5 years ago • 1 comments

The CompareColumns subclass of Check is designed to work nicely with dataframe-level checks (#14).

CompareColumns class

This class enables built-in comparisons of two columns.

The proposed API for this would be something like:

DataFrameSchema(
    columns={...},
    checks=[
        Compare("col1").greater_than("col2"),
        Compare("col2").less_than_equal("col3"),
    ]
)

This will be an experimental API.

cosmicBboy avatar Jun 12 '19 04:06 cosmicBboy

bumping this issue in case someone in the community wants to take it on

cosmicBboy avatar Jan 07 '22 04:01 cosmicBboy

Hey @cosmicBboy, this feature is still open right? Is there an alternative approach which can be used for now?

Nandha95 avatar Jul 18 '23 11:07 Nandha95

Hi @Nandha95 this feature is still in consideration, but work on it hasn't been prioritized. You can get the same effect by using wide checks: https://pandera.readthedocs.io/en/stable/checks.html#wide-checks

cosmicBboy avatar Jul 18 '23 14:07 cosmicBboy

@cosmicBboy, I did try wide column checks, where I want to validate that two columns are equal. However, what seems to be happening is that all the columns are being checked against the column which I want to check.

I am setting up the check using the following logic: Check(lambda df: df['account_id'] == df['vendor_account_id']). The errors look like the following, shown for a sample of the total number of columns I have: (image)

Nandha95 avatar Jul 20 '23 10:07 Nandha95

This behavior is actually expected: the framework has no way of inferring which columns are involved in the wide check, so it reports a failure for the entire row. Because of how the failure_cases report is structured, this means one record per column per failed check instance.

You will probably have to keep track of which column is involved in which wide check, then manually interpret the failure_cases by removing irrelevant data.

splatpope avatar Aug 09 '23 13:08 splatpope