pandera
Add CompareColumns subclass of Check for dataframe-level multi-column checks
The CompareColumns subclasses of Check are designed to work nicely with dataframe-level checks #14.
CompareColumns class
This class enables built-in comparisons of two columns.
The proposed API for this would be something like:
DataFrameSchema(
    columns={...},
    checks=[
        Compare("col1").greater_than("col2"),
        Compare("col2").less_than_equal("col3"),
    ]
)
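To make the proposal concrete, here is a minimal sketch of how such a helper might be built. It is not part of pandera: the `Compare` class and its `_build` helper are hypothetical, and only the two method names come from the proposal above. Each method returns a plain function that takes an object indexable by column name (such as a pandas DataFrame) and compares two of its columns.

```python
import operator


class Compare:
    """Hypothetical helper sketching the proposed API; not a pandera class."""

    def __init__(self, left):
        # Name of the column on the left-hand side of the comparison.
        self.left = left

    def _build(self, op, right):
        left = self.left

        def predicate(df):
            # With a pandas DataFrame this yields a boolean Series,
            # one value per row.
            return op(df[left], df[right])

        # Give the function a readable name, e.g. "col1_gt_col2".
        predicate.__name__ = f"{left}_{op.__name__}_{right}"
        return predicate

    def greater_than(self, right):
        return self._build(operator.gt, right)

    def less_than_equal(self, right):
        return self._build(operator.le, right)
```

Under this sketch, each returned function would still need to be wrapped in a Check before going into the `checks` list, for example `Check(Compare("col1").greater_than("col2"))`. The proposal itself has `Compare(...)` produce the check directly.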
This will be an experimental API.
bumping this issue in case someone in the community wants to take it on
Hey @cosmicBboy, this feature is still open right? Is there an alternative approach which can be used for now?
Hi @Nandha95 this feature is still in consideration, but work on it hasn't been prioritized. You can get the same effect by using wide checks: https://pandera.readthedocs.io/en/stable/checks.html#wide-checks
@cosmicBboy, I did try wide column checks, where I want to validate that two columns are equal. However, what seems to be happening is that all the columns are being checked against the column which I want to check.
I am setting up the check using the following logic
Check(lambda df: df['account_id'] == df['vendor_account_id']),
And the errors look like the following, a sample of the total number of columns I have.
This behavior is actually expected: the framework has no way of inferring which columns are involved in the wide check, so it reports a failure for the entire row. Because of how the failure_cases report is structured, this means one record per column per failed check instance.
You will probably have to keep track of which column is involved in which wide check, then manually interpret the failure_cases by removing irrelevant data.
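The manual interpretation described above could look roughly like the sketch below. It assumes the failure_cases dataframe has a `column` field naming the column each record belongs to. The helper name `keep_relevant_failures` is made up for illustration, and the list of involved columns is bookkeeping you maintain yourself.

```python
import pandas as pd


def keep_relevant_failures(failure_cases: pd.DataFrame, involved_columns) -> pd.DataFrame:
    """Keep only the failure records for columns the wide check compares.

    Assumes `failure_cases` has a "column" field, as described above.
    """
    mask = failure_cases["column"].isin(list(involved_columns))
    return failure_cases[mask]
```

For the check in this thread, the call would be `keep_relevant_failures(failure_cases, ["account_id", "vendor_account_id"])`. This filters on column name only, so any column-level failures reported for those two columns would be kept as well.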