Add support for compare.matches() to accept an optional threshold
We have a number of use cases where a 100% row match isn't required and 90%, or another configurable value, is permissible. Could support be added for a custom threshold so that applications can fail only when that threshold isn't met?
Could you share a bit more about your use case? It sounds like you're using datacompy for programmatic testing purposes, but it's really mainly intended to be used for manual data comparison.
Yea! We would integrate this library with an application that runs in production. The goal is to be able to programmatically use the tool to determine if dataframes match.
Right now we are able to do something like this:

```python
compare = SparkSQLCompare(base_df=df1, compare_df=df2, join_col='col1')
if not compare.matches():
    raise Exception()
```
The above does a 100% row match and fails when the dataframes have even one record that's different. For some use cases, a 100% match isn't required and it's permissible for only 90% of the rows to match. We are looking to see if DataComPy could support such a scenario.
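For illustration, this is roughly the kind of check we end up hand-rolling on top of the current API (reusing `compare` and `df1` from the snippet above). `count_matching_rows()` is documented on the pandas `Compare` class; we're assuming `SparkSQLCompare` mirrors it, so treat this as a sketch rather than settled API:

```python
def matches_with_threshold(compare, base_row_count, threshold=0.90):
    """Hypothetical helper: True when at least `threshold` of the base rows
    join and match on every compared column. Not part of datacompy."""
    if base_row_count == 0:
        return False
    # count_matching_rows() counts joined rows that match on all compared columns
    # (assumed to behave on SparkSQLCompare as documented for pandas Compare).
    return compare.count_matching_rows() / base_row_count >= threshold


# df1 is the base Spark DataFrame from the snippet above.
if not matches_with_threshold(compare, df1.count(), threshold=0.90):
    raise Exception("fewer than 90% of rows matched")
```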
> it's really mainly intended to be used for manual data comparison.

Is there a reason why it shouldn't support programmatic usage? It is just a Python library. We have a number of use cases where we need to verify data from migrations or upgrades, and forcing users to verify it manually causes extreme toil.
There's nothing stopping you from doing so, but my understanding is that the original intent of datacompy is to generate explicit, human-readable output for people who want some sense of how their data differs, and less so to be used as a testing library. @fdosani thoughts?
I'm good with the programmatic execution here. It makes sense to me that people would want to automate as much as possible within some thresholds, etc.
We should refine the intent to make sure we align on what we want to do. We also need to make sure it applies to all dataframe types.
Fair enough 👍 - In terms of intent, there are 3 conditions for a match:
- The schema in both dataframes is identical
- All rows in both dataframes can be joined
- All comparable columns match exactly across all rows.
My understanding is that the first condition still applies, and the third is adjusted by a threshold. The second is a bit less clear; it's easier to just say it stays the same, but maybe that's not quite the most accurate intent. @shreya-goddu, in your use case, when you mention this threshold, do you consider rows that fail to be joined between the dataframes as part of the tolerable error? Or are you still expecting all rows to be joinable, but want to excuse some limited amount of matching failures? A rough sketch of the two interpretations is below.
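To make the question concrete, here is a hypothetical `matches_with_threshold` signature covering both readings. `all_columns_match()`, `all_rows_overlap()`, and `count_matching_rows()` are taken from the pandas `Compare` class; whether the same surface exists for every dataframe type is part of what we'd need to confirm:

```python
def matches_with_threshold(compare, total_rows, joined_rows, threshold=1.0,
                           unjoined_counts_against_threshold=True):
    """Hypothetical sketch only; this is not datacompy API."""
    # Condition 1: the schemas must be identical regardless of the threshold.
    if not compare.all_columns_match():
        return False

    # Condition 3: joined rows that match on every compared column.
    matching = compare.count_matching_rows()

    if unjoined_counts_against_threshold:
        # Interpretation A: rows that fail to join count against the tolerable
        # error, so the denominator is every row we expected to compare.
        denominator = total_rows
    else:
        # Interpretation B: condition 2 is unchanged -- every row must still
        # join -- and the threshold only excuses value-level mismatches.
        if not compare.all_rows_overlap():
            return False
        denominator = joined_rows

    return denominator > 0 and matching / denominator >= threshold
```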
Good questions. We will take this back, refine the intent further, and bring it back. Leaving the issue open in the meantime.