Add support for compare.matches() to accept an optional threshold
We have a number of use cases where a 100% row match isn't required and 90%, or another configurable value, is permissible. Could support be added for a custom threshold so that applications can fail only when that threshold isn't met?
Could you share a bit more about your use case? It sounds like you're using datacompy for programmatic testing purposes, but it's really mainly intended to be used for manual data comparison.
Yea! We would integrate this library with an application that runs in production. The goal is to be able to programmatically use the tool to determine if dataframes match.
Right now we are able to do something like this:

```python
compare = SparkSQLCompare(base_df=df1, compare_df=df2, join_col='col1')
if not compare.matches():
    raise Exception()
```
The above does a 100% row match and fails when the dataframes have even one record that's different. For some use cases, a 100% match isn't required and it's permissible for only 90% of the rows to match. We are looking to see if DataComPy could support such a scenario.
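For illustration, this is roughly the kind of check we end up hand-rolling on top of the current API (reusing `compare` and `df1` from the snippet above). `count_matching_rows()` is documented on the pandas `Compare` class; we're assuming `SparkSQLCompare` mirrors it, so treat this as a sketch rather than settled API:

```python
def matches_with_threshold(compare, base_row_count, threshold=0.90):
    """Hypothetical helper: True when at least `threshold` of the base rows
    join and match on every compared column. Not part of datacompy."""
    if base_row_count == 0:
        return False
    # count_matching_rows() counts joined rows that match on all compared columns
    # (assumed to behave on SparkSQLCompare as documented for pandas Compare).
    return compare.count_matching_rows() / base_row_count >= threshold


# df1 is the base Spark DataFrame from the snippet above.
if not matches_with_threshold(compare, df1.count(), threshold=0.90):
    raise Exception("fewer than 90% of rows matched")
```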
> it's really mainly intended to be used for manual data comparison.

Is there a reason why it shouldn't support programmatic usage? It is just a Python library. We have a number of use cases where we need to verify data from migrations or upgrades, and forcing users to verify it manually causes extreme toil.
There's nothing stopping you from doing so, but my understanding is that the original intent of datacompy is to generate explicit, human-readable output for people who want some sense of how their data differs, and less so to be used as a testing library. @fdosani thoughts?
I'm good with the programmatic execution here. It makes sense to me that people would want to automate as much as possible within some thresholds, etc.
We should refine the intent to make sure we align on what we want to do. We also need to make sure it applies to all dataframe types.
Fair enough 👍 - In terms of intent, there are 3 conditions for a match:
- The schema in both dataframes is identical
- All rows in both dataframes can be joined
- All comparable columns match exactly across all rows.
My understanding is that the first condition still applies, and the third is adjusted by a threshold. The second is a bit less clear; it's easier to just say it stays the same, but maybe that's not quite the most accurate intent. @shreya-goddu, in your use case, when you mention this threshold, do you consider rows that fail to be joined between the dataframes as part of the tolerable error? Or are you still expecting all rows to be joinable, but want to excuse some limited amount of matching failures? A rough sketch of the two interpretations is below.
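To make the question concrete, here is a hypothetical `matches_with_threshold` signature covering both readings. `all_columns_match()`, `all_rows_overlap()`, and `count_matching_rows()` are taken from the pandas `Compare` class; whether the same surface exists for every dataframe type is part of what we'd need to confirm:

```python
def matches_with_threshold(compare, total_rows, joined_rows, threshold=1.0,
                           unjoined_counts_against_threshold=True):
    """Hypothetical sketch only; this is not datacompy API."""
    # Condition 1: the schemas must be identical regardless of the threshold.
    if not compare.all_columns_match():
        return False

    # Condition 3: joined rows that match on every compared column.
    matching = compare.count_matching_rows()

    if unjoined_counts_against_threshold:
        # Interpretation A: rows that fail to join count against the tolerable
        # error, so the denominator is every row we expected to compare.
        denominator = total_rows
    else:
        # Interpretation B: condition 2 is unchanged -- every row must still
        # join -- and the threshold only excuses value-level mismatches.
        if not compare.all_rows_overlap():
            return False
        denominator = joined_rows

    return denominator > 0 and matching / denominator >= threshold
```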
Good questions. We will take this back, refine the intent further, and bring it back. Leaving the issue open in the meantime.