
Major Refactor v1.0.0: Redesign compare logic into modular "Comparators"

fdosani opened this issue 9 months ago · 11 comments

Had a look into the implementation -- the actual column comparison code (def columns_equal) seems rather inflexible/specifically built for use cases at Capital One. Here are two ideas for how to deal with the NumPy array issue:

A) Add new fixed logic for NumPy arrays: try to detect NumPy array columns by looking at the actual series values. Use .all() for NumPy arrays.

B) Add a new system for custom declaration of "comparators", i.e. give more flexibility to the user to configure how columns are compared. We would ship a default configuration that mimics the current behavior, and users would be free to change the configuration to their liking. This could be as simple as giving a list of comparators that are tried in order until one of them "understands" the data, i.e. the user could pass something like:


columns_equal(..., comparators=[
    FloatComparator(rtol=1e-3),
    StringComparator(case_sensitive=False),
    ArrayComparator(aggregate="all"),  # calls .all()
])

Or it could be an explicit list of comparators for each column, or something similar.

Originally posted by @jonashaag in #58
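
For illustration, a minimal sketch of how such a comparator chain could be resolved. The handles/compare method names here are hypothetical, not part of the current datacompy API:

    def columns_equal(col_1, col_2, comparators):
        # Try each comparator in the user-supplied order until one
        # reports that it understands this pair of columns.
        for comparator in comparators:
            if comparator.handles(col_1, col_2):  # hypothetical capability check
                return comparator.compare(col_1, col_2)
        raise TypeError("no comparator understands these columns")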

This is an old issue from a while back, but I want to revisit it and get some thoughts from folks on whether it's something we should look into and/or pursue. The idea of compartmentalizing "comparators" feels cleaner from a design perspective. It could also allow people to build out their own custom comparators and tweak them to their liking.

Tagging the @capitalone/datacompy-write-team for their thoughts and opinions on this.

  • The initial set could use dispatching to house the logic for all supported data types within, say, a FloatComparator.
  • Some builtins could be:
    • Numeric
    • String
    • Date/String
    • Temporal

fdosani · Mar 28 '25

I think this is a really cool idea, and for the benefits it provides I think it's actually a pretty low-effort change. A couple of questions:

  • I'm assuming we'll need to support different sets of comparators for different dataframe mediums? I.e. Spark/Snowflake will probably require a different set of comparators than polars/pandas. Do we have any idea of how we want to go about that (i.e. do we write comparators directly per library, or do we want some type of interface that shares the same underlying compare logic for similar datatypes across libraries)?
  • Where will the default comparators live in datacompy? I'm assuming we'd define them outside the compare files themselves, since some of these comparators will probably be shared across a few different compare mediums?

rhaffar · Mar 31 '25


@rhaffar Thanks for looking into this. My thoughts:

  1. I was thinking maybe we could use dispatching here. So, for example, you would have a FloatComparator and then, depending on the dataframe being passed in, run different logic (a rough sketch follows below)? Thoughts on this?

  2. The common ones could just live in datacompy, and if people want to make custom ones we can provide an ingestion mechanism and a framework for extending them? Maybe an ABC to follow for consistency?
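
A rough sketch of the dispatching idea in item 1, using functools.singledispatch to pick the implementation by series type. The function name and the comparison logic are illustrative, not the existing datacompy code:

    from functools import singledispatch

    import numpy as np
    import pandas as pd
    import polars as pl

    @singledispatch
    def float_compare(col_1, col_2, rel_tol=0.0):
        raise NotImplementedError(f"unsupported series type: {type(col_1)}")

    @float_compare.register
    def _(col_1: pd.Series, col_2: pd.Series, rel_tol=0.0):
        # pandas/NumPy path: elementwise closeness within a relative tolerance
        return pd.Series(np.isclose(col_1, col_2, rtol=rel_tol))

    @float_compare.register
    def _(col_1: pl.Series, col_2: pl.Series, rel_tol=0.0):
        # polars path: the same check expressed with polars operations
        return (col_1 - col_2).abs() <= rel_tol * col_2.abs()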

fdosani · Apr 09 '25

I like the idea of modularity as well as dispatching. Maybe we could have an ABC class with a general compare method that does the actual comparison, which is also where the dispatching is done, i.e. FloatComparator.compare would perform the comparison based on the datatype(s) it supports.
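
For concreteness, a rough sketch of that shape with illustrative names; the compare body is left abstract since that is where the per-type dispatch would live:

    from abc import ABC, abstractmethod

    class BaseComparator(ABC):
        @abstractmethod
        def compare(self, col_1, col_2):
            """Return a boolean series marking where the two columns match."""

    class FloatComparator(BaseComparator):
        def __init__(self, rel_tol=0.0, abs_tol=0.0):
            self.rel_tol = rel_tol
            self.abs_tol = abs_tol

        def compare(self, col_1, col_2):
            # Dispatch here on the concrete series type (pandas, polars, ...)
            # and run the datatype-specific comparison logic.
            ...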

gladysteh99 · Apr 10 '25


@gladysteh99 Agreed. We could also make the reporting part a separate entity, in case people want special or customized reporting? I guess part of this will be how we can minimize the disruption to existing processes and not alter the "contract" too much.

fdosani · Apr 10 '25

@fdosani @gladysteh99

Agreed, I think that's a good setup overall: an ABC for standardizing what comparators look like, and dispatching for handling similar-but-different types across different libraries. Then in the columns_equal calls in each compare class, we can just use the same conditionals we've been using to call the right comparator (or maybe use dispatching there too, to directly call the right comparator)?

For the reporting piece, I could see that being a nice feature too, though it would take more effort to do correctly. Users have pretty much everything they need to create their own report from the results of the compare, so I imagine we'd want to figure out how to make it useful enough that users opt to use it instead of just writing their own reporting method/utility.

rhaffar · Apr 11 '25

@fdosani @rhaffar

Yeah, I second the idea of using the current conditionals to minimize disruption for the default comparators. Otherwise, if users supply their own, we can probably loop through each of them in the order provided? I imagine giving users the flexibility to sequence the comparators could be helpful for power users.

I think reporting could be helpful too, though I wouldn't worry too much about the usefulness of a default reporting function, since one is already provided.

gladysteh99 · Apr 14 '25

@rhaffar @ak-gupta @gladysteh99 I have two implementations here we can discuss: https://github.com/capitalone/datacompy/compare/v1-refactor

Single-dispatch version: https://github.com/capitalone/datacompy/blob/v1-refactor/datacompy/comparator/string.py
Individual class-based version: https://github.com/capitalone/datacompy/blob/v1-refactor/datacompy/comparator/numeric.py

fdosani · Jun 04 '25

Thanks for putting these together, Faisal!

Initial impression: I have a personal preference for the class-based solution here. I think ultimately whatever we choose should be intuitively overridable for users, and as I'm seeing the dispatching solution now, I think it might be awkward to override comparisons along groupings of datatypes. Intuitively, someone bringing their own custom functionality is generally operating on one of the libraries DataComPy supports, not across them. Given that, I think overriding or implementing something like Polars_Compare.NumericComparator or PolarsNumericComparator makes more sense than NumericComparator.Polars_Compare, though I think all this is more semantic than anything else.

One separate thought for the class-based solution: maybe these don't need to be classes? It looks like we're mainly using the class structure to store tolerance as an attribute, but I think that might be fine just being set as a function argument. Let me know your thoughts on this?

One thing that isn't obvious to me in these solutions is how we can compare across datatypes, which is very difficult without multiple dispatch in the dispatch solution and doesn't fit easily into the class-based solution. Wondering if it makes more sense to drop that functionality to help keep the implementation simpler?

rhaffar · Jun 09 '25

I'm personally leaning towards the class implementation. To me it seems a bit cleaner and more compartmentalized.

fdosani · Jun 11 '25

I prefer the dispatch implementation for the built-ins we provide, mainly for the maintainability of each datatype (string, numeric, etc.), since it enforces the same signature/arguments across the libraries. The current class implementation in numeric.py gives a little too much room for divergence across libraries as we develop/maintain the module (e.g. adding or removing base arguments such as rel_tol in PolarsNumericCompare in the future while the same arguments remain in the other NumericCompare classes).

In terms of user flexibility/overriding classes, if we go with the dispatch method, users could easily just build a subclass of BaseComparator. Otherwise, a balance between the two versions could be a class-based design where each datatype has a DtypeComparator base class (e.g. NumericComparator) and each library comparator is a child class of it (e.g. PolarsNumericComparator(NumericComparator)). This ensures that each library-specific class can still take in the default arguments for its datatype; see the sketch below.
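
A rough sketch of that hybrid layout, with illustrative names and a plausible (not actual) tolerance check:

    from abc import ABC, abstractmethod

    class BaseComparator(ABC):
        @abstractmethod
        def compare(self, col_1, col_2): ...

    class NumericComparator(BaseComparator):
        # Datatype-level base: owns the shared default arguments for numerics.
        def __init__(self, rel_tol=0.0, abs_tol=0.0):
            self.rel_tol = rel_tol
            self.abs_tol = abs_tol

    class PolarsNumericComparator(NumericComparator):
        # Library-level child: implements compare() with polars operations.
        def compare(self, col_1, col_2):
            # |a - b| <= abs_tol + rel_tol * |b|, elementwise
            return (col_1 - col_2).abs() <= (self.abs_tol + self.rel_tol * col_2.abs())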

gladysteh99 · Jun 11 '25

Some other things we've been thinking about and discussing:

  • Dropping fugue support and, by extension, dask, duckdb, and ray. We're not seeing a lot of usage or support for some of these.
  • Adding native support for duckdb, or maybe a more generic pyarrow compare.
  • Making spark and snowflake non-optional dependencies.

fdosani · Jun 20 '25