Align the APIs between Compare and SparkCompare
Not sure if it makes sense to go all the way to subclassing or ABCs, but the APIs of `Compare` and `SparkCompare` are quite different. I think they could be aligned somewhat before adding any new functionality.
Have been poking around with some preliminary stuff here in the `combine` branch. To break this down a little bit:
- [x] Change names on `SparkCompare` to `df1`/`df2` instead of `base`/`compare`
- [x] Move non-compare functions out of `core.py` into `utils.py`
- [x] Move `core.py` to use the default stdout print, with one report function that calls a bunch of others (text into templates too)
- [ ] Add in tests on the report rendering for `Compare` - right now it just checks that it renders at all
- [x] Convert the `SparkCompare` report to follow the same pattern as `Compare` - mostly renaming, but may have to pull the merge step out to match with `Compare`
- [x] Make the `SparkCompare` report print out the dataset names
- [ ] Add in column translations for `Compare`?
- [x] Add in sampling for the `SparkCompare` report
- [x] Redefine `SparkCompare` as `class SparkCompare(Compare)`
- [ ] (maybe) Align testing across Pandas and Spark? Like have tests that should generate the same output in both?
- [ ] Print out unique columns in both, with dtypes
- [ ] Print count of dupes in both
- [ ] Add in a `show_all_columns` option for `Compare`?
- [ ] Remove `match_column` from `column_stats`
@fdosani and @theianrobertson, this was a long-standing bugaboo of mine, and I think I'm going to put some time into maybe finishing this off. Since there are some judgment calls in this sort of alignment, I wanted to check to see if there are any changes that have been made so far or are planned that are fundamental here, vs. what's still open for discussion.
I'll weigh in with further thoughts once I look at what's happened so far and what's yet to be done, but wanted to orient myself to what you've already thought of here.
I just went through a lengthy conflict resolution to make `combine` up to date with `master`, so I got a little taste of the changes... my main question off the bat is whether or not we should use ABCs here. I tend to think we should, and here's why:
`Compare` is a pretty large beast. It defines many internal and external methods and properties - it looks like there may be 50 of them. Simply understanding what may be in one class but not the other will likely require a spreadsheet and manually ticking off attributes in both classes.
As an example of how that's a problem: if I want to introduce another subclass of `Compare`, there isn't a good template or trait to help me understand what I need to implement. I basically have to go through every line of code in the parent class and see what the public attributes are to make sure I implement those. Furthermore, I suspect there is little actual inheritance from `Compare` into `SparkCompare` - I imagine all of the functionality is implemented uniquely across both classes. And having `SparkCompare` inherit from `Compare` raises the risk of forgetting to implement some property or method on the Spark side and accidentally ending up with the pandas version, which won't work.
For these reasons, I think it would be clearer if there was some kind of abstract base class that defined the interface for both pandas and Spark sides that both inherit from. I'm not picky if this is an ABC or just some empty parent class. But I think that would give a concise definition for what the API needs to be for a Comparison class between two dataframes. Let me know if you agree that would be a positive change, and I can think about how best to retrofit that into the work already done.
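For illustration, here is a minimal sketch of what that base class could look like (the class and method names are hypothetical placeholders, not the existing datacompy attributes):

```python
from abc import ABC, abstractmethod


class BaseCompare(ABC):
    """Hypothetical shared interface that both Compare and SparkCompare would implement."""

    @abstractmethod
    def intersect_columns(self):
        """Columns present in both df1 and df2."""

    @abstractmethod
    def all_mismatch(self):
        """Rows where df1 and df2 disagree."""

    @abstractmethod
    def report(self, sample_count=10):
        """Render the full comparison report as a string."""
```

A subclass that forgets to override any of these fails with a `TypeError` at instantiation, which would catch the "accidentally inheriting the pandas version" problem early.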
Hey Dan!
I got distracted and haven't done much on this in longer than I'd care to admit. My efforts so far were to use `Compare` as the base class, and try to have most of the complex logic call out to specific functions that could be implemented differently in Pandas/Spark, i.e. a function like `df_row_count('df1')` would have to be implemented differently in `Compare` and `SparkCompare`, but could then be called in the actual comparison/reporting functions.
e.g. the actual `report` function would just call out to hidden methods for the different parts of the report, i.e.:
```python
def report(self, sample_count=10, file=sys.stdout):
    self._pre_report()
    self._report_header(file)
    self._report_column_summary(file)
    ...
```
Then some of those functions, e.g. `_report_header`, would have to be re-implemented for Spark (or not, if they're generic enough).
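A rough sketch of how that split could look (the class and helper names below are made up for illustration, not the current datacompy internals): the shared report flow lives in a base class, and each backend overrides only its own hooks.

```python
import sys


class BaseCompare:
    """Sketch only: shared report flow; backends override the hooks they need."""

    def report(self, sample_count=10, file=sys.stdout):
        self._pre_report()
        self._report_header(file)
        self._report_column_summary(file)

    def _pre_report(self):
        """Hook for any setup needed before rendering; no-op by default."""

    def _report_header(self, file):
        print("DataComPy Comparison", file=file)

    def _report_column_summary(self, file):
        raise NotImplementedError("each backend summarizes its own columns")


class PandasCompareSketch(BaseCompare):
    def _report_column_summary(self, file):
        print("Columns in common: <computed with pandas>", file=file)


class SparkCompareSketch(BaseCompare):
    def _pre_report(self):
        # Spark-specific setup, e.g. caching a joined DataFrame, would go here
        pass

    def _report_column_summary(self, file):
        print("Columns in common: <computed with Spark>", file=file)
```

`PandasCompareSketch().report()` and `SparkCompareSketch().report()` then share the header and overall flow, differing only in the overridden pieces.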
Taking a step back, I think I agree with you that a base class that's implemented twice (or more than twice if we wanted to extend?) would be a good way to go here - I was having some trouble keeping track of what functions I needed to implement in different places too :grin:. I think there's a non-zero amount of logic that could be done in one place though, i.e. things like that `report` function should live in the base class that both of the interfaces inherit from, and the subclasses just implement the unique stuff. Do you have anything you've started playing around with for an abstract base, or want to take a crack? I'd be happy to tag team this, maybe we could try and connect for a remote pair sometime?
I haven't started anything yet - I wanted to make sure it was an acceptable idea. :) I think my next step will be to orient myself around all the attributes that would be part of a base class, looking across both pandas and Spark versions, since it's been quite a while since I've looked at this code (and I'm less familiar with the pandas stuff).
How about I do that and put together a first pass at what the base API should look like (including how some of the public functions can be backed by backend-specific implementations, like you mentioned), and then we can go through it together to make sure we're aligned? Once we have the interface solid, implementing it should hopefully be straightforward.
I think having a base class with as much shared functionality as possible, defining a common API, makes a ton of sense. Happy to contribute to this as well if you guys want a hand.
Alright, I just went through the inventory and between the two classes we have 104 different class parameters, methods, attributes, and properties to align. :) Very little is actually the same right now, which makes me think this issue is quite valuable, since they should be doing the same stuff and a lot of this is just arbitrarily different.
I have it in a Google Sheet right now - let me figure out how to share it since I won't be able to make something public from within Capital One's GSuite and I can't access my personal Drive from work.
My next step will be to look through what @theianrobertson has done so far in the `combine` branch in terms of alignment, form my own opinions, and create a first draft of my recommendations for what a unified API should look like. We can then have a discussion (probably live) about how to come to a consensus and what changes that will drive.
Sounds good, thanks Dan. This is definitely a big lift so let's figure out a way to split up the work.
I've been trying to figure out how to share the API comparison spreadsheet I created in a way that won't get blocked by our work policies... I think I found out an option with my personal Google Drive: https://drive.google.com/file/d/0B0wINV0NXt63VkFtcFJydi1iN1NxejlfQU56bXNLT2Q0TlpJ/view?usp=sharing
That's open to the internet for commenting but you'll have to request access to edit.
I've been unexpectedly slowed down with actual work the past week, but will continue giving my opinion on how to align these 100+ attributes and methods. Feel free to add a column or comments with your own thoughts too.
While this is old, I think it is still relevant and something that needs to be done - something we should try to do by the end of 2021, bandwidth permitting.
@elzzhu @ak-gupta This has been a long-pending issue. I'm wondering if koalas might be the right option here to clean up and consolidate the Spark + pandas APIs? Would love some thoughts or ideas on this.
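To make the koalas idea concrete (this is just an illustration of the general approach, not anything in datacompy today): with koalas, now shipped as `pyspark.pandas` in Spark 3.2+, the same pandas-style code can run against either backend.

```python
import pandas as pd
import pyspark.pandas as ps  # koalas, bundled with Spark 3.2+ as pyspark.pandas


def intersect_columns(df1, df2):
    """Works the same whether df1/df2 are pandas or pandas-on-Spark DataFrames."""
    return set(df1.columns) & set(df2.columns)


pdf1 = pd.DataFrame({"id": [1, 2], "a": [1.0, 2.0]})
pdf2 = pd.DataFrame({"id": [1, 2], "b": [3.0, 4.0]})

print(intersect_columns(pdf1, pdf2))                                   # plain pandas
print(intersect_columns(ps.from_pandas(pdf1), ps.from_pandas(pdf2)))   # Spark, identical call
```

In principle, a single comparison implementation written against that API could serve both backends.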
Closing in favour of #197