Comparing two data frame
I have two data frames having same same schema, Is there way to compare the two data frames ? so that it provide the added , deleted and modified rows. It may take some single/group of Key columns and Ignore columns.
Hi! We don't have such functionality at the moment, but it might be a handy addition.
Tracking additions, deletions, and modifications, similar to how git would do it, requires a special algorithm. I suppose Myer's Differencing Algorithm could help.
I just tried this algorithm via https://github.com/andrewbailey/Difference on two dataFrames (as List<DataRow<*>>) which correctly provides the remove/move/add operations that likely occurred between the two dataframes.
We could wrap a library like that in the future to introduce this behavior to DataFrame natively, but in the meantime, you could try that library as well :)
Thanks for your comments and adding in Backlog.
We can consider adding a new compare method that will allow us to compare two dataframes flexibly
Inspired by https://capitalone.github.io/datacompy/spark_usage.html
Hey, I would like to work on this issue.
@CarloMariaProietti This feature will likely not be considered for 1.0, rather 1.1. I'm curious to see how it will work out, as it's a completely new concept to DataFrame. This might also mean we will not accept your approach directly but rather adapt it here and there to our vision.
If you're okay with that, you can give it a try :)
That's fine. For the moment i implemented a draft. I thought of a static function that receives two DataFrames<T> with the same schema, and returns an Iterable<DataRow<T>> that contains each row belonging to df1 and not belonging to df2 and vice-versa
It looks like that:
internal fun <T> compareImpl(df1: DataFrame<T>, df2: DataFrame<T>): Iterable<DataRow<T>> {
val result = ArrayList<DataRow<T>>()
val rowsDf1 = df1.rows().toList()
val rowsDf2 = df2.rows().toMutableList()
for (row in rowsDf1) {
var check = false
for(comparedRow in rowsDf2) {
if(row.equals(comparedRow) && !check) { //assuming i have a proper equals (not yet)
rowsDf2.remove(comparedRow) //otherwise one df1's row may match more df2's rows
check = true
}
}
//i did not find a match
if(!check) {
result.add(row)
}
}
return result
}
@CarloMariaProietti Thanks! I think an approach like that could work, but it leaves out some vital information:
- What if there's only one value in the row that's different? It now shows up as a complete mismatch, instead of specifically where it's different, and (potentially) what the difference in value is.
- What if rows have been moved? In your approach they don't show up at all. Difference does provide a way to see that.
- And what about columns? (Okay, this could be split up in a separate function, as this function only accepts DFs with the same schema, but Spark reports do include it.)
- Plus, the runtime of this is O(n^2), (expected running time is a bit better since you remove rows from
rowsDf2), this is slower than Difference especially for larger datasets - Rows that are in
df2and not indf1don't appear in the result (unless you run it likecompare(df1, df2) + compare(df2, df1), but then all row combinations are checked twice)- and also, how will I know which DF the mismatched rows are from?
- How do we want to present the results? Maybe as a DataFrame akin to the
.describe()DataFrame? That could hold more information than simply a list of rows.
All sorts of things to consider :)
Thank you for the explanation! I'll take a look at Difference to understand its behaviour.
Hey, sorry for the long time since the last update. I have been studying and implementing Myers difference algorithm in order to compare dataframes. PR #1556 is a draft with the implementation I made.
Hey, the implementation is now complete and not a draft anymore. Every detail is in #1556. I'm at full disposal for any explanation and change.
@CarloMariaProietti please remind me again in a week if I forget to look at it :) Currently preparing for a talk in Amsterdam for Kotlin Dev Day, so I have not much time