dataframe icon indicating copy to clipboard operation
dataframe copied to clipboard

Comparing two data frame

Open brindasanth opened this issue 1 year ago • 2 comments

I have two data frames having same same schema, Is there way to compare the two data frames ? so that it provide the added , deleted and modified rows. It may take some single/group of Key columns and Ignore columns.

brindasanth avatar Apr 15 '24 09:04 brindasanth

Hi! We don't have such functionality at the moment, but it might be a handy addition.

Tracking additions, deletions, and modifications, similar to how git would do it, requires a special algorithm. I suppose Myer's Differencing Algorithm could help.

I just tried this algorithm via https://github.com/andrewbailey/Difference on two dataFrames (as List<DataRow<*>>) which correctly provides the remove/move/add operations that likely occurred between the two dataframes.

We could wrap a library like that in the future to introduce this behavior to DataFrame natively, but in the meantime, you could try that library as well :)

Jolanrensen avatar Apr 15 '24 11:04 Jolanrensen

Thanks for your comments and adding in Backlog.

brindasanth avatar Apr 15 '24 14:04 brindasanth

We can consider adding a new compare method that will allow us to compare two dataframes flexibly Inspired by https://capitalone.github.io/datacompy/spark_usage.html

devcrocod avatar Jul 29 '24 12:07 devcrocod

Hey, I would like to work on this issue.

CarloMariaProietti avatar Oct 28 '25 16:10 CarloMariaProietti

@CarloMariaProietti This feature will likely not be considered for 1.0, rather 1.1. I'm curious to see how it will work out, as it's a completely new concept to DataFrame. This might also mean we will not accept your approach directly but rather adapt it here and there to our vision.

If you're okay with that, you can give it a try :)

Jolanrensen avatar Oct 29 '25 10:10 Jolanrensen

That's fine. For the moment i implemented a draft. I thought of a static function that receives two DataFrames<T> with the same schema, and returns an Iterable<DataRow<T>> that contains each row belonging to df1 and not belonging to df2 and vice-versa

CarloMariaProietti avatar Oct 29 '25 14:10 CarloMariaProietti

It looks like that:

internal fun <T> compareImpl(df1: DataFrame<T>, df2: DataFrame<T>): Iterable<DataRow<T>> {
    val result = ArrayList<DataRow<T>>()
    val rowsDf1 = df1.rows().toList()
    val rowsDf2 = df2.rows().toMutableList()
    for (row in rowsDf1) {
        var check = false
        for(comparedRow in rowsDf2) {
            if(row.equals(comparedRow) && !check) {  //assuming i have a proper equals (not yet)
                rowsDf2.remove(comparedRow) //otherwise one df1's row may match more df2's rows
                check = true
            }
        }
        //i did not find a match
        if(!check) {
            result.add(row)
        }
    }
    return result
}

CarloMariaProietti avatar Oct 29 '25 14:10 CarloMariaProietti

@CarloMariaProietti Thanks! I think an approach like that could work, but it leaves out some vital information:

  • What if there's only one value in the row that's different? It now shows up as a complete mismatch, instead of specifically where it's different, and (potentially) what the difference in value is.
  • What if rows have been moved? In your approach they don't show up at all. Difference does provide a way to see that.
  • And what about columns? (Okay, this could be split up in a separate function, as this function only accepts DFs with the same schema, but Spark reports do include it.)
  • Plus, the runtime of this is O(n^2), (expected running time is a bit better since you remove rows from rowsDf2), this is slower than Difference especially for larger datasets
  • Rows that are in df2 and not in df1 don't appear in the result (unless you run it like compare(df1, df2) + compare(df2, df1), but then all row combinations are checked twice)
    • and also, how will I know which DF the mismatched rows are from?
  • How do we want to present the results? Maybe as a DataFrame akin to the .describe() DataFrame? That could hold more information than simply a list of rows.

All sorts of things to consider :)

Jolanrensen avatar Oct 29 '25 15:10 Jolanrensen

Thank you for the explanation! I'll take a look at Difference to understand its behaviour.

CarloMariaProietti avatar Oct 29 '25 18:10 CarloMariaProietti

Hey, sorry for the long time since the last update. I have been studying and implementing Myers difference algorithm in order to compare dataframes. PR #1556 is a draft with the implementation I made.

CarloMariaProietti avatar Nov 10 '25 17:11 CarloMariaProietti

Hey, the implementation is now complete and not a draft anymore. Every detail is in #1556. I'm at full disposal for any explanation and change.

CarloMariaProietti avatar Nov 18 '25 08:11 CarloMariaProietti

@CarloMariaProietti please remind me again in a week if I forget to look at it :) Currently preparing for a talk in Amsterdam for Kotlin Dev Day, so I have not much time

Jolanrensen avatar Nov 21 '25 11:11 Jolanrensen