
Use eqNullSafe instead of collect

Open rragundez opened this issue 5 years ago • 2 comments

Since Spark 2.3, PySpark has provided the function eqNullSafe. This seems a much better way to compare columns, and it can also be used to compare dataframes.

Advantages:

  • It comes from the main library, so there is no need to adjust chispa if PySpark later changes the way dataframes interact with collect
  • It solves the NaN and null problem

For dataframes it would mean some sort of loop over the columns, followed by a reduce to check that all members of the resulting column are true. I think the change is worth it for the two reasons given above.

rragundez avatar Dec 05 '20 06:12 rragundez

Certainly this would be much clearer, and the main maintenance burden would fall on PySpark itself.

rragundez avatar Dec 05 '20 06:12 rragundez

@rragundez - thanks for creating this issue.

I can see how eqNullSafe could be useful, especially for large column-comparison operations. You could do something like df.withColumn("are_cols_equal", col1.eqNullSafe(col2)) and then run a filtering operation to make sure are_cols_equal is always true. I did something similar in spark-fast-tests, but I don't really use that implementation of the method. I should do some more benchmarking to see if this is faster.

Is this what you're suggesting?

MrPowers avatar Mar 27 '21 14:03 MrPowers