Use eqNullSafe instead of collect
Since Spark 2.3, PySpark has provided the Column.eqNullSafe function. This seems like a much better way to compare columns, and it can also be used to compare DataFrames.
Advantages:
- It comes from the main library, so there is no need to adjust Chispa if the library later decides to change the way DataFrames interact with collect
- It solves the NaN and null comparison problem
For DataFrames this would mean some sort of loop over the columns and then a reduce to check that all members of the resulting column are true (see the sketch below). I think the change is worth it for the two reasons given above; it would certainly be much clearer, and the main maintenance burden would fall on PySpark itself.
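Something along these lines, as a minimal sketch: it assumes two DataFrames with the same schema that can be aligned on a shared key, and the function and variable names here are hypothetical, not part of Chispa's API.

```python
from functools import reduce

from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def assert_df_equality_null_safe(df1: DataFrame, df2: DataFrame, join_key: str) -> None:
    """Compare two DataFrames column by column using eqNullSafe."""
    # Prefix the columns so the two sides stay distinguishable after the join.
    left = df1.select([F.col(c).alias(f"l_{c}") for c in df1.columns])
    right = df2.select([F.col(c).alias(f"r_{c}") for c in df2.columns])
    joined = left.join(right, left[f"l_{join_key}"] == right[f"r_{join_key}"])

    # eqNullSafe treats null == null as True, unlike a plain equality check.
    per_column_equal = [
        F.col(f"l_{c}").eqNullSafe(F.col(f"r_{c}")) for c in df1.columns
    ]

    # Reduce the per-column conditions into a single "all columns equal" condition.
    all_equal = reduce(lambda acc, cond: acc & cond, per_column_equal)

    # If any row fails the combined condition, the DataFrames differ.
    mismatches = joined.filter(~all_equal).count()
    assert mismatches == 0, f"{mismatches} rows differ between the two DataFrames"
```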
@rragundez - thanks for creating this issue.
I can see how eqNullSafe could be useful, especially for large column comparison operations. You could do something like df.withColumn("are_cols_equal", col1.eqNullSafe(col2)) and then run a filtering operation to make sure are_cols_equal is always true, as in the sketch below. I did something similar in spark-fast-tests but don't really use that implementation of the method. We should do some more benchmarking to see whether this approach is faster.
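A rough illustration of that single-column comparison; the DataFrame and column names are hypothetical.

```python
from pyspark.sql import functions as F

# Flag each row with whether the two columns match, treating null == null as True.
df_flagged = df.withColumn(
    "are_cols_equal", F.col("col1").eqNullSafe(F.col("col2"))
)

# If any row has are_cols_equal == False, the columns are not equal.
unequal_count = df_flagged.filter(~F.col("are_cols_equal")).count()
assert unequal_count == 0, f"{unequal_count} rows have unequal values"
```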
Is this what you're suggesting?