Possibly set row comparison to true by default for DataFrame comparisons

Open MrPowers opened this issue 4 years ago • 1 comment

Maybe I'm the outlier, but I consider the more intuitive check -- especially for testing purposes -- to ignore order. If some function produces a DataFrame that I want to check, I care about the contents. And by default, Spark offers no guarantees on row order unless your plan has an explicit .orderBy(). So relying on the stability of row order in the absence of an explicit order by clause is a recipe for surprises, much like it is in SQL.

In fact, I don't think .collect() even provides any guarantees that the row order of the resulting array will match the row order of the original DataFrame, again, unless the DataFrame has an explicit ordering specified. It's theoretically possible, for example, that you could call spark.range(3).collect() twice and get different row orders each time. So if you're relying on .collect() to preserve order without explicit ordering on the original DataFrames, then I would say that's technically incorrect.
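To make the distinction concrete, here is a minimal sketch of the two kinds of check, using plain tuples as stand-ins for collected rows so that it runs without a Spark session. The helper names `rows_equal_with_order` and `rows_equal_ignoring_order` are hypothetical and are not part of chispa's API; the multiset comparison is just one plausible way to ignore order, and it assumes every row is hashable.

```python
from collections import Counter

# Hypothetical helpers for illustration only; not chispa's implementation.


def rows_equal_with_order(actual, expected) -> bool:
    """Order-sensitive check: rows must match position by position."""
    return list(actual) == list(expected)


def rows_equal_ignoring_order(actual, expected) -> bool:
    """Order-insensitive check: compare the rows as multisets.

    Counter keeps duplicate counts, so two copies of a row do not
    compare equal to a single copy. Rows must be hashable.
    """
    return Counter(actual) == Counter(expected)


# Tuples standing in for the result of a .collect() call (an assumption
# made so the example does not need Spark).
expected = [(0, "a"), (1, "b"), (2, "c")]
shuffled = [(2, "c"), (0, "a"), (1, "b")]

print(rows_equal_with_order(shuffled, expected))      # same contents, different order
print(rows_equal_ignoring_order(shuffled, expected))  # same contents, order ignored
print(rows_equal_ignoring_order([(0, "a"), (0, "a")], [(0, "a")]))  # duplicate counts differ
```

In chispa itself the order-insensitive behaviour is opt-in through the `ignore_row_order` argument of `assert_df_equality`; the question here is only whether it should be on by default.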

By the way, in your own usages of this library (or the Scala equivalent), how often do you compare DataFrames where you care about the row order? I'm curious to see a few examples of that.

Originally posted by @nchammas in https://github.com/MrPowers/chispa/pull/19#discussion_r603466453

Mar 29 '21 18:03 MrPowers

Seems reasonable to me. Usually when comparing two DataFrames, unless my transformation performs a sort, I never expect the row order to be compared.

Apr 19 '24 23:04 zeotuan