Daft icon indicating copy to clipboard operation
Daft copied to clipboard

dataframe equality

Open universalmind303 opened this issue 1 year ago • 1 comments

Is your feature request related to a problem? Please describe.

I'd like to be able to compare if dataframes are equal to one another.

import daft
import numpy as np
arr = np.arange(100)
df1 = daft.from_pydict({"a": arr})
df2 = daft.from_pydict({"a": arr, })

assert df1 == df2
# AssertionError

Describe the solution you'd like I think there's a few things to consider here. Since dataframes can either be loaded/unloaded we'd probably have to have some logic to check a few things before checking the actual values.

  1. Are they both loaded/unloaded
  2. Are the schemas equal
  3. Is any other metadata different?
  4. are the counts the same
  5. finally start comparing values.

I think using the __eq__ method is fine, but a .equals method would allow for more configuration such as null handling

df1.equals(df2)
df1.equals(df2, null_eq=True)

Describe alternatives you've considered manually compare dataframes.

universalmind303 avatar Jul 31 '24 22:07 universalmind303

Any thoughts also on partitioning? They could contain the same data (and same order) globally, but partitioning might differ.

I feel like perhaps the safest option might just be to compare the logical plans...

jaychia avatar Jul 31 '24 23:07 jaychia