dataframe equality
Is your feature request related to a problem? Please describe.
I'd like to be able to check whether two dataframes are equal.
import daft
import numpy as np
arr = np.arange(100)
df1 = daft.from_pydict({"a": arr})
df2 = daft.from_pydict({"a": arr})
assert df1 == df2
# AssertionError
Describe the solution you'd like
I think there are a few things to consider here. Since dataframes can be either loaded or unloaded, we'd probably need some logic to check a few things before comparing the actual values.
- Are they both loaded, or both unloaded?
- Are the schemas equal?
- Is any other metadata different?
- Are the row counts the same?
- Finally, compare the values.
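The checks above can be ordered cheapest first, so that a mismatch is caught before any values are compared. Here is a minimal sketch, assuming each dataframe has already been materialized into a plain dict mapping column names to lists of values. The helper name `staged_equal` is hypothetical and not part of Daft. It only compares column names, not column types, and it does not cover loaded/unloaded state, other metadata, or partitioning.

```python
def staged_equal(left: dict[str, list], right: dict[str, list]) -> bool:
    """Compare two column dicts, running the cheapest checks first.

    Hypothetical helper for illustration; not a Daft API.
    """
    # 1. Schema (names only in this sketch): same columns in the same order.
    if list(left) != list(right):
        return False
    # 2. Row counts: each column must have the same length on both sides.
    if any(len(left[name]) != len(right[name]) for name in left):
        return False
    # 3. Values: only now pay for the element-wise comparison.
    return all(left[name] == right[name] for name in left)
```

A real implementation would probably also compare column types in step 1, and could run steps 1 and 2 without loading the data.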
I think using the __eq__ method is fine, but a .equals method would allow for more configuration, such as null handling:
df1.equals(df2)
df1.equals(df2, null_eq=True)
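To make the intended null handling concrete, here is a rough sketch of what a null_eq flag could mean for a single column of values. The function name `values_equal` is hypothetical, and the default of `null_eq=False` is an assumption (the request does not say which default it wants). With that default, a null never matches anything, including another null, similar to SQL comparison semantics.

```python
def values_equal(left: list, right: list, null_eq: bool = False) -> bool:
    """Element-wise comparison of two columns with configurable null handling.

    Hypothetical helper for illustration; not a Daft API.
    """
    if len(left) != len(right):
        return False
    for a, b in zip(left, right):
        if a is None or b is None:
            # A null only matches another null, and only when null_eq is set.
            if not (null_eq and a is None and b is None):
                return False
        elif a != b:
            return False
    return True
```

With this sketch, `values_equal([1, None], [1, None])` is false, while the same call with `null_eq=True` is true.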
Describe alternatives you've considered
Manually compare dataframes.
Any thoughts on partitioning as well? Two dataframes could contain the same data in the same order globally while being partitioned differently.
I feel like perhaps the safest option might just be to compare the logical plans...