arrow icon indicating copy to clipboard operation
arrow copied to clipboard

[Python] `pyarrow.Table` equality comparison behavior is unexpected

Open judahrand opened this issue 1 year ago • 6 comments

Describe the bug, including details regarding any error messages, version, and platform.

>>> table = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table.equals(table)
True
>>> table_1 = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table_2 = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table_1.equals(table_2)
False

This seems... surprising to me. If the NaN comparison is the arrays always results in False then it is surprising that the first result is True.

Component(s)

Python

judahrand avatar Sep 06 '24 10:09 judahrand

take

Yimche avatar Oct 09 '24 02:10 Yimche

After some reconsideration, @judahrand, why would this be an issue? An object being equal to itself seems reasonable, no matter the value stored inside?

Yimche avatar Oct 20 '24 06:10 Yimche

An object being equal to itself seems reasonable, no matter the value stored inside?

I disagree. For an explicit .equals method I would expect the result to be the same whether the Python object is the same or if the object is a copy. This is not the current behaviour:

>>> import pyarrow as pa
>>> table_1 = pa.Table.from_pydict({"foo": [0.1]})
>>> table_2 = pa.Table.from_pydict({"foo": [0.1]})
>>> table_1.equals(table_2)
True
>>> table_1 = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table_2 = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table_1.equals(table_2)
False
>>> table_1.equals(table_1)
True

It feels really weird and unexpected to me for .equals to behave differently depending on the contents on the table. It seems like the current implementation (in pseudo code) is more like:

def equals(self, other):
    if self is other:
        return True
    return self == other

judahrand avatar Oct 20 '24 07:10 judahrand

I would expect the result to be the same whether the Python object is the same or if the object is a copy

I think I see where you're coming from, and I think you're correct as what is happening here is that (if my understanding of the equality function is correct):

    def equals(self, Table other, bint check_metadata=False):
        self._assert_cpu()
        if other is None:
            return False

        cdef:
            CTable* this_table = self.table
            CTable* other_table = other.table
            c_bool result

        with nogil:
            result = this_table.Equals(deref(other_table), check_metadata)

        return result

In terms of C, since it points to itself, and by extension the same "NaN" data, it results to being equal. The copy previously notioned wasn't a copy of the pointer, but a copy by value, hence the failed comparison, as I think C would have just had whatever data was previously there, if it wasn't just filled with random data. If we were to instead do a copy by reference (i.e. making table_2 point to the same point of memory) it passes (but note this is just restating an equals to self).

>>> import pyarrow as pa
>>> table_1 = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table_2 = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table_1.equals(table_2)
False
>>> table_2 = table_1
>>> table_1.equals(table_2)
True

I will note that when looking for similar situations I noticed some inconsistencies in how other libraries deal with such an edge case with NaNs:

>>> import numpy as np
>>> a1 = np.array([float("nan")])
>>> a2 = np.array([float("nan")])
>>> np.array_equal(a1, a2)
False
>>> np.array_equal(a1, a1)
False

and

>>> import pandas as pd
>>> table1 = pd.DataFrame([[float("nan")]])
>>> table2 = pd.DataFrame([[float("nan")]])
>>> table1.equals(table2)
True
>>> table1.equals(table1)
True

So this feels like something a pyarrow maintainer or the greater community need to decide on as the "correct" behaviour for the equality comparison.

Yimche avatar Oct 20 '24 10:10 Yimche

Perhaps @kou if you could spare the time to explain potential thoughts on the matter?

Yimche avatar Oct 20 '24 10:10 Yimche

I think that this is a bug:

diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc
index 5dc5e4c1a9..44eab9902d 100644
--- a/cpp/src/arrow/table.cc
+++ b/cpp/src/arrow/table.cc
@@ -534,9 +534,6 @@ Result<std::shared_ptr<Table>> PromoteTableToSchema(const std::shared_ptr<Table>
 }
 
 bool Table::Equals(const Table& other, bool check_metadata) const {
-  if (this == &other) {
-    return true;
-  }
   if (!schema_->Equals(*other.schema(), check_metadata)) {
     return false;
   }

kou avatar Oct 20 '24 12:10 kou