arrow ARROW-17964: [C++] Range data comparison for struct type may go out of bounds

See https://issues.apache.org/jira/browse/ARROW-17964

Oct 07 '22 19:10 rtpsw

https://issues.apache.org/jira/browse/ARROW-17964

Oct 07 '22 19:10 github-actions[bot]

We should probably add a unit test for this

Oct 07 '22 19:10 lidavidm

We should probably add a unit test for this

Agreed. I'm not sure how I'd like to test this just yet, so ideas are welcomed. At first, I thought I could somehow use AssertDatumsEqual, but it cannot be directly negated.

Oct 07 '22 21:10 rtpsw

@lidavidm, are you aware of any example test case that checks for accessing an index out of bounds?

Oct 07 '22 21:10 rtpsw

No, although looking at it, the fix is a little weird - it means we're comparing data of two different types, or the data doesn't match its type?

Regardless, it seems something like this would suffice?

auto lhs = ArrayFromJSON(..., "[{"a": 2, "b": 3}]");
auto rhs = ArrayFromJSON(..., "[{"a": 2}]");
ASSERT_FALSE(lhs->Equals(*rhs));  // assuming this segfaults without this PR

Oct 10 '22 12:10 lidavidm

Right, since the data types are supposed to match, this PR is only guarding against invalid data. It you want to make sure data is valid, you should call Validate or ValidateFull before doing anything else.

Oct 10 '22 13:10 pitrou

No, although looking at it, the fix is a little weird - it means we're comparing data of two different types, or the data doesn't match its type?

I run into the issue that led to this PR when a unit test compared data of two different types, one is the expected and another is the unexpected. My quick impression was that there seemed to be sufficiently many test cases that compare this way, without trying to compare types or to validate first. I figured it would be easier to fix the comparison in one place than to fix many test cases.

Right, since the data types are supposed to match, this PR is only guarding against invalid data. It you want to make sure data is valid, you should call Validate or ValidateFull before doing anything else.

I may be missing something, but I don't think validating would have fixed the issue, because I believe in the above case both expected and actual data were valid, since they both were obtained via JSON parsing; they were just of different types.

Regardless, it seems something like this would suffice? ... // assuming this segfaults without this PR

I didn't check your particular case yet, but I did run into a segfault in the case I described above. My vote would be to minimize segfaults as an indication of test failure, if only because such a failure would be less convenient to work with.

Oct 10 '22 13:10 rtpsw

I think it's OK to have this, I wonder why the comparison doesn't start with a type comparison though (which should avoid this class of issues)

Oct 10 '22 14:10 lidavidm

It does start with a type comparison, it's also mentioned above: https://github.com/apache/arrow/blob/d4190cc9ad15d30cb8b840f8a6df25c006d8009f/cpp/src/arrow/compare.cc#L164-L169 and you can see an example of type checking here: https://github.com/apache/arrow/blob/d4190cc9ad15d30cb8b840f8a6df25c006d8009f/cpp/src/arrow/compare.cc#L547-L554

Oct 10 '22 14:10 pitrou

I think it's OK to have this, I wonder why the comparison doesn't start with a type comparison though (which should avoid this class of issues)

AFAICS, the path of invocations seems to be Array::Equals -> ArrayEquals -> ArrayRangeEquals -> CompareArrayRanges -> TypeEqualsVisitor::Visit(const StructType &). This path does not go thorough TypeEqualsVisitor::VisitChildren, for example, that would have checked for equal types. Perhaps all visitors methods in TypeEqualsVisitor should start with a type comparison.

Oct 10 '22 14:10 rtpsw

It does start with a type comparison

Right. As I just noted in crossing, I suspect the issue is with TypeEqualsVisitor which it uses.

Oct 10 '22 14:10 rtpsw

Can you show a snippet that would show the issue?

Oct 10 '22 14:10 pitrou

Can you show a snippet that would show the issue?

Good chances I could. I'll need a bit of time to get to this, though.

Oct 10 '22 14:10 rtpsw

Can you show a snippet that would show the issue?

Good chances I could. I'll need a bit of time to get to this, though.

I added a test that checks for this by comparing a badly structures array with a correctly structured one:

The final check in the test covers what you asked for. The other checks appearing before it are for normal conditions.
Without the change in RangeDataEqualsImpl, the final check fails due to not observing a failure.
Without the change in PrintDiff (but with the change in RangeDataEqualsImpl), the final check leads to a segmentation fault.

Oct 13 '22 13:10 rtpsw

Okay, so here is the problem: users shouldn't pass invalid data to Arrow APIs (except to Validate and ValidateFull, which are explicitly designed to handle such data). So it doesn't make sense to check for invalid data at the beginning of other functions; also, it can be quite costly (ValidateFull can typically be O(nrows * columns)).

(note: "invalid data" here is a badly structured array)

Oct 13 '22 13:10 pitrou

Okay, so here is the problem: users shouldn't pass invalid data to Arrow APIs (except to Validate and ValidateFull, which are explicitly designed to handle such data). So it doesn't make sense to check for invalid data at the beginning of other functions; also, it can be quite costly (ValidateFull can typically be O(nrows * columns)).

(note: "invalid data" here is a badly structured array)

This circles back to points we discussed. I can understand the requirement of passing valid data in a correct Arrow app, as well as in correct Arrow code, but less so during its development, where incorrect code frequently occurs. This PR aims to make (failure analysis during) development easier, given that its runtime cost is small. For the purpose of cost, I think the calls to ValidateFull in PrintDiff shouldn't count because they can be removed - I only give them for reproducibility without a segmentation fault.

Oct 13 '22 13:10 rtpsw

But again, during development you can call Validate[Full] in your own code. It doesn't really make sense to randomly add checks in Arrow functions, IMHO.

Oct 13 '22 14:10 pitrou

But again, during development you can call Validate[Full] in your own code. It doesn't really make sense to randomly add checks in Arrow functions, IMHO.

Well, not really randomly, but I understand what you're saying. While I agree one can easily add validation calls to the code while developing, I still think it's not convenient because in many cases the segfault is not making it easy to determine which structure is invalid. Moreover, a segfault is a relatively good result; when luck betrays, the result might be a memory leak or a buffer overrun that would make analysis of the root cause much harder.

Having said that, since you seem to be firm in your opinion, I'll stop pushing for this PR. It's not high priority.

Oct 13 '22 15:10 rtpsw