arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-44232: [Python] Validate __arrow_c_array__ length for scalar construction

Open Lukedowling444 opened this issue 1 year ago • 5 comments

Rationale for this change

In the current implementation of the pyarrow.scalar constructor, when an input has the arrow_c_array interface, there is no check to ensure the length of the array is exactly 1. This can lead to unexpected behavior or invalid data being accepted for scalar construction.

This change introduces a validation step that checks whether the input array has a length of 1 when using the arrow_c_array interface, ensuring that only valid inputs are processed. This improves the reliability of Arrow when handling data from multiple libraries.

An existing workaround involves calling pa.array(other_scalar).slice(0, 1) to achieve the same result, but this change eliminates the need for such a workaround.

Looking to close #44232

What changes are included in this PR?

A validation check was added inside the pyarrow.scalar constructor to ensure that when an input with the arrow_c_array interface is provided, it must be a length-1 array. The constructor now raises a ValueError if this condition is not met.

Are these changes tested?

Yes, the project builds successfully, and all existing test cases pass without any issues.

Additionally, I have included new test cases to specifically check the behavior of this change in python/scripts/test_scalar.py. These tests ensure that the pyarrow.scalar constructor raises a ValueError when a non-length-1 array is provided and continues to work correctly for valid inputs.

Are there any user-facing changes?

No user-facing changes are introduced, and this validation should not affect any workflows where inputs to pyarrow.scalar are already correct.

GitHub Issue: https://github.com/apache/arrow/issues/44232

  • GitHub Issue: #44232

Lukedowling444 avatar Oct 16 '24 03:10 Lukedowling444

Hi @lukedowling44. This looks awsome. However, I think you might have accidentally commited your build folder (\dist) and your virtual environment (\pyarrow-dev) as well. I think these files are better not get committed to the git repository.

ghost avatar Oct 16 '24 07:10 ghost

Hi @lukedowling44. This looks awsome. However, I think you might have accidentally commited your build folder (\dist) and your virtual environment (\pyarrow-dev) as well. I think these files are better not get committed to the git repository.

To clarify, make sure to only commit your tests and scalar.pxi changes.

piratepanda805 avatar Oct 16 '24 07:10 piratepanda805

@lukedowling44 This looks perfect now and have fixed the previous issue. I would like to ask why most of the statements inside the if-clause is quite similar to the execution statements in the following.

ghost avatar Oct 18 '24 12:10 ghost

@Isaac7777-cpu-school You are right, I was repeating code unnecessarily. I have made changes and pushed again.

Lukedowling444 avatar Oct 19 '24 00:10 Lukedowling444

@Lukedowling444 Nice, all the dead codes are removed. I think it is pretty much ready for the PR. However, just a minor thing and this may not make sense as much - why do we need another local variable for the carray rather than directly put it in the if condition? However, since this is really just a styling issue, please also consult with the maintainer for this as well?

ghost avatar Oct 19 '24 09:10 ghost

@Isaac7777-cpu-school Thanks for the feedback. Cython doesn’t allow cdef statements directly within conditional logic or non-function scopes, so I had to define it before the condition.

Lukedowling444 avatar Oct 27 '24 03:10 Lukedowling444

cc @AlenkaF

pitrou avatar Mar 03 '25 16:03 pitrou