GH-22232: [C++][Python] Introduce optional default_column_type parameter
Rationale for this change
Add an optional default_column_type parameter to the CSV reading API (C++ and Python) to provide a fallback type when per-column types aren’t specified, improving schema consistency and complementing the existing column_types logic.
What changes are included in this PR?
- C++: new convert option "default_column_type" that augments the logic around the column_types parameter
- C++: 3 reader tests (DefaultColumnTypePartialDefault, DefaultColumnTypeAllStringsWithHeader, DefaultColumnTypeAllStringsNoHeader). The last two tests are inspired by https://github.com/pandas-dev/pandas/pull/62242 and https://github.com/pandas-dev/pandas/issues/57666
- python: corresponding changes to make the C++ change consumable from Python
- python: extended the test_convert_options test to include default_column_type
- python: added a new test "test_default_column_type", which tests how the field impacts the schema; the test also implicitly verifies leading zero preservation
- python: relevant documentation update
Are these changes tested?
Yes. Existing and new tests are passing.
C++:
> [==========] Running 3 tests from 1 test suite.
> [----------] Global test environment set-up.
> [----------] 3 tests from ReaderTests
> [ RUN ] ReaderTests.DefaultColumnTypePartialDefault
> [ OK ] ReaderTests.DefaultColumnTypePartialDefault (3 ms)
> [ RUN ] ReaderTests.DefaultColumnTypeAllStringsWithHeader
> [ OK ] ReaderTests.DefaultColumnTypeAllStringsWithHeader (0 ms)
> [ RUN ] ReaderTests.DefaultColumnTypeAllStringsNoHeader
> [ OK ] ReaderTests.DefaultColumnTypeAllStringsNoHeader (0 ms)
> [----------] 3 tests from ReaderTests (4 ms total)
>
> [----------] Global test environment tear-down
> [==========] 3 tests from 1 test suite ran. (4 ms total)
> [ PASSED ] 3 tests.
All:
> [==========] 264 tests from 46 test suites ran. (452 ms total)
> [ PASSED ] 264 tests.
pyarrow: New tests are passing.
Are there any user-facing changes?
I believe this change is backward compatible. The parameter is optional and its default value doesn't change the existing behavior; all the existing tests are passing.
Maybe relevant: https://github.com/apache/arrow/issues/22232
Relates to https://github.com/apache/arrow/issues/47502
- GitHub Issue: #47502
- GitHub Issue: #22232
:warning: GitHub issue #47502 has been automatically assigned in GitHub to PR creator.
@github-actions crossbow submit preview-docs
Only contributors can submit requests to this bot. Please ask someone from the community for help with getting the first commit in.
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/18062577036
Thank you @vladborovtsov for the contribution. I will add info about the proposed solution in the original issue (https://github.com/apache/arrow/issues/22232) so I can see opinions from C++ devs on the proposed solution.
:warning: GitHub issue #22232 has been automatically assigned in GitHub to PR creator.
Hi @AlenkaF, I'm happy to continue the work and discussion to get this merged. As for AI, it wasn't used much, although I tried :) With such a huge codebase, the generation quality is quite low.
Happy to see a response! All good, it is totally OK to use gen AI wisely ;)
I will wait for an opinion from a C++ dev and in the meantime try to look at the Python part.
Hi @AlenkaF Any feedback?