GH-22232: [C++][Python] Introduce optional default_column_type parameter

Open vladborovtsov opened this issue 3 months ago • 15 comments

Rationale for this change

Add an optional default_column_type parameter to the CSV reading API (C++ and Python) to provide a fallback type when per-column types aren’t specified, improving schema consistency and complementing the existing column_types logic.

What changes are included in this PR?

  • C++: new convert option "default_column_type" that augments the existing column_types logic
  • C++: 3 reader tests (DefaultColumnTypePartialDefault, DefaultColumnTypeAllStringsWithHeader, DefaultColumnTypeAllStringsNoHeader). The last two tests are inspired by https://github.com/pandas-dev/pandas/pull/62242 and https://github.com/pandas-dev/pandas/issues/57666
  • Python: corresponding changes to make the C++ option usable from Python
  • Python: extended the test_convert_options test to include default_column_type
  • Python: added a new test "test_default_column_type", which checks how the option affects the resulting schema; the test also implicitly verifies that leading zeros are preserved
  • Documentation: relevant updates for the Python component
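
To make the intended fallback order concrete, here is a minimal pure-Python sketch of the behavior described above. It is not the Arrow implementation: read_csv_sketch and infer_value are hypothetical helpers, plain Python callables stand in for Arrow data types, the per-value inference is a toy, and the f0, f1, ... names for headerless input are an assumption made for illustration. The precedence it models is: an explicit entry in column_types wins, then default_column_type if one is given, otherwise type inference as before.

```python
import csv
import io


def infer_value(text):
    """Toy stand-in for type inference: try int, then float, else keep str."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def read_csv_sketch(data, column_types=None, default_column_type=None,
                    has_header=True):
    """Hypothetical helper modelling the proposed fallback order.

    1. explicit entry in column_types
    2. default_column_type, if given
    3. inference (the behavior when neither is set)
    """
    column_types = column_types or {}
    rows = list(csv.reader(io.StringIO(data)))
    if has_header:
        names, rows = rows[0], rows[1:]
    else:
        # Assumed placeholder names for headerless input.
        names = [f"f{i}" for i in range(len(rows[0]))]

    converters = []
    for name in names:
        if name in column_types:
            converters.append(column_types[name])
        elif default_column_type is not None:
            converters.append(default_column_type)
        else:
            converters.append(infer_value)

    return {
        name: [convert(row[i]) for row in rows]
        for i, (name, convert) in enumerate(zip(names, converters))
    }


data = "id,zip,score\n1,00501,9.5\n2,02134,7.0\n"

# No options: every column is inferred, so leading zeros in "zip" are lost.
inferred = read_csv_sketch(data)

# Partial default: "id" is explicit, all other columns fall back to str.
partial = read_csv_sketch(data, column_types={"id": int},
                          default_column_type=str)

# All strings, no header row.
no_header = read_csv_sketch("00501,1\n", default_column_type=str,
                            has_header=False)
```

In this sketch the three calls roughly mirror the scenarios named by the new reader tests (partial default, all strings with a header, all strings without a header). Leaving default_column_type unset reproduces the inference-only path, which is the backward-compatibility property the PR relies on.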

Are these changes tested?

Yes. Existing and new tests are passing.

C++:

> [==========] Running 3 tests from 1 test suite.
> [----------] Global test environment set-up.
> [----------] 3 tests from ReaderTests
> [ RUN      ] ReaderTests.DefaultColumnTypePartialDefault
> [       OK ] ReaderTests.DefaultColumnTypePartialDefault (3 ms)
> [ RUN      ] ReaderTests.DefaultColumnTypeAllStringsWithHeader
> [       OK ] ReaderTests.DefaultColumnTypeAllStringsWithHeader (0 ms)
> [ RUN      ] ReaderTests.DefaultColumnTypeAllStringsNoHeader
> [       OK ] ReaderTests.DefaultColumnTypeAllStringsNoHeader (0 ms)
> [----------] 3 tests from ReaderTests (4 ms total)
> 
> [----------] Global test environment tear-down
> [==========] 3 tests from 1 test suite ran. (4 ms total)
> [  PASSED  ] 3 tests.

All:

> [==========] 264 tests from 46 test suites ran. (452 ms total)
> [  PASSED  ] 264 tests.

pyarrow: New tests are passing.

Are there any user-facing changes?

I believe this change is backward compatible. The parameter is optional and its default value doesn't change the existing behavior; all existing tests are passing.

Maybe relevant: https://github.com/apache/arrow/issues/22232

Relates to https://github.com/apache/arrow/issues/47502

  • GitHub Issue: #47502

  • GitHub Issue: #22232

vladborovtsov avatar Sep 27 '25 12:09 vladborovtsov

:warning: GitHub issue #47502 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar Sep 27 '25 12:09 github-actions[bot]

@github-actions crossbow submit preview-docs

vladborovtsov avatar Sep 27 '25 17:09 vladborovtsov

Only contributors can submit requests to this bot. Please ask someone from the community for help with getting the first commit in.
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/18062577036

github-actions[bot] avatar Sep 27 '25 17:09 github-actions[bot]

Thank you @vladborovtsov for the contribution. I will add info about the proposed solution in the original issue (https://github.com/apache/arrow/issues/22232) so I can see opinions from C++ devs on the proposed solution.

AlenkaF avatar Oct 24 '25 09:10 AlenkaF

:warning: GitHub issue #22232 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar Oct 24 '25 09:10 github-actions[bot]

Hi @AlenkaF, I'm happy to continue the work and discussion to get this merged. As for AI, it wasn't used much, although I tried :) With such a huge codebase, the generation quality is quite low.

vladborovtsov avatar Oct 24 '25 09:10 vladborovtsov

Happy to see a response! All good, it is totally OK to use gen AI wisely ;)

I will wait for an opinion from a C++ dev and in the meantime try to look at the Python part.

AlenkaF avatar Oct 24 '25 09:10 AlenkaF

Hi @AlenkaF Any feedback?

vladborovtsov avatar Dec 10 '25 17:12 vladborovtsov