
GH-36593: [Python] Add rename_columns method to pyarrow datasets

Open JonatanMartens opened this issue 4 weeks ago • 1 comments

Rationale for this change

See https://github.com/apache/arrow/issues/36593. In particular, this change is convenient when the column names stored in a file differ from the logical names associated with the columns (see the deltalake column mapping feature for an example).

What changes are included in this PR?

Adds the rename_columns method to datasets in pyarrow. This method allows a user to rename columns in the data returned from a scan before actually creating a scanner object.

Are these changes tested?

This PR also adds a test for the new rename_columns method using an InMemoryDataset.

Are there any user-facing changes?

Adds the rename_columns method to pyarrow datasets.

  • GitHub Issue: #36593

JonatanMartens, Nov 29 '25 12:11

:warning: GitHub issue #36593 has been automatically assigned in GitHub to PR creator.

github-actions[bot], Nov 29 '25 12:11

@rok @raulcd @AlenkaF It's ready to merge now, could you take a look?

JonatanMartens, Dec 22 '25 12:12

I am not sure about the changes in this PR, mainly because I am not very knowledgeable when it comes to Acero and datasets. The functionality seems great to have, but modifying _scan_options to rename columns on read feels a bit hacky.

What do you think @rok ?

AlenkaF, Dec 23 '25 09:12

The change looks good to me in principle. I do agree with @AlenkaF that changing _scan_options seems a bit forced and could have unexpected consequences elsewhere. Can you check if there is a nicer way?

rok, Dec 23 '25 09:12

Sounds good. I'm now using a new attribute called _columns instead of relying on _scan_options. @rok @AlenkaF

JonatanMartens, Dec 23 '25 11:12