[Python] Table.from_pandas creates duplicate column names if the dataframe already contains __index_level_i__ columns

Open jorisvandenbossche opened this issue 8 months ago • 1 comments

Describe the bug, including details regarding any error messages, version, and platform.

The pandas -> arrow conversion adds an __index_level_i__ column if the dataframe has an unnamed index that it wants to preserve (i.e. if it is not just a pandas RangeIndex). But if your dataframe already has such a column, you end up with a duplicate field:

In [40]: df = pd.DataFrame({"col": [1, 2, 3], "__index_level_0__": [1, 2, 3]}, index=[2, 3, 4])

In [41]: df
Out[41]: 
   col  __index_level_0__
2    1                  1
3    2                  2
4    3                  3

In [42]: pa.table(df)
Out[42]: 
pyarrow.Table
col: int64
__index_level_0__: int64
__index_level_0__: int64
----
col: [[1,2,3]]
__index_level_0__: [[1,2,3]]
__index_level_0__: [[2,3,4]]
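
The duplication above can also be detected programmatically rather than by reading the printed schema. A minimal sketch, assuming the field names are available as a plain list (in pyarrow that would be `table.schema.names`); the helper name `duplicate_names` is made up for illustration and the list is hard-coded here so the snippet runs without pandas or pyarrow:

```python
from collections import Counter


def duplicate_names(names):
    """Return the field names that occur more than once (hypothetical helper)."""
    counts = Counter(names)
    return [name for name, count in counts.items() if count > 1]


# Field names copied from the table printed above.
names = ["col", "__index_level_0__", "__index_level_0__"]
print(duplicate_names(names))  # ['__index_level_0__']
```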

We could have it bump the integer in the generated column name until it no longer clashes with an existing column? (Although we would then have to check how that works in the full roundtrip.)
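
To make the "bump the integer" idea concrete, here is a rough sketch in plain Python. It is not pyarrow's implementation and not what the linked PR does; the function name `generate_index_level_names` and its parameters are assumptions made up for illustration. It only shows the name-picking step and does not address the roundtrip back to pandas.

```python
def generate_index_level_names(column_names, num_levels):
    """Pick names for unnamed index levels, skipping names already in use.

    Hypothetical helper: `column_names` are the dataframe's existing column
    names, `num_levels` is the number of unnamed index levels to serialize.
    """
    taken = {str(name) for name in column_names}
    generated = []
    candidate = 0
    for _ in range(num_levels):
        # Bump the integer until the generated name is unused.
        while f"__index_level_{candidate}__" in taken:
            candidate += 1
        name = f"__index_level_{candidate}__"
        generated.append(name)
        taken.add(name)  # also avoid clashing with names generated earlier
    return generated


# The dataframe from the report: one unnamed index level, one clashing column.
print(generate_index_level_names(["col", "__index_level_0__"], 1))
# ['__index_level_1__']
```

With this scheme the generated number no longer necessarily matches the position of the index level, which is one of the things the roundtrip check would need to cover.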

Component(s)

Python

Related issues

https://github.com/apache/arrow/issues/44059

Apr 18 '25 09:04 jorisvandenbossche

I am trying out ideas in this draft PR: https://github.com/apache/arrow/pull/46884 in case there are any comments already.

Jun 23 '25 14:06 AlenkaF