[Python] Table.from_pandas creates duplicate column names if the dataframe already contains __index_level_i__ columns
Describe the bug, including details regarding any error messages, version, and platform.
The pandas -> Arrow conversion adds an __index_level_i__ column if the dataframe has an unnamed index that it wants to preserve (i.e. if it is not just a pandas RangeIndex). But if your dataframe already has such a column, you end up with a duplicate field:
In [40]: df = pd.DataFrame({"col": [1, 2, 3], "__index_level_0__": [1, 2, 3]}, index=[2, 3, 4])
In [41]: df
Out[41]:
   col  __index_level_0__
2    1                  1
3    2                  2
4    3                  3
In [42]: pa.table(df)
Out[42]:
pyarrow.Table
col: int64
__index_level_0__: int64
__index_level_0__: int64
----
col: [[1,2,3]]
__index_level_0__: [[1,2,3]]
__index_level_0__: [[2,3,4]]
We could have it bump the integer in the generated column name until it no longer collides with an existing column? (Although we would then have to check how that works in the full roundtrip.)
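A minimal sketch of the "bump the integer" idea, in plain Python. The helper name next_free_index_level_name is hypothetical and is not part of pyarrow; it only illustrates how a non-colliding name could be chosen, and says nothing about how the roundtrip back to pandas would recover the index.

```python
def next_free_index_level_name(existing_columns, level=0):
    """Return the first '__index_level_<i>__' name, starting at `level`,
    that is not already used by a column in `existing_columns`.

    Hypothetical helper for illustration only, not pyarrow API.
    """
    taken = set(existing_columns)
    i = level
    # Keep bumping the integer until the generated name is unused.
    while f"__index_level_{i}__" in taken:
        i += 1
    return f"__index_level_{i}__"


# With the dataframe from the report, the generated name would move to 1.
name = next_free_index_level_name(["col", "__index_level_0__"])
print(name)
```

One consequence of this approach is that the integer in the generated name no longer matches the index level position, which is presumably why the roundtrip needs checking.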
Component(s)
Python
Related issues
- https://github.com/apache/arrow/issues/44059
I am trying out ideas in this draft PR: https://github.com/apache/arrow/pull/46884 in case there are any comments already.