BUG: Renaming dataframe columns with a Series containing a duplicated index corrupts the dataframe

Open mixmixmix opened this issue 9 months ago • 4 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# Define a DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

# Create a Series for renaming columns, with a non-unique index
rename_series = pd.Series(['X', 'Y', 'Z', 'W'], index=['A', 'B', 'B', 'C'])

# Rename columns using the Series as the mapper
df.rename(columns=rename_series, inplace=True)

print(df)  # TypeError: unhashable type: 'Series'
print(df['X'])  # TypeError: cannot convert the series to <class 'int'>

The following extended example shows that the dataframe can appear uncorrupted if the display does not reach the problematic column names:

import pandas as pd
import numpy as np

# Define a DataFrame of size 20x60
data = np.random.randint(1, 100, size=(20, 60))
columns = [f'Col_{i}' for i in range(60)]
df = pd.DataFrame(data, columns=columns)

# Create a Series for renaming columns; its index is unique except for 'Col_29', which appears twice in the middle
new_names = [f'New_{i}' for i in range(61)]
old_names = [f'Col_{i}' for i in range(30)] + ['Col_29'] + [f'Col_{i}' for i in range(30, 60)]

rename_series = pd.Series(new_names, index=old_names)
# Apply renaming to the DataFrame
df.rename(columns=rename_series, inplace=True)

df  # works
df['New_0']  # TypeError: cannot convert the series to <class 'int'>

Issue Description

When renaming dataframe columns with a Series whose index contains duplicates, no error is thrown but the dataframe is corrupted.
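A small sketch that may help explain the symptom. It only shows how label lookup on a Series behaves when the index has duplicates; that rename looks up each column label in the mapper this way is an assumption here, not something confirmed in this thread.

```python
import pandas as pd

rename_series = pd.Series(['X', 'Y', 'Z', 'W'], index=['A', 'B', 'B', 'C'])

# A label that occurs once returns a scalar.
unique_lookup = rename_series['A']

# A label that occurs twice returns a Series holding both matches,
# which is not a usable column name.
duplicate_lookup = rename_series['B']

print(type(unique_lookup).__name__)     # str
print(type(duplicate_lookup).__name__)  # Series
print(rename_series.index.is_unique)    # False
```

If that assumption holds, the 'B' column would end up labelled with a Series object, which would be consistent with the "unhashable type: 'Series'" error above.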

Expected Behavior

It should either produce a valid dataframe, as using Series.to_dict() instead would, or throw an error during conversion.
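For reference, a minimal sketch of the Series.to_dict() route mentioned above, using the first example's data. When a label is duplicated, the resulting dict keeps the last value for it, so 'B' maps to 'Z' here.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
rename_series = pd.Series(['X', 'Y', 'Z', 'W'], index=['A', 'B', 'B', 'C'])

# Converting to a dict collapses the duplicated label 'B';
# the later entry ('Z') overwrites the earlier one ('Y').
mapping = rename_series.to_dict()
renamed = df.rename(columns=mapping)

print(mapping)                # {'A': 'X', 'B': 'Z', 'C': 'W'}
print(list(renamed.columns))  # ['X', 'Z', 'W']
```

Note that this silently drops the 'Y' entry, which may or may not be what the caller wants.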

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.12.3.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:12:41 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T8103
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : None
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.4.6
psycopg2 : 2.9.9
jinja2 : None
IPython : 8.23.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.29
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

mixmixmix avatar May 07 '24 20:05 mixmixmix

Hi @mixmixmix, I was able to reproduce what you provided. However, in my view, why not just use a dict instead of a Series? As the documentation example shows, a dict might be preferable.

import pandas as pd
import numpy as np

# Define a DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

df.rename(columns={'A': 'X', 'B': 'Y', 'C': 'W'}, inplace=True, errors="raise")
#    X  Y  W
# 0  1  4  7
# 1  2  5  8
# 2  3  6  9

# Define a DataFrame of size 20x60
data = np.random.randint(1, 100, size=(20, 60))
columns = [f'Col_{i}' for i in range(60)]
df = pd.DataFrame(data, columns=columns)

# Build the old and new name lists; 'Col_29' appears twice in old_names
new_names = [f'New_{i}' for i in range(61)]
old_names = [f'Col_{i}' for i in range(30)] + ['Col_29'] + [f'Col_{i}' for i in range(30, 60)]

df.rename(columns={old_names[i]: new_names[i] for i in range(61)}, inplace=True, errors="raise")
df['New_0']  # works

luke396 avatar May 16 '24 03:05 luke396

Thanks @luke396, and yes, using dicts absolutely makes more sense! However, as long as passing a Series is allowed, I think it should raise an error if it cannot create valid columns for the dataframe.
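To illustrate the kind of check meant here, a rough sketch of a user-side guard. The helper name rename_columns_strict is hypothetical and not part of pandas; it only shows one way to fail early when the Series index is not unique.

```python
import pandas as pd

def rename_columns_strict(df, mapper):
    """Hypothetical helper (not a pandas API): rename columns, but refuse
    a Series mapper whose index contains duplicated labels."""
    if isinstance(mapper, pd.Series):
        if not mapper.index.is_unique:
            dupes = mapper.index[mapper.index.duplicated()].unique().tolist()
            raise ValueError(f"rename mapper has duplicated index labels: {dupes}")
        # With a unique index the dict conversion loses nothing.
        mapper = mapper.to_dict()
    return df.rename(columns=mapper)

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
good = pd.Series(['X', 'Y', 'W'], index=['A', 'B', 'C'])
bad = pd.Series(['X', 'Y', 'Z', 'W'], index=['A', 'B', 'B', 'C'])

print(list(rename_columns_strict(df, good).columns))  # ['X', 'Y', 'W']

try:
    rename_columns_strict(df, bad)
except ValueError as exc:
    print(exc)  # rename mapper has duplicated index labels: ['B']
```

Whether pandas itself should raise, or fall back to the dict behaviour, is the open question in this issue; the sketch picks raising only for the sake of the example.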

mixmixmix avatar May 17 '24 15:05 mixmixmix

@luke396 I can look into this, if you want.

shoaib-moeen avatar May 17 '24 19:05 shoaib-moeen

Of course, anyone can contribute to pandas. Replying 'take' will assign the issue to you.

luke396 avatar May 18 '24 03:05 luke396