BUG: ValueError with loc[] = (Regression 2.1.0)
Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame(index=[1, 1, 2, 2], data=["1", "1", "2", "2"])
df.loc[df[0].str.len() > 1, 0] = df[0]
df
Issue Description
The code above fails on the third line with the exception given below. It executes normally with pandas versions <2.1.0.
Traceback (most recent call last):
File "", line 3, in
Expected Behavior
Code should execute normally, with the result:
   0
1  1
1  1
2  2
2  2
(No reindexing should be necessary since no rows are selected with code on line 3.)
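A quick check of the two facts this expectation rests on, as a minimal sketch using the same frame as the example above. The variable name `mask` is introduced here for illustration and is not part of the original report.

```python
import pandas as pd

df = pd.DataFrame(index=[1, 1, 2, 2], data=["1", "1", "2", "2"])

# Every string in column 0 has length 1, so the condition is False for all rows.
mask = df[0].str.len() > 1
print(mask.any())          # False: no row is selected by the mask

# The index labels 1 and 2 each appear twice.
print(df.index.is_unique)  # False
```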
Installed Versions
INSTALLED VERSIONS
commit              : ba1cccd19da778f0c3a7d6a885685da16a072870
python              : 3.9.0.final.0
python-bits         : 64
OS                  : Windows
OS-release          : 10
Version             : 10.0.19041
machine             : AMD64
processor           : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder           : little
LC_ALL              : None
LANG                : None
LOCALE              : English_Ireland.1252
pandas              : 2.1.0
numpy               : 1.24.2
pytz                : 2023.3
dateutil            : 2.8.2
setuptools          : 65.5.1
pip                 : 22.3.1
Cython              : None
pytest              : None
hypothesis          : None
sphinx              : None
blosc               : None
feather             : None
xlsxwriter          : None
lxml.etree          : None
html5lib            : None
pymysql             : None
psycopg2            : None
jinja2              : None
IPython             : None
pandas_datareader   : None
bs4                 : None
bottleneck          : None
dataframe-api-compat: None
fastparquet         : None
fsspec              : None
gcsfs               : None
matplotlib          : None
numba               : None
numexpr             : None
odfpy               : None
openpyxl            : 3.1.2
pandas_gbq          : None
pyarrow             : None
pyreadstat          : None
pyxlsb              : None
s3fs                : None
scipy               : None
sqlalchemy          : None
tables              : None
tabulate            : None
xarray              : None
xlrd                : None
zstandard           : None
tzdata              : 2023.3
qtpy                : None
pyqt5               : None
Thanks for the report!
> (No reindexing should be necessary since no rows are selected with code on line 3.)
To be clear, it's not the left-hand side that is reindexing, it's the right. E.g.
df.loc[df[0].str.len() > 1, 0] = 5
works. I believe we raise any time the RHS has duplicate index labels, because the result can be ambiguous, even though it won't necessarily be ambiguous in every case. In general we try to avoid values-dependent behavior. Here, if the mask on the left happens to be all False, you may think the code works, but it will then fail as soon as the mask isn't all False. That can be a bad user experience.
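To illustrate the distinction, here is a rough sketch rather than a trace of the actual `loc` code path. It assumes that the explicit `reindex` call below is a fair stand-in for the alignment that a Series on the RHS triggers. The names `mask`, `before`, `unchanged` and `reindex_failed` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(index=[1, 1, 2, 2], data=["1", "1", "2", "2"])
mask = df[0].str.len() > 1  # all False for this data

# Scalar RHS: there is nothing to align, so the assignment goes through.
# With an all-False mask it leaves the frame as it was.
before = df.copy()
df.loc[mask, 0] = 5
unchanged = df.equals(before)
print(unchanged)

# Series RHS: pandas has to align it with the target rows. Reindexing a
# Series whose index has duplicate labels is ambiguous, and an explicit
# reindex (used here as an assumed stand-in for that alignment) raises.
try:
    df[0].reindex([1, 2])
    reindex_failed = False
except ValueError as exc:
    reindex_failed = True
    print(f"reindex raised ValueError: {exc}")
```

The point of the sketch is that the failure depends on the labels of the RHS, not on how many rows the mask selects.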
> Code executes normally with pandas versions <2.1.0
Ah, I missed this! Thanks for that detail. We should run a git blame and see where this ended up getting changed.
take
take