pandapower
[Fix] Surface chained assignment warnings
Chained assignment in pandas has long been discouraged and will break in pandas 3.0.
By default, pandas generates warnings when chained assignment is used, but these warnings are disabled in pandapower. They should be visible so the offending code can be fixed.
This PR restores the default chained assignment warnings. It also removes the blanket suppression of all warnings in protection.utility_functions.
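For context, the relevant pandas setting is a single option; a minimal sketch of the idea (the exact call sites in pandapower are not shown here):

import pandas as pd

# pandas' default for this option is 'warn'; disabling the warnings amounts to setting it
# to None, so restoring the default makes SettingWithCopyWarning visible again
pd.set_option('mode.chained_assignment', 'warn')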
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 75.39%. Comparing base (3b23e85) to head (c4cfa69).
Additional details and impacted files
@@            Coverage Diff             @@
##           develop    #2610      +/-   ##
===========================================
- Coverage    75.39%   75.39%    -0.01%
===========================================
  Files          288      288
  Lines        35233    35229        -4
===========================================
- Hits         26565    26561        -4
  Misses        8668     8668
@jthurner, thanks for bringing this up.
But we just recently found out that the fix suggested by pandas does not produce the same result, and we are now having issues with many chained assignments that were "fixed" incorrectly because of this warning. We are currently discussing how to reverse this and find a proper fix.
Until then I won't change this warning, so as not to encourage more people to add the "wrong fix" before we have figured out a solution.
But we just recently found out that the fix suggested by pandas does not produce the same result, and we are now having issues with many chained assignments that were "fixed" incorrectly because of this warning.
Interesting, can you link an example of this problem?
FWIW, setting mode.copy_on_write to "warn" is supposed to generate false positives, but that should be False by default in 2.2.
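For anyone who wants to experiment locally, both behaviours are regular pandas options in 2.2; a quick sketch (not part of this PR):

import pandas as pd

# opt in to the copy-on-write behaviour that becomes the default in pandas 3.0
pd.set_option("mode.copy_on_write", True)

# or ask pandas to warn wherever results will change under copy-on-write
# (documented to produce many false positives)
pd.set_option("mode.copy_on_write", "warn")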
@vogt31337 @jthurner Hi!
I'm writing to clarify the behavior of the chained assignments previously used in the codebase. Example:
net[elm].bus[connected_elms] = net[elm].bus[elm1]
As you’ve noted, this line relies on chained indexing, which can result in unpredictable behavior depending on whether pandas returns a view or a copy. A commonly suggested "safe" alternative is:
net[elm].loc[connected_elms, 'bus'] = net[elm].bus[elm1]
The situation is further complicated by the fact that different pandas versions may return either a view or a copy. However, I understand from previous discussions that applying this fix may not always produce the same result as the original line, and that changes in behavior were noticed after switching to .loc.
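As a standalone illustration (a toy frame, not the pandapower data model), under the copy-on-write behaviour that becomes the default in pandas 3.0 the chained form never reaches the original frame, while the .loc form does:

import pandas as pd

pd.options.mode.copy_on_write = True       # emulate the pandas 3.0 default

df = pd.DataFrame({"bus": [0, 1, 2], "in_service": [True, True, False]})

# chained indexing: df["bus"] is a temporary object, so the write is lost
# (pandas emits a chained-assignment warning here)
df["bus"][df["in_service"]] = 99
print(df["bus"].tolist())                  # [0, 1, 2] -- unchanged

# a single .loc call writes into the original frame
df.loc[df["in_service"], "bus"] = 99
print(df["bus"].tolist())                  # [99, 99, 2]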
My question is: What was the original intended behavior? Was the goal to modify the original bus column of net[elm], or was it acceptable (or even intended) that the change might apply only to a temporary copy (with no persistence)?
I also tried to restore the chained assignments in those edits of PR 2342 where possible, and set the upcoming pandas behavior in the conftest.py file: pd.options.mode.copy_on_write = True
import pandas as pd

# enable the copy-on-write behavior that becomes the default in pandas 3.0
def pytest_configure():
    pd.options.mode.copy_on_write = True
After that I ran all the tests, and they immediately crashed:
ERROR pandapower/test/plotting/test_geo.py - ValueError: assignment destination is read-only
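For reference, one common way this error appears once copy-on-write is enabled (whether that is the actual cause in test_geo.py would need checking) is writing into a NumPy array that now merely shares the frame's data:

import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
arr = df["x"].values      # under copy-on-write this array is a read-only view
try:
    arr[0] = 10.0
except ValueError as exc:
    print(exc)            # assignment destination is read-only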
@quant12345 @jthurner it was found by @heckstrahler in the CIM converter. Maybe he can point to the actual location. @quant12345 I cannot tell you which behaviour is wanted, since these errors seem to have been fixed by several people in many locations. So each location has to be checked individually...
The problem was in pandapower/converter/cim/cim2pp/convert_measurements.py line 35 and was fixed by @mrifraunhofer in #2617. The old 'fixed but not fixed' code was not able to handle different kinds of measurements and added rows filled with nan values for every kind of measurement after the first one.
Maybe I am just missing something, but I don't get how this change would not be functionally equivalent (for pandas 2). It's a one-line assignment, there are no side effects either before or after the change, and it fits the basic example in the pandas CoW migration docs.
It seems somewhat unlikely that you hit a weird corner case here that no one else has run into.
The "fix of the fix" does not only change the assignment, but a whole bunch of other things around it, and changes to the structure of the incoming data might confuse things further (e.g. here or here).
Either way, if there is a risk of "wrongly fixing" chained assignments, it would be helpful to be able to tell people what exactly to look out for. To better understand the problem, could you isolate the change on develop like this:
# revert the latest fix
git revert cda2537
# revert only the .loc change in question
curl https://github.com/e2nIEE/pandapower/commit/f9499f500edbbb0226cf8eda7d391ce37976aa33.patch | git apply -R --include pandapower/converter/cim/cim2pp/convert_measurements.py
then check if the bug is still there. If it is not, post a test case / the different versions of the measurements-df if you want others to have a look as well.
I tried to restore the chained assignments in the _copy_to_measurement function by changing the code as follows (the chained-assignment lines are shown commented out):
class CreateMeasurements:
    def __init__(self, net: pandapowerNet, cim: Dict):
        self.logger = logging.getLogger(self.__class__.__name__)
        self.net = net
        self.cim = cim
        # temporary variables
        self.state = 0
        self.df = pd.DataFrame()

    def _set_measurement_element_datatype(self):
        self.net.measurement.element = self.net.measurement.element.astype(np.sctypeDict.get("UInt32"))

    def _copy_to_measurement(self, input_df: pd.DataFrame):
        pp_type = 'measurement'
        self.logger.debug("Copy %s datasets to pandapower network with type %s" % (input_df.index.size, pp_type))
        if pp_type not in self.net.keys():
            self.logger.warning("Missing pandapower type %s in the pandapower network!" % pp_type)
            return
        #'''
        start_index_pp_net = self.net[pp_type].index.size
        self.net[pp_type] = pd.concat([self.net[pp_type], pd.DataFrame(None, index=[list(range(input_df.index.size))])],
                                      ignore_index=True, sort=False)
        for one_attr in self.net[pp_type].columns:
            if one_attr in input_df.columns:
                self.net[pp_type].loc[start_index_pp_net:, one_attr] = input_df[one_attr][:]
                #self.net[pp_type][one_attr][start_index_pp_net:] = input_df[one_attr][:]
        #'''
        # fix 2617
        '''
        if input_df.empty:
            return
        self.net[pp_type] = pd.concat([self.net[pp_type],
                                       input_df[list(set(self.net[pp_type].columns).intersection(input_df.columns))]],
                                      ignore_index=True, sort=False)
        '''
        self.state += 1
        if self.state > 1:
            print('equals !!!', self.net[pp_type].equals(self.df))
        if self.state == 15:
            print('equals !!!', self.net[pp_type].equals(self.df))
            self.net[pp_type].to_csv('loc.csv')
            #self.net[pp_type].to_csv('chained.csv')
        self.df = self.net[pp_type]
I ran the test once with the chained assignments and once with loc, printing the result after processing (at the end of the code); this happens 15 times. With the fix-2617 edits the print happens only once, because of the if input_df.empty: check.
pytest pandapower/test/converter/test_from_cim.py::test_Simbench_1_EHV_mixed__2_no_sw_res_gen -s
I also compare the result of the previous output with the current one:
print('equals !!!', self.net[pp_type].equals(self.df))
and on the 15th call I wrote both variants to files. The result of fix2617 was also written.
When comparing all three, the only differences are that the element column of fix2617 has dtype int64, while in the chained-assignment and loc variants it is float64,
and that the older code has no check for an empty input_df, which is why it runs in full all 15 times.
Is there any special case to reproduce the error when using loc?
the element column of fix2617 has dtype int64, while in the chained-assignment and loc variants it is float64
Did you see the different dtypes when comparing the isolated chained assignment change? Or only with fix2617 (possibly from the empty df check)?
The dataframe column types (chained and loc) match. They do not match the fix2617 dataframe.
I used the code provided to write the three dataframes separately to three files: chained, loc and fix2617. Checking them with compare returns an empty dataframe, but equals returns False. I also checked the column names, indexes and shape; they are identical for all dataframes, except for the column types. Here is the result of comparing the column types:
print('chained.equals(fix2617): ', chained.equals(fix2617))
print('loc.equals(fix2617) :', loc.equals(fix2617))
print('chained.equals(loc) :', chained.equals(loc))
print('chained.dtypes == fix2617.dtypes :\n', chained.dtypes == fix2617.dtypes)
print('loc.dtypes == fix2617.dtypes :\n', loc.dtypes == fix2617.dtypes)
print('chained.dtypes == loc.dtypes :\n', chained.dtypes == loc.dtypes)
Output:
chained.equals(fix2617): False
loc.equals(fix2617) : False
chained.equals(loc) : True
chained.dtypes == fix2617.dtypes :
Unnamed: 0 True
name True
measurement_type True
element_type True
element False
value True
std_dev True
side True
origin_id True
origin_class True
source True
analog_id True
terminal_id True
description True
dtype: bool
loc.dtypes == fix2617.dtypes :
Unnamed: 0 True
name True
measurement_type True
element_type True
element False
value True
std_dev True
side True
origin_id True
origin_class True
source True
analog_id True
terminal_id True
description True
dtype: bool
chained.dtypes == loc.dtypes :
Unnamed: 0 True
name True
measurement_type True
element_type True
element True
value True
std_dev True
side True
origin_id True
origin_class True
source True
analog_id True
terminal_id True
description True
dtype: bool
If you change fix2617['element'] to float64, then equals returns True.
With the old code, in both variants the element column initially has dtype uint32 and changes after this line:
self.net[pp_type] = pd.concat([self.net[pp_type], pd.DataFrame(None, index=[list(range(input_df.index.size))])],
ignore_index=True, sort=False)
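For reference, a standalone sketch (not pandapower code) of why that concat changes the dtype: the appended placeholder rows are all NaN, and NaN cannot be stored in a uint32 column, so pandas upcasts it to float64:

import numpy as np
import pandas as pd

df = pd.DataFrame({"element": np.array([1, 2], dtype=np.uint32)})
print(df.dtypes)        # element    uint32

extended = pd.concat([df, pd.DataFrame(None, index=[0, 1])], ignore_index=True, sort=False)
print(extended.dtypes)  # element    float64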
Here is a small example of a difference in these calls that could lead to different results:
from pandas import DataFrame
df1 = DataFrame({'1': [.1, .2, .3], '2': [1, 4, 5]})
df2 = DataFrame({'1': ['a', 'b', 'c', 'd', 'e'], '2': [1, 2, 3, 4, 5]})
df4 = DataFrame(df2)
df4.loc[1:, '2'] = df1['1'][:]
df2['2'][1:] = df1['1'][:]
A more detailed explanation: df1 has 3 values per column, so `df1['1'][:]` has length 3, but `df4.loc[1:, '2']` selects 4 rows. Since all selected rows are written, the last one ends up with something not contained in df1, i.e. nan. Intuitively one would expect the values to be applied like a copy-paste in Excel, overwriting only as many rows as values were given and leaving the last row untouched. Instead, the first call, using .loc with a slice indexer, writes to every selected row and fills anything past the length of the value with nan.
The second call fails because the selected section of df2 is longer than the values provided.
While this might not be the exact scenario we had here, it is similar enough. If the error was previously caught and ignored or acted upon, the code would now produce a different outcome than overwriting values with nan. So I also prefer not to encourage everyone to go fix these issues, but to have a few devs who understand this go through the warnings and change them accordingly. Not every place where these warnings were thrown has enough tests to catch such edge cases.
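To make the two outcomes concrete, running the snippet from above on pandas 2.x (without copy-on-write) gives, as far as I can tell, a nan-padded result for the .loc call and a length error for the chained call:

from pandas import DataFrame

df1 = DataFrame({'1': [.1, .2, .3], '2': [1, 4, 5]})
df2 = DataFrame({'1': ['a', 'b', 'c', 'd', 'e'], '2': [1, 2, 3, 4, 5]})
df4 = DataFrame(df2)

# .loc aligns the right-hand Series on its index: rows 1 and 2 get 0.2 and 0.3,
# rows 3 and 4 get nan because df1 has no index labels 3 and 4
df4.loc[1:, '2'] = df1['1'][:]
print(df4['2'].tolist())       # [1.0, 0.2, 0.3, nan, nan]

# the chained assignment targets 4 positions but provides only 3 values,
# so it raises a length-mismatch ValueError instead of silently padding with nan
try:
    df2['2'][1:] = df1['1'][:]
except ValueError as exc:
    print(exc)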
You can get something close to the chained assignment by using iloc instead of loc:

import numpy as np

ind_col = df4.columns.get_loc('2')         # positional index of column '2'
vals = np.array(df1['1'])
df4.iloc[1:1 + len(vals), ind_col] = vals  # iloc is purely positional; the slice length must match len(vals)
I'm not sure if this is necessary, but you can also fill the slice of df4['2'] based on the length of df1['1'] (the loc slice is inclusive, i.e. it covers index 1 to 3):
sub = np.array(df1['1'][:])
start_slice = 1
df4.loc[start_slice : len(sub), '2'] = sub
Another potential problem with loc: if the indices of df1 differ from those of df4, the assignment goes wrong, because loc aligns the right-hand values on their index labels:
df1 = pd.DataFrame({'1': [.1, .2, .3], '2': [1, 4, 5]}, index=[5, 6, 7])
df4.loc[2:, '2'] = df1['1'][:]
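Concretely, none of the labels line up in that case, so every targeted row becomes nan; a self-contained version of the snippet above:

import pandas as pd

df1 = pd.DataFrame({'1': [.1, .2, .3], '2': [1, 4, 5]}, index=[5, 6, 7])
df4 = pd.DataFrame({'1': ['a', 'b', 'c', 'd', 'e'], '2': [1, 2, 3, 4, 5]})

# no index label of df1['1'] (5, 6, 7) matches the target rows 2, 3 and 4,
# so the aligned values are all missing and the rows are set to nan
df4.loc[2:, '2'] = df1['1'][:]
print(df4['2'].tolist())       # [1.0, 2.0, nan, nan, nan]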