pandapower icon indicating copy to clipboard operation
pandapower copied to clipboard

[Fix] Surface chained assignment warnings

Open jthurner opened this issue 6 months ago • 10 comments

Chained assignment in pandas has long been discouraged and will break in pandas 3.0:

By default, pandas generates warnings if chained assignment is used, but they are disabled in pandapower. These warnings should be visible so the offending code can get fixed.

This PR restores the default chained assignment warnings. It also removes the blanket suppression of all warnings in protection.utility_functions.

jthurner avatar May 15 '25 09:05 jthurner

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 75.39%. Comparing base (3b23e85) to head (c4cfa69).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2610      +/-   ##
===========================================
- Coverage    75.39%   75.39%   -0.01%     
===========================================
  Files          288      288              
  Lines        35233    35229       -4     
===========================================
- Hits         26565    26561       -4     
  Misses        8668     8668              

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar May 15 '25 09:05 codecov[bot]

@jthurner, thanks for bringing this up.

But we just recently found out, that the fix suggested by pandas does not produce the same result and we are now having issues with lots of fixed chain assignments, which are incorrectly fixed, due to this warning. Currently we are discussing how to reverse the problem and find a proper fix.

Until then I won't change this warning, to not encourage more people to add the "wrong fix", until we have figured out a solution.

vogt31337 avatar May 27 '25 14:05 vogt31337

But we just recently found out, that the fix suggested by pandas does not produce the same result and we are now having issues with lots of fixed chain assignments, which are incorrectly fixed, due to this warning.

Interesting, can you link an example of this problem?

FWIW, setting mode.copy_on_write to "warn" is supposed to generate false positives, but that should be False by default in 2.2.

jthurner avatar May 28 '25 06:05 jthurner

@vogt31337 @jthurner Hi!

I'm writing to clarify the behavior of chain assignments previously used in the codebase. Example: net[elm].bus[connected_elms] = net[elm].bus[elm1]

As you’ve noted, this line relies on chained indexing, which can result in unpredictable behavior depending on whether pandas returns a view or a copy. A commonly suggested "safe" alternative is: net[elm].loc[connected_elms, 'bus'] = net[elm].bus[elm1]

The situation is further complicated by the fact that different versions pandas may convey either view or copie. However, I understand from previous discussions that applying this fix may not always produce the same result as the original line, and that changes to the behavior were noticed after switching to .loc.

My question is: What was the original intended behavior? Was the goal to modify the original bus column of net[elm], or was it acceptable (or even intended) that the change might apply only to a temporary copy (with no persistence)?

I also tried to return chain assignments in those edits of PR 2342, where possible, and set the near future behavior of pandas in the conftest.py file: pd.options.mode.copy_on_write = True

import pandas as pd

# Enable Copy-on-Write behavior in pandas 3.0
def pytest_configure():
    pd.options.mode.copy_on_write = True

After which I ran all the tests and they immediately crashed:

ERROR pandapower/test/plotting/test_geo.py - ValueError: assignment destination is read-only

quant12345 avatar Jun 04 '25 16:06 quant12345

@quant12345 @jthurner it was found by @heckstrahler in the CIM converter. Maybe he can point to the actual location. @quant12345 I cannot tell you which behaviour is wanted, since these errors seem to have been fixed by several people in many locations. So each location has to be checked individually...

vogt31337 avatar Jun 16 '25 10:06 vogt31337

The problem was in pandapower/converter/cim/cim2pp/convert_measurements.py line 35, and was fixed by @mrifraunhofer in #2617 The old 'fixed but not fixed' code was not able to handle different kinds of measurements and added rows filled with nan values for every kind of measurment after the first one.

heckstrahler avatar Jun 16 '25 14:06 heckstrahler

Maybe I am just missing something, but I don't get how this change would not be functionally equivalent (for pandas 2). It's a one-line assignement, there are no side effects either before or after the change, and it fits the basic example in the pandas CoW migrations docs.

It seems somewhat unlikely that you hit a weird corner case here that no one else has run into.

The "fix of the fix" does not only change the assignment, but a whole bunch of other things around it, and changes to the structure of the incoming data might confuse things further (e.g. here or here).

Either way, if there is a risk of "wrongly fixing" chained assignments, it would be helpful to be able to tell people what exactly to look out for. To better understand the problem, could you isolate the change on develop like this:

# revert the latest fix
git revert cda2537
# revert only the .loc change in question
curl https://github.com/e2nIEE/pandapower/commit/f9499f500edbbb0226cf8eda7d391ce37976aa33.patch | git apply -R --include pandapower/converter/cim/cim2pp/convert_measurements.py

then check if the bug is still there. If it is not, post a test case / the different versions of the measurements-df if you want others to have a look as well.

jthurner avatar Jun 17 '25 07:06 jthurner

I tried to return chain assignments in the _copy_to_measurement function by

changing the code as follows(chain assignments - commented out):
class CreateMeasurements:

    def __init__(self, net: pandapowerNet, cim: Dict):
        self.logger = logging.getLogger(self.__class__.__name__)
        self.net = net
        self.cim = cim

        # temporary variables
        self.state = 0
        self.df = pd.DataFrame()

    def _set_measurement_element_datatype(self):
        self.net.measurement.element = self.net.measurement.element.astype(np.sctypeDict.get("UInt32"))

    def _copy_to_measurement(self, input_df: pd.DataFrame):
        pp_type = 'measurement'
        self.logger.debug("Copy %s datasets to pandapower network with type %s" % (input_df.index.size, pp_type))
        if pp_type not in self.net.keys():
            self.logger.warning("Missing pandapower type %s in the pandapower network!" % pp_type)
            return
        #'''
        start_index_pp_net = self.net[pp_type].index.size
        self.net[pp_type] = pd.concat([self.net[pp_type], pd.DataFrame(None, index=[list(range(input_df.index.size))])],
                                      ignore_index=True, sort=False)
        for one_attr in self.net[pp_type].columns:
            if one_attr in input_df.columns:
                self.net[pp_type].loc[start_index_pp_net:, one_attr] = input_df[one_attr][:]
                #self.net[pp_type][one_attr][start_index_pp_net:] = input_df[one_attr][:]
        #'''

        # fix 2617
        '''
        if input_df.empty:
            return
        self.net[pp_type] = pd.concat([self.net[pp_type],
                                      input_df[list(set(self.net[pp_type].columns).intersection(input_df.columns))]],
                                      ignore_index=True, sort=False)
        '''
        self.state += 1
        if self.state > 1:
             print('equals !!!', self.net[pp_type].equals(self.df))
        if self.state == 15:
            print('equals !!!', self.net[pp_type].equals(self.df))
            self.net[pp_type].to_csv('loc.csv')
            #self.net[pp_type].to_csv('chained.csv')

        self.df = self.net[pp_type]

When running a test with chain assignments and separately with loc, I print the result after processing (at the end of the code), this happens 15 times, with fix2617 edits, printing happens once because of if input_df.empty:

pytest pandapower/test/converter/test_from_cim.py::test_Simbench_1_EHV_mixed__2_no_sw_res_gen -s

I also compare the result of the previous output with the current one: print('equals !!!', self.net[pp_type].equals(self.df))

and on the 15th call I wrote both variants to the files. The result of fix2617 is also written.

When comparing all three, the only difference is fix2617['element'] is of type int64 in chain assignments, loc float64 and that in older versions there is no check for empty input_df, which is why the code is executed completely 15 times.

Is there any special case to reproduce the error when using loc?

quant12345 avatar Jun 17 '25 10:06 quant12345

difference is fix2617['element'] is of type int64 in chain assignments, loc float64

Did you see the different dtypes when comparing the isolated chained assignment change? Or only with fix2617 (possibly from the empty df check)?

jthurner avatar Jun 23 '25 07:06 jthurner

difference is fix2617['element'] is of type int64 in chain assignments, loc float64

Did you see the different dtypes when comparing the isolated chained assignment change? Or only with fix2617 (possibly from the empty df check)?

The dataframe column types (chained and loc) match. They do not match the fix2617 dataframe. I used the code provided to write three dataframes separately to three files: chained, loc, fix2617. Checked with compare and it returns an empty dataframe when compared. But equals returns False. I also checked the column names, indexes, shape, they are identical for all dataframes, except for the column types. Here is the result of comparing the column types:

print('chained.equals(fix2617): ', chained.equals(fix2617))            
print('loc.equals(fix2617) :', loc.equals(fix2617))                    
print('chained.equals(loc) :', chained.equals(loc))                    
                                                                                                                        
print('chained.dtypes == fix2617.dtypes :\n', chained.dtypes == fix2617.dtypes) 
print('loc.dtypes == fix2617.dtypes :\n', loc.dtypes == fix2617.dtypes)         
print('chained.dtypes == loc.dtypes :\n', chained.dtypes == loc.dtypes)           
click output:
chained.equals(fix2617):  False
loc.equals(fix2617) : False
chained.equals(loc) : True


chained.dtypes == fix2617.dtypes :
 Unnamed: 0           True
name                 True
measurement_type     True
element_type         True
element             False
value                True
std_dev              True
side                 True
origin_id            True
origin_class         True
source               True
analog_id            True
terminal_id          True
description          True
dtype: bool
loc.dtypes == fix2617.dtypes :
 Unnamed: 0           True
name                 True
measurement_type     True
element_type         True
element             False
value                True
std_dev              True
side                 True
origin_id            True
origin_class         True
source               True
analog_id            True
terminal_id          True
description          True
dtype: bool
chained.dtypes == loc.dtypes :
 Unnamed: 0          True
name                True
measurement_type    True
element_type        True
element             True
value               True
std_dev             True
side                True
origin_id           True
origin_class        True
source              True
analog_id           True
terminal_id         True
description         True
dtype: bool

If you change fix2617['element'] to float64, then equals returns True.

With the old code in both versions the element column initially has the type: uint32 and changes after this line:

self.net[pp_type] = pd.concat([self.net[pp_type], pd.DataFrame(None, index=[list(range(input_df.index.size))])],
                                      ignore_index=True, sort=False)

quant12345 avatar Jun 23 '25 09:06 quant12345

Here is a small example of a difference in these calls that could lead to different results:

from pandas import DataFrame

df1 = DataFrame({'1': [.1, .2, .3], '2': [1, 4, 5]})
df2 = DataFrame({'1': ['a', 'b', 'c', 'd', 'e'], '2': [1, 2, 3, 4, 5]})
df4 = DataFrame(df2)

df4.loc[1:, '2'] = df1['1'][:]
df2['2'][1:] = df1['1'][:]
For a more detailed explanation click here A more detailed explanation: The df1 has 3 values per column, so `df1['1'][:]` has length 3 but `df4.loc[1:, '2']` is a slice of 4 values so if all values are to be overwritten it would overwrite the last value with something not contained in df1 aka. nan. Intuitively one would expect it to take the values and apply them like a copy paste in excel and only overwrite for the length given leaving the last value untouched.

The first call using .loc with a slice indexer adds all values to the output df overwriting anything past the length of the value with nan. The second call fails because the selected section of df2 is longer than the values provided.

While this might not be the exact scenario we had here, it is similar enough though. If the error got caught before and ignored or acted upon it would now lead to a different outcome than overwriting stuff with nan. So I also prefer not to encourage everyone to go fix these issues but to have several devs who understand this to go trough the warnings and change them accordingly. Not every place these warnings were thrown has enough tests to catch such edge cases.

KS-HTK avatar Jul 18 '25 09:07 KS-HTK

Can do something similar to using chain assignments by using iloc instead of loc:

# get the column index
ind_col = df4.columns.get_loc('2')

df4.iloc[1:, ind_col] = np.array(df1['1'][:])

I'm not sure if this is necessary, but you can fill in the slice df4['2'] based on the length of df1['1'] (the slice for loc is taken inclusively, that is, from rows(index) 1 to 3):

sub = np.array(df1['1'][:])
start_slice = 1

df4.loc[start_slice : len(sub), '2'] = sub
Potential problems with loc:

Another potential problem with loc. Let's say in df1 the indices are different from df4, then there will be an incorrect assignment:

df1 = pd.DataFrame({'1': [.1, .2, .3], '2': [1, 4, 5]}, index=[5, 6, 7])
df4.loc[2:, '2'] = df1['1'][:]

quant12345 avatar Jul 19 '25 11:07 quant12345