cosima-recipes
New python kernels silently give different results in `Cross-contour_transport.ipynb`
When using a kernel newer than `analysis3-23.04`, the cross-contour transport calculated in this notebook is wrong (see the black line compared to the blue line, where kernel `analysis3-23.04` was used).
If anyone is keen to look into what has changed in the new Python versions and packages that changes the results, that would be great; please assign yourself. Otherwise I will add a sentence about which kernel to use at the top of the notebook for now and look into it during the COSIMA hackathon.
edit: changed kernel from `analysis3-22.07` to `analysis3-23.04`
Is it possible to tell which step of the code changed, i.e. at which point do the answers diverge?
And does someone have the list of which packages changed between `analysis3-22.07` and `analysis3-22.10`?
Are you getting any new warnings? My guess is there is some change in masking or handling of NaNs.
@adele-morrison: ~This is the literal answer to your Q.~ But @dsroberts can advise better I think.
EDIT: Diff for 23.04 to 23.07: conda_diff_04_to_07.txt
> And does someone have the list of which packages changed between `analysis3-22.07` and `analysis3-22.10`?
Sorry, I just looked into it again. The change happened between `analysis3-23.04` and `analysis3-23.07`.
@anton-seaice There is no warning using the newer kernels, except for one on chunking affecting performance.
Hi @schmidt-christina, @anton-seaice. The only major issue I'm aware of between those two revisions is the change in default chunking that affects the ERA5 data set. I've not seen that change any numerical output, just how long it takes to get there. It would be interesting to look into; which notebook is it?
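As an aside, if chunking is a suspect, here is a minimal sketch (the path and chunk sizes are hypothetical) of pinning chunks explicitly when opening the data, so the dask layout doesn't depend on an environment's defaults:

```python
import xarray as xr

# Hypothetical ERA5-style file glob; specifying `chunks` explicitly keeps the
# dask chunking identical across conda environments.
ds = xr.open_mfdataset("/path/to/era5/*.nc", chunks={"time": 12})
print(ds.chunks)
```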
This one: https://github.com/COSIMA/cosima-recipes/blob/main/DocumentedExamples/Cross-contour_transport.ipynb
Probably easiest to test in the notebook Hannah created when she came across this issue (it's a condensed version of the notebook Adele pointed to, with only the necessary code for the selection of T/S on the contour and binning into density classes), but GitHub won't allow me to upload it here. Should I send it via email @dsroberts ?
@schmidt-christina, nice instructions here from @aidanheerdegen on how to share code: https://forum.access-hive.org.au/t/how-to-share-your-jupyter-notebook/692
Maybe others can help also if we can make it public.
I uploaded the notebook here
Good catch @schmidt-christina and thanks for raising this issue! This is quite alarming tbh that something would silently start giving different results! Could you confirm that you don't have any warnings suppressed (sometimes we do that at the beginning because warnings can overwhelm us -- I know I do at least!)?
I think we should post this in the forum -- others might be getting strange results? I'm quite alarmed actually...
I was also getting different results in https://github.com/COSIMA/cosima-recipes/pull/322 but I (thought I) was convinced it was because the methods for computing meridional heat transport depend on the reference temperature used. I will revisit the PR with this in mind and try using an older conda env....
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:
https://forum.access-hive.org.au/t/newest-conda-env-kernels-silently-change-model-output-analysis-results/2010/1
@schmidt-christina (or @salvajoe, @willaguiar, @hrsdawson), it would be extremely useful if you could strip down the notebook to what we call a Minimal Working Example (MWE). That is, the minimum code for which the problem arises using kernel A and not for kernel B. The notebook at the moment has a lot of things -- are all of them needed to showcase the problem? Perhaps they are, I don't know. If they are then leave them!
[Usually this process of stripping down to a minimal working example helps a lot in pinpointing the issue and, sometimes, even finding the solution!]
I know I'm asking you to do work but could you please? This seems an important issue. Results that change silently after upgrading packages are very sneaky and insidious, and could result in a lot of struggles for others who might be thinking "oh.. this must be my fault for not getting it right...". Or this could result in publishing wrong results!
On another note. What do the lines
```python
import logging
logging.captureWarnings(True)
logging.getLogger('py.warnings').setLevel(logging.ERROR)
```
do? Do they suppress warnings? If we remove them do you get any meaningful warning perhaps?
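For reference, a minimal sketch of what those three lines do, assuming nothing else has configured logging: `captureWarnings(True)` reroutes anything from `warnings.warn()` into the `py.warnings` logger, and raising that logger's level to `ERROR` then drops ordinary warnings.

```python
import logging
import warnings

logging.captureWarnings(True)  # route warnings.warn() into the 'py.warnings' logger
logging.getLogger('py.warnings').setLevel(logging.ERROR)

warnings.warn("this is swallowed silently")  # filtered out: WARNING < ERROR

logging.captureWarnings(False)  # restore normal behaviour
warnings.warn("this one is printed to stderr again")
```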
Hi all. I've managed to reproduce the incorrect behaviour in the analysis3/23.04 kernel. The issue seems to be that the `sortby()` function used in line 17 of cell 11 results in different output between the two kernels:
(Plots:)
- `vol_trans_across_contour.sortby(contour_ordering)[0].plot()` under analysis3/23.04
- `vol_trans_across_contour.sortby(contour_ordering)[0].plot()` under analysis3/23.07
- `vol_trans_across_contour[0].plot()` under analysis3/23.04
So it seems like `sortby()` is doing nothing in the newer kernels. Unfortunately I don't know why that is yet, nor how to fix it. I just wanted to give you all a quick update before I head out for the rest of the day.
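For anyone unfamiliar with it, a toy sketch (not the notebook's data) of what `sortby()` is supposed to do: reorder the data to follow the sort order of the given coordinate or DataArray.

```python
import xarray as xr

# Toy DataArray with an unsorted coordinate; sortby should reorder the values
# so that 'contour_index' comes out ascending.
da = xr.DataArray([10., 20., 30.], dims="contour_index",
                  coords={"contour_index": [3, 1, 2]})
print(da.sortby("contour_index").values)  # -> [20. 30. 10.]
```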
> are all of them needed to showcase the problem?

All of the calculations in the notebook are needed (transport across an isobath binned into density classes is quite a long and complicated calculation). Calculations are only done for one kernel, and then we just load data saved from my 2023 study to compare against.
> On another note. What do the lines
> ```python
> import logging
> logging.captureWarnings(True)
> logging.getLogger('py.warnings').setLevel(logging.ERROR)
> ```
> do? Do they suppress warnings? If we remove them do you get any meaningful warning perhaps?
I tested that and removed the lines, but still no additional warnings.
thanks @schmidt-christina for checking!
thanks @dsroberts for figuring this out -- holding onto my chair for whatever else this investigation might bring about and for the end of the episode with `dataarray.sortby()`
OK, I think I've figured it out. The issue of why `.sortby()` doesn't do anything boils down to these lines here:
```python
# get lat and lon along contour, useful for plotting later:
lat_along_contour = contour_ordering.y_ocean
lon_along_contour = contour_ordering.x_ocean

contour_index_array = np.arange(1, len(contour_ordering)+1)

# don't need the multi-index anymore, replace with contour count and save
lat_along_contour.coords['contour_index'] = contour_index_array
lon_along_contour.coords['contour_index'] = contour_index_array
```
tl;dr - the solution is to add `.copy()` onto the end of `lat_along_contour = ...` and `lon_along_contour = ...`, i.e.
```python
lat_along_contour = contour_ordering.y_ocean.copy()
lon_along_contour = contour_ordering.x_ocean.copy()
```
The reason for this is a bit of a journey.
The difference in behaviour between the two conda environments comes down to the construction of the DataArrays on the right-hand side of the equals sign in the first two lines above. In 23.04, the DataArray created by running `contour_ordering.y_ocean` contains a copy of all of the objects needed from the original `contour_ordering` DataArray, including the `pandas.MultiIndex` object responsible for linking the `contour_index` coordinate to the data. In 23.07, this is not the case: the `pandas.MultiIndex` object associated with `contour_ordering.y_ocean` is a reference to that object in the `contour_ordering` DataArray. You can see this by using the `is` operator:
```python
### 23.04
>>> lat_along_contour.get_index('contour_index') is contour_ordering.get_index('contour_index')
False

### 23.07
>>> lat_along_contour.get_index('contour_index') is contour_ordering.get_index('contour_index')
True
```
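For anyone wanting to poke at this outside the notebook, a self-contained toy sketch of the same check (the coordinate values here are made up; stacking is what gives the DataArray its `pandas.MultiIndex`):

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for contour_ordering: stacking creates a
# pandas.MultiIndex named 'contour_index' over (y_ocean, x_ocean).
da = xr.DataArray(np.arange(4.).reshape(2, 2),
                  dims=("y_ocean", "x_ocean"),
                  coords={"y_ocean": [-70., -60.], "x_ocean": [0., 90.]})
contour_ordering = da.stack(contour_index=("y_ocean", "x_ocean"))

lat_along_contour = contour_ordering.y_ocean

# Whether the extracted coordinate shares the parent's index object
# depends on the xarray version, which is the crux of the bug above.
print(lat_along_contour.get_index("contour_index")
      is contour_ordering.get_index("contour_index"))
```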
So when the following code runs:
```python
lat_along_contour.coords['contour_index'] = contour_index_array
```
the original `pandas.MultiIndex` is deleted, as it is believed to no longer be necessary, which means that in 23.07 `contour_ordering` also loses its index and has it replaced with a default `pandas.RangeIndex`. This explains why `.sortby(contour_index)` does nothing. The `.sortby()` function first attempts to align the `contour_index` DataArray to the `vol_trans_across_contour` DataArray, and it aligns the coordinates successfully in both cases; however, because none of the elements in the newly created `pandas.RangeIndex` appears in the `vol_trans_across_contour` DataArray's `pandas.MultiIndex` in 23.07, it is unable to change the data along with the `contour_index` coordinate. Funnily enough, the actual `np.lexsort()` call within `.sortby()` does nothing in your case; the `xr.align()` call does all the work. The screenshot below demonstrates this.
Note how the coordinates of `contour_order` and `b` have been changed after the alignment in both notebooks (analysis3-23.04 on the left, analysis3-23.07 on the right), but `array` on the right has not been changed.
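A toy illustration of the alignment behaviour described above, independent of the notebook: `xr.align()` defaults to an inner join, keeping only the index labels the two objects have in common, so an index with no labels in common contributes nothing.

```python
import xarray as xr

a = xr.DataArray([1, 2, 3], dims="i", coords={"i": [0, 1, 2]})
b = xr.DataArray([9, 8], dims="i", coords={"i": [2, 0]})

# Inner join keeps only the shared labels [0, 2] and reorders b to match.
a2, b2 = xr.align(a, b)
print(a2.i.values, a2.values, b2.values)  # [0 2] [1 3] [8 9]
```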
Anyway, the solution is to add `.copy()` onto the end of `lat_along_contour = ...` and `lon_along_contour = ...`, i.e.
```python
lat_along_contour = contour_ordering.y_ocean.copy()
lon_along_contour = contour_ordering.x_ocean.copy()
```
This creates a copy of the objects within the `contour_ordering.y_ocean` DataArray, and therefore a copy of the original `pandas.MultiIndex` is created and assigned to `lat_along_contour`. This means the original index object in `contour_ordering` is not removed when the coords in `lat_along_contour` and `lon_along_contour` are modified, and `xr.align()`, and therefore `.sortby()`, works as expected. I think this behaviour is erroneous, as an object has been deleted whilst there are still active references to it, so I think it's worth me submitting an issue on the xarray GitHub, assuming it hasn't already been raised.
Nice!
Omg. Excellent detective work
Nice work @dsroberts !!
So I've dug a little further and tried to make a reproducer, and it turns out that `xarray` does warn you that this might cause problems:
```python
>>> c.coords['contour_index'] = np.arange(1,len(yc)+1)
/jobfs/113635825.gadi-pbs/ipykernel_1298317/4250819915.py:1: FutureWarning: updating coordinate 'contour_index' with a PandasMultiIndex would leave the multi-index level coordinates ['y_ocean', 'x_ocean'] in an inconsistent state. This will raise an error in the future. Use `.drop_vars(['contour_index', 'y_ocean', 'x_ocean'])` before assigning new coordinate values.
  c.coords['contour_index'] = np.arange(1,len(yc)+1)
```
And it does: the `PandasMultiIndex` is indeed in an inconsistent state. However, when I run `import cosima_cookbook as cc`, that warning disappears. This import has the effect of suppressing everything coming out of `warnings.warn()` (not just from the cosima cookbook) unless a user knows to initialise the `py.warnings` logger. Which, to be fair, the original author of this notebook attempted to do, but did not add a handler, so the warnings were going into the void. Placing the following
```python
import logging
import sys

console_handler = logging.StreamHandler(sys.stderr)
logger = logging.getLogger("py.warnings")
logger.addHandler(console_handler)
```
in the first cell of @schmidt-christina's notebook causes that warning to appear in a few places. Given this, I think the best place to raise an issue would be over on the COSIMA cookbook repo, as `xarray` did try to warn us, even on analysis3-23.04, except warnings were turned off. The fix suggested in the warning (`.drop_vars(['contour_index', 'y_ocean', 'x_ocean'])`) does work for the notebook, i.e.:
```python
# get lat and lon along contour, useful for plotting later:
lat_along_contour = contour_ordering.y_ocean.drop_vars(['contour_index', 'y_ocean', 'x_ocean'])
lon_along_contour = contour_ordering.x_ocean.drop_vars(['contour_index', 'y_ocean', 'x_ocean'])

contour_index_array = np.arange(1, len(contour_ordering)+1)

# don't need the multi-index anymore, replace with contour count and save
lat_along_contour.coords['contour_index'] = contour_index_array
lon_along_contour.coords['contour_index'] = contour_index_array
```
and is a more appropriate solution than using `.copy()`. Dropping vars removes any copy of, or reference to, the `pandas.MultiIndex` object from `contour_ordering`, so it can't get messed up when updating the coordinates.
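As a side note, one defensive option for notebooks (an assumption on my part, not something the notebook currently does) is to promote `FutureWarning`s to hard errors, so that breaking changes like this one fail loudly instead of silently:

```python
import warnings

# Turn every FutureWarning into an exception; any code path whose behaviour
# is about to change across versions now stops with a traceback.
warnings.simplefilter("error", FutureWarning)
```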
@schmidt-christina would you like to open a PR to fix this example?
@adele-morrison I'm hoping we fix this before the hackathon; this is quite urgent. At the moment we have an example online that silently gives wrong results and we are potentially creating nightmares for people.
@dsroberts did an amazing detective job upstairs figuring the issue out
could someone open a PR and fix the example?
cc @willaguiar, @schmidt-christina, @salvajoe
Even a PR that simply deletes the example is better than having something there which is wrong and makes people lose sleep or potentially deteriorates their mental health...
If nobody has the capacity to fix it I can open a PR to delete the example and whenever we have the capacity we can put it back corrected.
> Even a PR that simply deletes the example is better

Hmmm, this seems a little drastic. What about a PR that says `analysis3-23.04` or earlier is needed? And then we can fix properly at the hackathon?
A caveat message is also a good solution! And if we do that, then we can also link to this issue.
(Deleting doesn't mean much... the notebooks always stay in the GitHub history! We just don't want this to be advertised as an example that we present to people as "working out of the box", one that they take on and modify to continue their work.)
I just ran this using `analysis3-24.01` and there is a different error / it is failing earlier.

The method to extract the points along the contour from matplotlib is failing. There should be a second 'tomato' coloured line in this plot from the "Select the contour" step.
> I just ran this using analysis3-24.01 and there is a different error / it is failing earlier.

@anton-seaice shall we raise a new issue?

> @anton-seaice shall we raise a new issue?

I don't mind, but it would "hopefully" get fixed in the same PR, so I wouldn't?
Regarding @anton-seaice's bug about finding the contour:

- This now fails because `sc.get_paths()` now stacks all the contours instead of separating them into different paths. This is easy to fix by checking where `np.diff(x_contour) > 1.01`, for example (see the sketch after this list).
- But, with a new conda environment, the values of `x_contour` from `sc.get_paths()` are not integers, which is problematic for the current code that uses these as integers. The values of `x_contour` don't increase by more than 1, so possibly we can just convert them into integers and proceed.
- Another problem (found while trying to convert this to use the Antarctic slope 1000 m isobath instead of the Southern Ocean SSH one) is that this method of finding the contour results in a contour that is on the land/sea mask east of the Weddell. I think we might need to proceed just ignoring this for now though.
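A hedged sketch of the first two fixes combined (the helper name, the toy data, and the 1.01 threshold are illustrative; `x_contour`/`y_contour` are assumed to be the stacked vertex arrays from `sc.get_paths()[0].vertices.T`, untested against the actual notebook):

```python
import numpy as np

def split_contours(x_contour, y_contour, jump=1.01):
    """Split stacked contour vertices into separate contours wherever
    consecutive x values jump by more than `jump` grid cells, then round
    the float vertices back to integer grid indices."""
    breaks = np.where(np.abs(np.diff(x_contour)) > jump)[0] + 1
    xs = np.split(x_contour, breaks)
    ys = np.split(y_contour, breaks)
    return ([np.rint(x).astype(int) for x in xs],
            [np.rint(y).astype(int) for y in ys])

# Toy check: two stacked contours, the second starting far from the first.
x = np.array([10., 11., 12., 50., 51.])
y = np.array([3., 3., 4., 7., 8.])
xs, ys = split_contours(x, y)
print([list(s) for s in xs])  # -> [[10, 11, 12], [50, 51]]
```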