cosima-recipes
New python kernels silently give different results in `Cross-contour_transport.ipynb`
When using a kernel newer than `analysis3-23.04`, the cross-contour transport calculated in this notebook is wrong (see the black line compared to the blue line, where kernel `analysis3-23.04` was used).
If anyone is keen to look into what has changed in the new Python versions and packages that changes the results, that would be great; please assign yourself. Otherwise I will add a sentence about which kernel to use at the top of the notebook for now and look into it during the COSIMA hackathon.
edit: changed kernel from `analysis3-22.07` to `analysis3-23.04`
Is it possible to tell which step of the code changed, i.e. at which point do the answers diverge?
And does someone have the list of which packages changed between `analysis3-22.07` and `analysis3-22.10`?
Are you getting any new warnings? My guess is there is some change in masking or handling of NaNs.
@adele-morrison: ~This is the literal answer to your Q.~ But @dsroberts can advise better I think.
EDIT: Diff for 23.04 to 23.07: conda_diff_04_to_07.txt
> And does someone have the list of which packages changed between `analysis3-22.07` and `analysis3-22.10`?
Sorry, I just looked into it again. The change happened between `analysis3-23.04` and `analysis3-23.07`.
@anton-seaice There is no warning using the newer kernels, except for one on chunking affecting performance.
Hi @schmidt-christina, @anton-seaice. The only major issue I'm aware of between those two revisions is the change in default chunking that affects the ERA5 data set. I've not seen that change any numerical output, just how long it takes to get there. It would be interesting to look into; which notebook is it?
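As an aside, if chunking is a suspect, here is a minimal sketch (the path and chunk sizes are hypothetical) of pinning chunks explicitly when opening the data, so the dask layout doesn't depend on an environment's defaults:

```python
import xarray as xr

# Hypothetical ERA5-style file glob; specifying `chunks` explicitly keeps the
# dask chunking identical across conda environments.
ds = xr.open_mfdataset("/path/to/era5/*.nc", chunks={"time": 12})
print(ds.chunks)
```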
This one: https://github.com/COSIMA/cosima-recipes/blob/main/DocumentedExamples/Cross-contour_transport.ipynb
Probably easiest to test in the notebook Hannah created when she came across this issue (it's a condensed version of the notebook Adele pointed to, with only the necessary code for the selection of T/S on the contour and binning into density classes), but GitHub won't allow me to upload it here. Should I send it via email @dsroberts ?
@schmidt-christina, nice instructions here from @aidanheerdegen on how to share code: https://forum.access-hive.org.au/t/how-to-share-your-jupyter-notebook/692
Maybe others can help also if we can make it public.
I uploaded the notebook here
Good catch @schmidt-christina and thanks for raising this issue! This is quite alarming tbh that something would silently start giving different results! Could you confirm that you don't have any warnings suppressed (sometimes we do that at the beginning because warnings can overwhelm us -- I know I do at least!)?
I think we should post this in the forum -- others might be getting strange results? I'm quite alarmed actually...
I was also getting different results in https://github.com/COSIMA/cosima-recipes/pull/322 but I (thought I) was convinced it was because the methods for computing meridional heat transport depend on the reference temperature used. I will revisit the PR with this in mind and try using an older conda env....
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:
https://forum.access-hive.org.au/t/newest-conda-env-kernels-silently-change-model-output-analysis-results/2010/1
@schmidt-christina (or @salvajoe, @willaguiar, @hrsdawson), it would be extremely useful if you could strip down the notebook to what we call a Minimal Working Example (MWE). That is, the minimum code for which the problem arises using kernel A and not for kernel B. The notebook at the moment has a lot of things -- are all of them needed to showcase the problem? Perhaps they are, I don't know. If they are then leave them!
[Usually this process of stripping down to a minimal working example helps a lot in pinpointing the issue and, sometimes, even finding the solution!]
I know I'm asking you to do work but could you please? This seems an important issue. Results that change silently after upgrading packages are very sneaky and insidious, and could result in a lot of struggles for others who might be thinking "oh.. this must be my fault for not getting it right...". Or this could result in publishing wrong results!
On another note. What do the lines
```python
import logging
logging.captureWarnings(True)
logging.getLogger('py.warnings').setLevel(logging.ERROR)
```
do? Do they suppress warnings? If we remove them do you get any meaningful warning perhaps?
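For reference, a minimal sketch of what those three lines do, assuming nothing else has configured logging: `captureWarnings(True)` reroutes anything from `warnings.warn()` into the `py.warnings` logger, and raising that logger's level to `ERROR` then drops ordinary warnings.

```python
import logging
import warnings

logging.captureWarnings(True)  # route warnings.warn() into the 'py.warnings' logger
logging.getLogger('py.warnings').setLevel(logging.ERROR)

warnings.warn("this is swallowed silently")  # filtered out: WARNING < ERROR

logging.captureWarnings(False)  # restore normal behaviour
warnings.warn("this one is printed to stderr again")
```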
Hi all. I've managed to reproduce the incorrect behaviour in the analysis3/23.04 kernel. The issue seems to be that the `sortby()` function used in line 17 of cell 11 results in different output between the two kernels:
(Plots:)
- `vol_trans_across_contour.sortby(contour_ordering)[0].plot()` under analysis3/23.04
- `vol_trans_across_contour.sortby(contour_ordering)[0].plot()` under analysis3/23.07
- `vol_trans_across_contour[0].plot()` under analysis3/23.04
So it seems like `sortby()` is doing nothing in the newer kernels. Unfortunately I don't know why that is yet, nor how to fix it. I just wanted to give you all a quick update before I head out for the rest of the day.
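For anyone unfamiliar with it, a toy sketch (not the notebook's data) of what `sortby()` is supposed to do: reorder the data to follow the sort order of the given coordinate or DataArray.

```python
import xarray as xr

# Toy DataArray with an unsorted coordinate; sortby should reorder the values
# so that 'contour_index' comes out ascending.
da = xr.DataArray([10., 20., 30.], dims="contour_index",
                  coords={"contour_index": [3, 1, 2]})
print(da.sortby("contour_index").values)  # -> [20. 30. 10.]
```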
> are all of them needed to showcase the problem?

All of the calculations in the notebook are needed (transport across an isobath binned into density classes is quite a long and complicated calculation). Calculations are only done for one kernel, and then we just load data saved from my 2023 study to compare against.
> On another note. What do the lines
> ```python
> import logging
> logging.captureWarnings(True)
> logging.getLogger('py.warnings').setLevel(logging.ERROR)
> ```
> do? Do they suppress warnings? If we remove them do you get any meaningful warning perhaps?
I tested that and removed the lines, but still no additional warnings.
thanks @schmidt-christina for checking!
thanks @dsroberts for figuring this out -- holding onto my chair for whatever else this investigation might bring about and for the end of the episode with `dataarray.sortby()`
OK, I think I've figured it out. The issue of why `.sortby()` doesn't do anything boils down to these lines here:
```python
# get lat and lon along contour, useful for plotting later:
lat_along_contour = contour_ordering.y_ocean
lon_along_contour = contour_ordering.x_ocean

contour_index_array = np.arange(1, len(contour_ordering)+1)

# don't need the multi-index anymore, replace with contour count and save
lat_along_contour.coords['contour_index'] = contour_index_array
lon_along_contour.coords['contour_index'] = contour_index_array
```
tl;dr - the solution is to add `.copy()` onto the end of `lat_along_contour = ...` and `lon_along_contour = ...`, i.e.
```python
lat_along_contour = contour_ordering.y_ocean.copy()
lon_along_contour = contour_ordering.x_ocean.copy()
```
The reason for this is a bit of a journey.
The difference in behaviour between the two conda environments comes down to the construction of the DataArrays on the right-hand side of the equals sign in the first two lines above. In 23.04, the DataArray created by running `contour_ordering.y_ocean` contains a copy of all of the objects needed from the original `contour_ordering` DataArray, including the `pandas.MultiIndex` object responsible for linking the `contour_index` coordinate to the data. In 23.07, this is not the case: the `pandas.MultiIndex` object associated with `contour_ordering.y_ocean` is a reference to that object in the `contour_ordering` DataArray. You can see this by using the `is` operator:
```python
### 23.04
>>> lat_along_contour.get_index('contour_index') is contour_ordering.get_index('contour_index')
False

### 23.07
>>> lat_along_contour.get_index('contour_index') is contour_ordering.get_index('contour_index')
True
```
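For anyone wanting to poke at this outside the notebook, a self-contained toy sketch of the same check (the coordinate values here are made up; stacking is what gives the DataArray its `pandas.MultiIndex`):

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for contour_ordering: stacking creates a
# pandas.MultiIndex named 'contour_index' over (y_ocean, x_ocean).
da = xr.DataArray(np.arange(4.).reshape(2, 2),
                  dims=("y_ocean", "x_ocean"),
                  coords={"y_ocean": [-70., -60.], "x_ocean": [0., 90.]})
contour_ordering = da.stack(contour_index=("y_ocean", "x_ocean"))

lat_along_contour = contour_ordering.y_ocean

# Whether the extracted coordinate shares the parent's index object
# depends on the xarray version, which is the crux of the bug above.
print(lat_along_contour.get_index("contour_index")
      is contour_ordering.get_index("contour_index"))
```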
So when the following code runs:
```python
lat_along_contour.coords['contour_index'] = contour_index_array
```
the original `pandas.MultiIndex` is deleted, as it is believed to no longer be necessary, which means that in 23.07 `contour_ordering` also loses its index and has it replaced with a default `pandas.RangeIndex`. This explains why `.sortby(contour_index)` does nothing. The `.sortby()` function first attempts to align the `contour_index` DataArray to the `vol_trans_across_contour` DataArray, and it aligns the coordinates successfully in both cases; however, because none of the elements in the newly created `pandas.RangeIndex` appears in the `vol_trans_across_contour` DataArray's `pandas.MultiIndex` in 23.07, it is unable to change the data along with the `contour_index` coordinate. Funnily enough, the actual `np.lexsort()` call within `.sortby()` does nothing in your case; the `xr.align()` call does all the work. The screenshot below demonstrates this.
Note how the coordinates of `contour_order` and `b` have been changed after the alignment in both notebooks (analysis3-23.04 on the left, analysis3-23.07 on the right), but `array` on the right has not been changed.
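A toy illustration of the alignment behaviour described above, independent of the notebook: `xr.align()` defaults to an inner join, keeping only the index labels the two objects have in common, so an index with no labels in common contributes nothing.

```python
import xarray as xr

a = xr.DataArray([1, 2, 3], dims="i", coords={"i": [0, 1, 2]})
b = xr.DataArray([9, 8], dims="i", coords={"i": [2, 0]})

# Inner join keeps only the shared labels [0, 2] and reorders b to match.
a2, b2 = xr.align(a, b)
print(a2.i.values, a2.values, b2.values)  # [0 2] [1 3] [8 9]
```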
Anyway, the solution is to add `.copy()` onto the end of `lat_along_contour = ...` and `lon_along_contour = ...`, i.e.
```python
lat_along_contour = contour_ordering.y_ocean.copy()
lon_along_contour = contour_ordering.x_ocean.copy()
```
This creates a copy of the objects within the `contour_ordering.y_ocean` DataArray, and therefore a copy of the original `pandas.MultiIndex` is created and assigned to `lat_along_contour`. This means the original index object in `contour_ordering` is not removed when the coords in `lat_along_contour` and `lon_along_contour` are modified, and `xr.align()`, and therefore `.sortby()`, works as expected. I think this behaviour is erroneous, as an object has been deleted whilst there are still active references to it, so I think it's worth me submitting an issue on the xarray GitHub, assuming it hasn't already been raised.
Nice!
Omg. Excellent detective work
Nice work @dsroberts !!
So I've dug a little further and tried to make a reproducer, and it turns out that `xarray` does warn you that this might cause problems:
```python
>>> c.coords['contour_index'] = np.arange(1,len(yc)+1)
/jobfs/113635825.gadi-pbs/ipykernel_1298317/4250819915.py:1: FutureWarning: updating coordinate 'contour_index' with a PandasMultiIndex would leave the multi-index level coordinates ['y_ocean', 'x_ocean'] in an inconsistent state. This will raise an error in the future. Use `.drop_vars(['contour_index', 'y_ocean', 'x_ocean'])` before assigning new coordinate values.
  c.coords['contour_index'] = np.arange(1,len(yc)+1)
```
And it does: the `PandasMultiIndex` is indeed in an inconsistent state. However, when I run `import cosima_cookbook as cc`, that warning disappears. This import has the effect of suppressing everything coming out of `warnings.warn()` (not just from the cosima cookbook) unless a user knows to initialise the `py.warnings` logger. Which, to be fair, the original author of this notebook attempted to do, but did not add a handler, so the warnings were going into the void. Placing the following
```python
import logging
import sys

console_handler = logging.StreamHandler(sys.stderr)
logger = logging.getLogger("py.warnings")
logger.addHandler(console_handler)
```
in the first cell of @schmidt-christina's notebook causes that warning to appear in a few places. Given this, I think the best place to raise an issue would be over on the COSIMA cookbook repo, as `xarray` did try to warn us, even on analysis3-23.04, except warnings were turned off. The fix suggested in the warning (`.drop_vars(['contour_index', 'y_ocean', 'x_ocean'])`) does work for the notebook, i.e.:
```python
# get lat and lon along contour, useful for plotting later:
lat_along_contour = contour_ordering.y_ocean.drop_vars(['contour_index', 'y_ocean', 'x_ocean'])
lon_along_contour = contour_ordering.x_ocean.drop_vars(['contour_index', 'y_ocean', 'x_ocean'])

contour_index_array = np.arange(1, len(contour_ordering)+1)

# don't need the multi-index anymore, replace with contour count and save
lat_along_contour.coords['contour_index'] = contour_index_array
lon_along_contour.coords['contour_index'] = contour_index_array
```
and is a more appropriate solution than using `.copy()`. Dropping vars removes any copy of, or reference to, the `pandas.MultiIndex` object from `contour_ordering`, so it can't get messed up when updating the coordinates.
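As a side note, one defensive option for notebooks (an assumption on my part, not something the notebook currently does) is to promote `FutureWarning`s to hard errors, so that breaking changes like this one fail loudly instead of silently:

```python
import warnings

# Turn every FutureWarning into an exception; any code path whose behaviour
# is about to change across versions now stops with a traceback.
warnings.simplefilter("error", FutureWarning)
```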
@schmidt-christina would you like to open a PR to fix this example?
@adele-morrison I'm hoping we fix this before the hackathon; this is quite urgent. At the moment we have an example online that silently gives wrong results and we are potentially creating nightmares for people.
@dsroberts did an amazing detective job upstairs figuring the issue out
could someone open a PR and fix the example?
cc @willaguiar, @schmidt-christina, @salvajoe
Even a PR that simply deletes the example is better than having something there which is wrong and makes people lose sleep or potentially deteriorates their mental health...
If nobody has the capacity to fix it I can open a PR to delete the example and whenever we have the capacity we can put it back corrected.
> Even a PR that simply deletes the example is better

Hmmm, this seems a little drastic. What about a PR that says `analysis3-23.04` or earlier is needed? And then we can fix properly at the hackathon?
A caveat message is also a good solution! And if we do that, then we can also link to this issue.
(Deleting doesn't mean much... the notebooks always stay in the GitHub history! We just don't want this to be advertised as an example that we present to people as "working out of the box", one that they take on and modify to continue their work.)
I just ran this using `analysis3-24.01` and there is a different error / it is failing earlier.

The method to extract the points along the contour from matplotlib is failing. There should be a second 'tomato' coloured line in this plot from the "Select the contour" step.
> I just ran this using analysis3-24.01 and there is a different error / it is failing earlier.

@anton-seaice shall we raise a new issue?

> @anton-seaice shall we raise a new issue?

I don't mind, but it would "hopefully" get fixed in the same PR, so I wouldn't?
Regarding @anton-seaice's bug about finding the contour:

- This now fails because `sc.get_paths()` now stacks all the contours instead of separating them into different paths. This is easy to fix by checking where `np.diff(x_contour) > 1.01`, for example (see the sketch after this list).
- But, with a new conda environment, the values of `x_contour` from `sc.get_paths()` are not integers, which is problematic for the current code that uses these as integers. The values of `x_contour` don't increase by more than 1, so possibly we can just convert them into integers and proceed.
- Another problem (found while trying to convert this to use the Antarctic slope 1000 m isobath instead of the Southern Ocean SSH one) is that this method of finding the contour results in a contour that is on the land/sea mask east of the Weddell. I think we might need to proceed just ignoring this for now though.
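A hedged sketch of the first two fixes combined (the helper name, the toy data, and the 1.01 threshold are illustrative; `x_contour`/`y_contour` are assumed to be the stacked vertex arrays from `sc.get_paths()[0].vertices.T`, untested against the actual notebook):

```python
import numpy as np

def split_contours(x_contour, y_contour, jump=1.01):
    """Split stacked contour vertices into separate contours wherever
    consecutive x values jump by more than `jump` grid cells, then round
    the float vertices back to integer grid indices."""
    breaks = np.where(np.abs(np.diff(x_contour)) > jump)[0] + 1
    xs = np.split(x_contour, breaks)
    ys = np.split(y_contour, breaks)
    return ([np.rint(x).astype(int) for x in xs],
            [np.rint(y).astype(int) for y in ys])

# Toy check: two stacked contours, the second starting far from the first.
x = np.array([10., 11., 12., 50., 51.])
y = np.array([3., 3., 4., 7., 8.])
xs, ys = split_contours(x, y)
print([list(s) for s in xs])  # -> [[10, 11, 12], [50, 51]]
```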