FMS
FMS copied to clipboard
Uncovered code path in adjoint global sum
All,
Recently here at GEOS we are trying using a newer (for us) version of FMS (see #155) based on Xanadu for testing with MOM6 (and soon might move to something based off of your master
). To that end, we tried running our ADAS (atmospheric data assimilation system) with this newer FMS and found that it crashed. To wit:
FATAL from PE 12: MPP_GLOBAL_SUM_: BITWISE_EFP_SUM is only implemented for real number, contact developer
The user contacted me, and I wended my way through the FMS code and am pretty sure this line is coming from:
https://github.com/NOAA-GFDL/FMS/blob/84791ab270a6f40a05816ad6a8f54d7dddc76cbf/mpp/include/mpp_global_sum_ad.h#L127-L138
(Note: I know it says MPP_GLOBAL_SUM_
but some/a couple of the adjoint files need the strings fixed to be something like MPP_GLOBAL_SUM_AD_
.)
This surprised us at first since the code in GEOS calling this hadn't changed in years! But it turns out our FMS had a really old version of this file where the only code paths were BITWISE_EXACT_SUM
or else
so we fell over. We were asking for BITWISE_EFP_SUM
but the old if-block didn't have an entry. New FMS does.
Doing an rg
on the FMS source I found that DO_EFP_SUM_
only exists in three files: mpp/include/mpp_domains_reduce.inc
, mpp/include/mpp_global_sum.h
, and mpp/include/mpp_global_sum_ad.h
. We use BITWISE_EFP_SUM
in our FV3 core, so that works, so I looked at mpp/include/mpp_domains_reduce.inc
and it looks like you have #define DO_EFP_SUM_
for the non-adjoint real sums:
https://github.com/NOAA-GFDL/FMS/blob/84791ab270a6f40a05816ad6a8f54d7dddc76cbf/mpp/include/mpp_domains_reduce.inc#L176-L189
but the adjoint ones don't:
https://github.com/NOAA-GFDL/FMS/blob/84791ab270a6f40a05816ad6a8f54d7dddc76cbf/mpp/include/mpp_domains_reduce.inc#L525-L537
So, I added some #define DO_EFP_SUM_
and #undef DO_EFP_SUM_
weaving through the reals and complex in mpp_domains_reduce.inc
and if nothing else it let our ADAS run (the evaluation of the results is ongoing).
This leads to the first question: Was there a reason the DO_EFP_SUM_
path was never enabled for the adjoint sum? If the EFP math in mpp_global_sum_ad.h
is incorrect, I can see making sure no one used it by not adding #define
in mpp_domains_reduce.inc
. You error out so it forces one to use the BITWISE_EXACT_SUM
or the non-exact. If so, I can talk with our developers and look at changing our calling code.
Or was this just a "no one has ever called this routine" sort of thing? If so, I could file a pull request with my fix.
We are reviewing this internally and we’ll let you know what we discover.
@mathomp4 I'll check with our internal developers if they got any information on this. Last I heard, they were unsure where this code came from, and are unable to verify if it is a problem. We'll keep you apprised as we find out more information.
Ooof. This is an old one. @danholdaway do you know if your latest adjoint fixes "fixed" this? Or perhaps it was fixed loooooooong ago?
@mathomp4 can you diff the offending file in GEOSadas with the one in this directory: https://github.com/JCSDA-internal/fv3-jedi-linearmodel/tree/develop/src/dynamics/atmos_cubed_sphere/model_tlmadm If there's a difference then it's possible it was fixed.
@mathomp4 @danholdaway do you have an update on this? I know this has fallen off the radar, but I want to check in and see if there's any progress.
I'm not sure if it is still relevant. Everything is good on the JEDI side. We have JEDI tested with fms 2022 versions and all is OK. Not sure if there remains an issue on the GEOS side.
I'm not sure if it is still relevant. Everything is good on the JEDI side. We have JEDI tested with fms 2022 versions and all is OK. Not sure if there remains an issue on the GEOS side.
Well, @danholdaway is the person I'd refer to about this anyway, so... 🤷🏼 But I don't see DO_EFP_SUM_
under the adjoint.
Perhaps we can close this then. It's been a few years and the problem hasn't been mentioned.
Are you running with a current-ish version of FMS now?
We have some branches in JEDI ready to go updating things to 2022.02. Though in develop we are using something significantly older.
Indeed. I hope to move GEOS to use FMS 2022.04 soon. The main issue is FMS 2022+ changes some interfaces and so I need to changes interfaces in our FV3 (which isn't the same as GFDL, has diverged some). Some changes are coming to our FV3 soon, so once that stabilizes, I'll try and move GEOS to FMS 2022.04 (or the newest) as soon as I can.
@mathomp4 are the interface changes related to IO (fms2_io) or something else?
@mathomp4 are the interface changes related to IO (fms2_io) or something else?
@thomas-robinson No. I think nest_level
got added to a bunch of mpp routines and I don't think it was optional. Or maybe we'd used keywords by that point and so even if nest_level
was optional, we had to include it due to Fortran rules?
Not sure. It's possible this was only during a certain version and it's changed in 2022.04? But we'll find out when we get there if it's still necessary.