FMS icon indicating copy to clipboard operation
FMS copied to clipboard

Uncovered code path in adjoint global sum

Open mathomp4 opened this issue 5 years ago • 13 comments

All,

Recently here at GEOS we are trying using a newer (for us) version of FMS (see #155) based on Xanadu for testing with MOM6 (and soon might move to something based off of your master). To that end, we tried running our ADAS (atmospheric data assimilation system) with this newer FMS and found that it crashed. To wit:

FATAL from PE    12: MPP_GLOBAL_SUM_: BITWISE_EFP_SUM is only implemented for real number, contact developer

The user contacted me, and I wended my way through the FMS code and am pretty sure this line is coming from: https://github.com/NOAA-GFDL/FMS/blob/84791ab270a6f40a05816ad6a8f54d7dddc76cbf/mpp/include/mpp_global_sum_ad.h#L127-L138 (Note: I know it says MPP_GLOBAL_SUM_ but some/a couple of the adjoint files need the strings fixed to be something like MPP_GLOBAL_SUM_AD_.)

This surprised us at first since the code in GEOS calling this hadn't changed in years! But it turns out our FMS had a really old version of this file where the only code paths were BITWISE_EXACT_SUM or else so we fell over. We were asking for BITWISE_EFP_SUM but the old if-block didn't have an entry. New FMS does.

Doing an rg on the FMS source I found that DO_EFP_SUM_ only exists in three files: mpp/include/mpp_domains_reduce.inc, mpp/include/mpp_global_sum.h, and mpp/include/mpp_global_sum_ad.h. We use BITWISE_EFP_SUM in our FV3 core, so that works, so I looked at mpp/include/mpp_domains_reduce.inc and it looks like you have #define DO_EFP_SUM_ for the non-adjoint real sums: https://github.com/NOAA-GFDL/FMS/blob/84791ab270a6f40a05816ad6a8f54d7dddc76cbf/mpp/include/mpp_domains_reduce.inc#L176-L189 but the adjoint ones don't: https://github.com/NOAA-GFDL/FMS/blob/84791ab270a6f40a05816ad6a8f54d7dddc76cbf/mpp/include/mpp_domains_reduce.inc#L525-L537

So, I added some #define DO_EFP_SUM_ and #undef DO_EFP_SUM_ weaving through the reals and complex in mpp_domains_reduce.inc and if nothing else it let our ADAS run (the evaluation of the results is ongoing).

This leads to the first question: Was there a reason the DO_EFP_SUM_ path was never enabled for the adjoint sum? If the EFP math in mpp_global_sum_ad.h is incorrect, I can see making sure no one used it by not adding #define in mpp_domains_reduce.inc. You error out so it forces one to use the BITWISE_EXACT_SUM or the non-exact. If so, I can talk with our developers and look at changing our calling code.

Or was this just a "no one has ever called this routine" sort of thing? If so, I could file a pull request with my fix.

mathomp4 avatar Sep 19 '19 13:09 mathomp4

We are reviewing this internally and we’ll let you know what we discover.

underwoo avatar Nov 06 '19 17:11 underwoo

@mathomp4 I'll check with our internal developers if they got any information on this. Last I heard, they were unsure where this code came from, and are unable to verify if it is a problem. We'll keep you apprised as we find out more information.

underwoo avatar Nov 27 '19 19:11 underwoo

Ooof. This is an old one. @danholdaway do you know if your latest adjoint fixes "fixed" this? Or perhaps it was fixed loooooooong ago?

mathomp4 avatar Sep 28 '21 15:09 mathomp4

@mathomp4 can you diff the offending file in GEOSadas with the one in this directory: https://github.com/JCSDA-internal/fv3-jedi-linearmodel/tree/develop/src/dynamics/atmos_cubed_sphere/model_tlmadm If there's a difference then it's possible it was fixed.

danholdaway avatar Sep 29 '21 15:09 danholdaway

@mathomp4 @danholdaway do you have an update on this? I know this has fallen off the radar, but I want to check in and see if there's any progress.

thomas-robinson avatar Feb 09 '23 12:02 thomas-robinson

I'm not sure if it is still relevant. Everything is good on the JEDI side. We have JEDI tested with fms 2022 versions and all is OK. Not sure if there remains an issue on the GEOS side.

danholdaway avatar Feb 09 '23 14:02 danholdaway

I'm not sure if it is still relevant. Everything is good on the JEDI side. We have JEDI tested with fms 2022 versions and all is OK. Not sure if there remains an issue on the GEOS side.

Well, @danholdaway is the person I'd refer to about this anyway, so... 🤷🏼 But I don't see DO_EFP_SUM_ under the adjoint.

mathomp4 avatar Feb 09 '23 15:02 mathomp4

Perhaps we can close this then. It's been a few years and the problem hasn't been mentioned.

danholdaway avatar Feb 09 '23 15:02 danholdaway

Are you running with a current-ish version of FMS now?

thomas-robinson avatar Feb 09 '23 16:02 thomas-robinson

We have some branches in JEDI ready to go updating things to 2022.02. Though in develop we are using something significantly older.

danholdaway avatar Feb 09 '23 19:02 danholdaway

Indeed. I hope to move GEOS to use FMS 2022.04 soon. The main issue is FMS 2022+ changes some interfaces and so I need to changes interfaces in our FV3 (which isn't the same as GFDL, has diverged some). Some changes are coming to our FV3 soon, so once that stabilizes, I'll try and move GEOS to FMS 2022.04 (or the newest) as soon as I can.

mathomp4 avatar Feb 10 '23 13:02 mathomp4

@mathomp4 are the interface changes related to IO (fms2_io) or something else?

thomas-robinson avatar Feb 10 '23 13:02 thomas-robinson

@mathomp4 are the interface changes related to IO (fms2_io) or something else?

@thomas-robinson No. I think nest_level got added to a bunch of mpp routines and I don't think it was optional. Or maybe we'd used keywords by that point and so even if nest_level was optional, we had to include it due to Fortran rules?

Not sure. It's possible this was only during a certain version and it's changed in 2022.04? But we'll find out when we get there if it's still necessary.

mathomp4 avatar Feb 10 '23 14:02 mathomp4