E3SM icon indicating copy to clipboard operation
E3SM copied to clipboard

EAMxx: wrong p3 diagnostic tendencies

Open mahf708 opened this issue 2 months ago • 5 comments

Po-Lun (@polunma) from the @eagles-project reported to Hassan and Naser a suspicious piece of code in P3 C++. In summary, it looks like qc_sed which was added a while back may not be correct.

Consider the following logic:

https://github.com/E3SM-Project/E3SM/blob/814f4153e7bc2ed32aaba6fad3afd45b7bd5f46a/components/eamxx/src/physics/p3/impl/p3_cloud_sed_impl.hpp#L141

where qc_tend is pointing to the underlying data of the field qc_sed. The problem arises because qc_tend isn't initialized to qc_previous. The same pattern exists in P3 F90, but crucially there qc_tend is initialized to qc_previous, and so (qc - qc_previous)/dt is the correct tendency. In the C++ version, we are subtracting qc_tend recursively, which is wrong.

This pattern exists beyond the specific example above, and so far we are only noticing it in diagnostic tendencies. The proposed solution to this is twofold:

  1. Survey the code for this pattern and ensure all instances are accounted for (including once with tend_ignore constructs)
  2. Ensure the tendency calculation is done such that it is (entity-entity_previous)/dt. One way to achieve this is to either (deep) copy the entity into entity_tend atop the calculation or add a new wsm var to house entity_previous

Credit to @polunma and @eagles-project for investigating and reporting this bug, along with a clear and detailed evidence and solution plan! Thanks!!!

mahf708 avatar Oct 24 '25 00:10 mahf708

It's odd that valgrind would not catch this kind of error. Perhaps all our kokkos views get inited to 0 at construction. We should prob avoid that, and use the ViewAllocateWithoutInitializing arg to the view constructor. @tcclevenger can you confirm that kokkos by default 0-initializes the allocated memory?

bartgol avatar Oct 24 '25 03:10 bartgol

@bartgol I thought Kokkos defaults to no initialization. But it looks like just because you add the Kokkos::WithoutInitializing flag, you may still get 0 values. This is probably a host-specific thing?

tcclevenger avatar Oct 24 '25 13:10 tcclevenger

Note a lot of these are explicitly initialized to zero somewhere early

How do we feel about asking good bot 🤖 to take a stab at this? We need to sieve through the code for any type of this pattern... if yes, I will edit the description so that it's more helpful to my dear friend 🤖

mahf708 avatar Oct 24 '25 13:10 mahf708

We do not use Kokkos::WithoutInitializing so we do get whatever the default allocation is. I always assumed kokkos inits to 0, since that was the only way I could explain the existence of Kokkos::WithoutInitializing...

This is something I thought about doing for a while. Namely, I wanted to change allocate_view so that it passes Kokkos::WithoutInitializing to the view ctor. The alternative was to init with NaN, but that may bite us back with some false FPE positve. Not initing should be enough to get valgrind to spot the errors.

Edit: sometimes you do get 0 values even if you don't init. Sometimes not. The reason is safety/security. When you get memory, you may get one of the following:

  • a new memory page from the OS: in this case, the OS ensures the memory is 0-ed out, to prevent snooping on other programs data
  • an old memory page that was already given to your program: in this case, the OS is not involved, and nobody 0s out the memory, resulting in garbage data.

Sometimes you have the impression your memory IS 0-ed out, but it's not. Valgrind, however, injects all alloc calls with some backtracing mechanism, so it is capable to detect if memory was not explicitly inited to 0 (i.e., valgrind does not consider (or know) whether the OS 0-ed out the memory page).

bartgol avatar Oct 24 '25 23:10 bartgol

I noticed this issue in 3 places: "p3_cloud_sed_impl.hpp" (which computes qc_tend and nc_tend), "p3_ice_sed_impl.hpp" (which computes qi_tend and ni_tend), "and "p3_rain_sed_impl.hpp" (which computes qi_tend and ni_tend).

polunma avatar Oct 27 '25 19:10 polunma