ccpp-physics

GFS_phys_time_vary_init does not report errmsg/errflg correctly due to thread race condition

Open SamuelTrahanNOAA opened this issue 1 year ago • 0 comments

Description

Normally I don't cross-post bugs between forks, but this is a pretty big one. I want to make sure everyone is aware.

I reported it in the UFS fork already: https://github.com/ufs-community/ccpp-physics/issues/105

GFS_phys_time_vary_init is parallelized using OpenMP sections, but it does not handle errmsg or errflg correctly. All threads update the same errmsg and errflg, so a failure reported by one section can be overwritten by a success reported by a section that finishes later.

To visualize this, suppose there are two threads running at once. For simplicity's sake, let's say there are only two initialization calls: init_that_fails() and init_that_succeeds().

Failure happens first

Events happened in this order:

  1. Thread 1: Completes init_that_fails() and sets errflg=1
  2. Thread 2: Completes init_that_succeeds() and sets errflg=0

The final errflg is 0, so the model will run even though one of the initialization steps failed.

Failure happens second

Events happened in this order:

  1. Thread 2: Completes init_that_succeeds() and sets errflg=0
  2. Thread 1: Completes init_that_fails() and sets errflg=1

The final errflg is 1, so the model will abort as expected.
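The two orderings above can be replayed deterministically. The following is only a Python stand-in for the behavior described, not the Fortran in GFS_phys_time_vary_init: `SharedError`, `replay`, and the bodies of the two init functions are hypothetical, and the sketch assumes the integer flag is errflg while errmsg carries the text.

```python
from dataclasses import dataclass


@dataclass
class SharedError:
    """Stand-in for the single errflg/errmsg pair that every section writes to."""
    errflg: int = 0
    errmsg: str = ""


def init_that_fails(err):
    # Hypothetical initialization step that reports a failure.
    err.errflg = 1
    err.errmsg = "init_that_fails: hypothetical failure"


def init_that_succeeds(err):
    # Hypothetical step that reports success and clobbers whatever was set before.
    err.errflg = 0
    err.errmsg = ""


def replay(completion_order):
    """Apply the steps to one shared error state in the order they complete."""
    err = SharedError()
    for step in completion_order:
        step(err)
    return err


failure_first = replay([init_that_fails, init_that_succeeds])
failure_second = replay([init_that_succeeds, init_that_fails])

print("failure first :", failure_first.errflg)
print("failure second:", failure_second.errflg)
```

Only the completion order differs between the two calls, which is why the real model fails on some runs and not on others.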

Steps to Reproduce


  1. Delete noahmptable.tbl
  2. Use a scheme that does not require that file (for example, a non-NOAHMP suite).
  3. Run the model a few times with at least two threads.
  4. Notice that it fails sporadically instead of 100% of the time.

Additional Context

This was discovered in an RRFS parallel. The machine, compiler, etc. do not matter. However, the easiest way to see it is to run a non-NOAHMP suite without noahmptable.tbl.
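One way to avoid this kind of overwrite, offered as a sketch only and not as a fix proposed in this report, is to give each parallel section its own error slot and merge the slots after all sections finish. The Python below illustrates the pattern with threads; `run_sections` and the two init functions are hypothetical names, and the real routine would need the equivalent in Fortran with OpenMP.

```python
import threading


def init_that_fails():
    # Hypothetical step: returns (errflg, errmsg) instead of writing shared state.
    return 1, "init_that_fails: hypothetical failure"


def init_that_succeeds():
    # Hypothetical step that completes without error.
    return 0, ""


def run_sections(steps):
    """Run each step in its own thread, then merge the per-step error results."""
    results = [(0, "")] * len(steps)

    def worker(index, step):
        # Each thread writes only to its own slot, so no result is overwritten.
        results[index] = step()

    threads = [
        threading.Thread(target=worker, args=(i, step))
        for i, step in enumerate(steps)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Merge after every thread has finished: the first nonzero flag is reported.
    for errflg, errmsg in results:
        if errflg != 0:
            return errflg, errmsg
    return 0, ""


errflg, errmsg = run_sections([init_that_fails, init_that_succeeds])
print(errflg, errmsg)
```

Because the merge happens after the join, the reported flag no longer depends on which thread finishes last.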

SamuelTrahanNOAA, Sep 21 '23 16:09