ccpp-physics
GFS_phys_time_vary_init does not report errmsg/errflg correctly due to thread race condition
Description
Normally I don't cross-post bugs between forks, but this is a pretty big one. I want to make sure everyone is aware.
I reported it in the UFS fork already: https://github.com/ufs-community/ccpp-physics/issues/105
GFS_phys_time_vary_init is parallelized using OpenMP sections, but it does not handle errmsg or errflg correctly: all threads write to the same errmsg and errflg. As a result, a failure reported by one section can be overwritten by a success reported by a section that finishes later.
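The core problem is that several concurrently running sections share one status variable. The Python sketch below uses Python threads as a stand-in for OpenMP sections, and init_that_fails, init_that_succeeds, and run_inits are hypothetical names. It shows one common way to avoid the race: each task records its status in its own slot, and the results are combined only after all threads have joined, so a failure is never lost no matter which thread finishes last. This illustrates the general pattern and is not the upstream fix. In the Fortran code, a roughly analogous change might be a private flag per section that is checked after the parallel region.

```python
import threading

def init_that_fails():
    # Hypothetical init step that cannot find its table file.
    return 1, "noahmptable.tbl not found"

def init_that_succeeds():
    # Hypothetical init step that completes normally.
    return 0, ""

def run_inits(inits):
    # Each task writes only to its own slot, so no two threads share a status variable.
    results = [None] * len(inits)

    def worker(i, init):
        results[i] = init()

    threads = [threading.Thread(target=worker, args=(i, f)) for i, f in enumerate(inits)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Combine only after every thread has joined: the first nonzero flag (in task
    # order) is reported, independent of which thread happened to finish last.
    errflg, errmsg = 0, ""
    for flag, msg in results:
        if flag != 0 and errflg == 0:
            errflg, errmsg = flag, msg
    return errflg, errmsg

print(run_inits([init_that_fails, init_that_succeeds]))
print(run_inits([init_that_succeeds, init_that_fails]))
```

Both calls report the failure, because the combination step no longer depends on thread timing.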
To visualize this, suppose two threads are running at once. For simplicity's sake, let's say there are only two initialization calls: init_that_fails() and init_that_succeeds().
Failure happens first
Events happened in this order:
Thread 1: Completes init_that_fails() and sets errflg=1
Thread 2: Completes init_that_succeeds() and sets errflg=0
The final errflg is 0, so the model runs even though one of the initialization steps failed.
Failure happens second
Events happened in this order:
Thread 2: Completes init_that_succeeds() and sets errflg=0
Thread 1: Completes init_that_fails() and sets errflg=1
The final errflg is 1, so the model aborts as expected.
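The two scenarios above can be replayed deterministically. In this toy Python sketch, final_errflg and the event tuples are hypothetical; the shared flag is modeled as last-writer-wins, which is the behavior described in this issue.

```python
def final_errflg(events):
    # Replay completion events in order. Each step overwrites the shared flag,
    # mimicking the unsynchronized assignment described in the issue.
    errflg = 0
    for _thread, _step, flag in events:
        errflg = flag  # last writer wins
    return errflg

failure_first = [
    ("Thread 1", "init_that_fails", 1),
    ("Thread 2", "init_that_succeeds", 0),
]
failure_second = [
    ("Thread 2", "init_that_succeeds", 0),
    ("Thread 1", "init_that_fails", 1),
]

print(final_errflg(failure_first))   # 0: the model keeps running
print(final_errflg(failure_second))  # 1: the model aborts
```

The only difference between the two cases is completion order, which is why the outcome depends on thread timing.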
Steps to Reproduce
- Delete noahmptable.tbl
- Use a suite that does not require that file (i.e., a non-NOAHMP suite).
- Run the model a few times with at least two threads.
- Notice that it fails sporadically instead of 100% of the time.
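Because completion order varies from run to run, the last step above only sees failures some of the time. The toy Monte Carlo sketch below makes that visible. It assumes the two completion orders are equally likely, which is a simplification since real thread timing is not uniform; one_run and count_aborts are hypothetical names. Under last-writer-wins, roughly half of the simulated runs abort.

```python
import random

def one_run(rng):
    # Two init steps finish in a random order; the shared flag keeps
    # whichever value was written last.
    flags = [1, 0]  # init_that_fails, init_that_succeeds
    rng.shuffle(flags)
    errflg = 0
    for flag in flags:
        errflg = flag
    return errflg

def count_aborts(runs, seed=0):
    # Seeded generator so the experiment is repeatable.
    rng = random.Random(seed)
    return sum(one_run(rng) != 0 for _ in range(runs))

runs = 1000
print(f"{count_aborts(runs)} of {runs} runs aborted")
```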
Additional Context
This was discovered in an RRFS parallel. The machine, compiler, etc. don't matter. However, the easiest way to see it is to run a non-NOAHMP suite without noahmptable.tbl.