
Delta GPU results give NaNs in the viscous sub-grid bubble benchmark case, whereas Phoenix GPU results do not

Open sbryngelson opened this issue 1 year ago • 1 comments

Delta GPU results give NaNs in the viscous sub-grid bubble benchmark case, whereas Phoenix GPU results do not.

This happens regardless of "memory size" (I've checked 4 GB).

I've tested A100s and A40s on Delta; both give the issue (discussed further on Slack).

I tested A100s and V100s on Phoenix, both of which do not give the issue.

Both computers use NVHPC 22.11.

The error is this:

 [ 40%]  Time step      358 of 901 @ t_step = 357
 [ 40%]  Time step      359 of 901 @ t_step = 358
 [ 40%]  Time step      360 of 901 @ t_step = 359
Warning: ieee_inexact is signaling
ERROR STOP NaN(s) in timestep output.
 NaN(s) in timestep output.            0            0            0            1
             0          360          198           99           99
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
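For triage across machines, it can help to pull the last reported time step out of a log like the one above. This is a minimal sketch, not part of MFC: the function name `last_step_before_nan` is hypothetical, and the regular expression assumes the progress lines always look like the three shown here.

```python
import re

# Matches progress lines such as
# " [ 40%]  Time step      360 of 901 @ t_step = 359"
# (format assumed from the log excerpt above).
STEP_RE = re.compile(r"Time step\s+(\d+)\s+of\s+(\d+)\s+@\s+t_step\s*=\s*(\d+)")


def last_step_before_nan(log_text):
    """Return (step, total, t_step) for the last progress line seen before
    the first line mentioning NaN. Returns None if the log has no NaN line
    or no progress line precedes it."""
    last = None
    for line in log_text.splitlines():
        if "NaN" in line:
            return last
        match = STEP_RE.search(line)
        if match:
            last = tuple(int(group) for group in match.groups())
    return None


# Excerpt copied from the log above.
SAMPLE_LOG = """\
 [ 40%]  Time step      358 of 901 @ t_step = 357
 [ 40%]  Time step      359 of 901 @ t_step = 358
 [ 40%]  Time step      360 of 901 @ t_step = 359
Warning: ieee_inexact is signaling
ERROR STOP NaN(s) in timestep output.
"""

print(last_step_before_nan(SAMPLE_LOG))
```

Running the same helper on logs from Delta and Phoenix would show whether the failure always lands on the same step.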

One can run this case via something like ./mfc.sh run benchmarks/viscous_weno5_sgb_mono/case.py 4 -t pre_process simulation -c delta --gpu if you are already on a node with GPUs and have loaded the appropriate modules.

sbryngelson avatar Apr 12 '24 18:04 sbryngelson

Update: This issue is associated with parallel_IO = 'T'. I witnessed it again on a Rogues Gallery GH200 chip with NVHPC 24.1.
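To check the parallel I/O association, one could run the same case twice with only that flag changed. The sketch below assumes the case file is a Python script that prints its parameters as a JSON dictionary, and that the key is spelled `parallel_io`; both are assumptions here, and `with_parallel_io` and the trimmed `base_case` are hypothetical names for illustration only.

```python
import json


def with_parallel_io(case, enabled, key="parallel_io"):
    """Return a copy of a case dictionary with the parallel I/O flag set
    to 'T' or 'F'. The key spelling is an assumption; adjust as needed."""
    variant = dict(case)  # shallow copy so the input is left untouched
    variant[key] = "T" if enabled else "F"
    return variant


# Hypothetical, heavily trimmed case dictionary. A real case file would
# carry many more entries; they are copied through unchanged.
base_case = {"parallel_io": "T"}

case_on = with_parallel_io(base_case, True)
case_off = with_parallel_io(base_case, False)

# Emit each variant as JSON (assumed to be how a case file reports its
# parameters).
print(json.dumps(case_on))
print(json.dumps(case_off))
```

If the NaNs appear only for the 'T' variant on the same hardware and compiler, that isolates the flag from the other differences between machines.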

sbryngelson avatar Apr 17 '24 11:04 sbryngelson

I'm not sure if this is still "broken" or not.

sbryngelson avatar May 22 '24 17:05 sbryngelson

Update: This is still broken. It is related to PR #425.

Update 2: This does not fail when case optimization is disabled. It only fails with case optimization enabled (on non-Phoenix computers).

I get the feeling that this line is not actually invoking case optimization....

https://github.com/MFlowCode/MFC/blob/4f89f33739da7df6a74151afc7ef89c0f41f2bc9/.github/workflows/phoenix/bench.sh#L12

@henryleberre

Update 3: Update 2 is incorrect, and case optimization is not relevant.

sbryngelson avatar May 24 '24 14:05 sbryngelson

@sbryngelson I'm pretty sure --case-optimization is enabled by default in bench.

henryleberre avatar May 24 '24 15:05 henryleberre

The logs indicate that case optimization is enabled on Phoenix for the benchmarking. The code is recompiled in the cases where I would expect recompilation due to case optimization.

wilfonba avatar May 24 '24 15:05 wilfonba

Never mind, you're both right: it fails with and without case optimization on Delta (and presumably other computers).

sbryngelson avatar May 24 '24 15:05 sbryngelson