RadBlastWave 3D CPU run crashes with NaNs in density

Open BenWibking opened this issue 1 year ago • 7 comments

I've built the RadBlastWave example on CPU with HYPRE 2.25.0. When running it with:

mpirun ./Castro3d.gnu.MPI.ex inputs.3d

it crashes with, e.g.:

WARNING: (rho e)_l < 0 or pl < small_pres in Riemann: -11512939.84 415859.1253 1e-100
WARNING: (rho e)_l < 0 or pl < small_pres in Riemann: -309777.1159 415727.3586 1e-100
WARNING: (rho e)_l < 0 or pl < small_pres in Riemann: -920856.9909 415736.9506 1e-100
WARNING: (rho e)_l < 0 or pl < small_pres in Riemann: -309775.8813 415726.8159 1e-100
Radiation f-space advection on level 2 takes as many as 1 substep.
... Leaving construct_ctu_hydro_source()

Castro::construct_ctu_hydro_source() time = 0.007451716

amrex::amrex::Abort::23::State has NaNs in the density component::check_for_nan() !!!
amrex::Abort::30::State has NaNs in the density component::check_for_nan() !!!
amrex::Abort::38::State has NaNs in the density component::check_for_nan() !!!
amrex::Abort::43::State has NaNs in the density component::check_for_nan() !!!
amrex::Abort::47::State has NaNs in the density component::check_for_nan() !!!
SIGABRT
SIGABRT
SIGABRT
SIGABRT
SIGABRT
Abort::6::State has NaNs in the density component::check_for_nan() !!!
SIGABRT
See Backtrace.23 file for details
See Backtrace.30 file for details
See Backtrace.6 file for details
See Backtrace.43 file for details
See Backtrace.47 file for details
See Backtrace.38 file for details

BenWibking avatar Aug 07 '22 23:08 BenWibking

How many steps does it take to reach this point?

zingale avatar Aug 08 '22 00:08 zingale

It fails on coarse step 2. Here is the log file: run_3d.log

BenWibking avatar Aug 08 '22 01:08 BenWibking

Interesting. I just ran for 50 steps (in debug mode with amrex.fpe_trap_invalid=1) without issue, so I am not sure what's happening here.

zingale avatar Aug 08 '22 16:08 zingale

That's odd. Is there any clue in the jobinfo file? job_info.txt

BenWibking avatar Aug 09 '22 01:08 BenWibking

Ok, I've rerun with the same options and I get the same crash at coarse step 2, but with an extra Erroneous arithmetic operation message:

[Level 2 step 10] ADVANCE with dt = 0.9963133427

  Beginning subcycle 1 starting at time 8.40653503 with dt = 0.9963133427
  Estimated number of subcycles remaining: 1

...estimated hydro-limited timestep at level 2: 89.19856856
Castro::estTimeStep (hydro-limited) at level 2:  estdt = 89.19856856

... Entering construct_ctu_hydro_source()

Erroneous arithmetic operation
Erroneous arithmetic operation
Erroneous arithmetic operation
Erroneous arithmetic operation
Erroneous arithmetic operation
Erroneous arithmetic operation
WARNING: (rho e)_l < 0 or pl < small_pres in Riemann: -7888100.054 416074.0202 1e-100
WARNING: (rho e)_l < 0 or pl < small_pres in Riemann: -9355602.101 416915.6965 1e-100
WARNING: (rho e)_l < 0 or pl < small_pres in Riemann: -9363383.092 416906.7205 1e-100

etc.

BenWibking avatar Aug 09 '22 01:08 BenWibking

Here's the full log and jobinfo: run_3d_debug_trap.log job_info.txt

BenWibking avatar Aug 09 '22 01:08 BenWibking

This is because the radiation energy is negative on level 2. (So the NaN comes from doing something like sqrt(q(QRAD)) in ctoprim.) The negative radiation energy comes from the Hypre solve, so it will presumably require some investigation to figure out why the algorithm gets into this state.
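
For readers following along, here is a minimal standalone sketch of that failure mode. It is not Castro code: `first_negative`, `unguarded_sqrt`, and `checked_sqrt` are hypothetical helpers, and a plain `std::vector<double>` stands in for the radiation energy on a grid. It only illustrates that a square root of a negative energy produces a NaN under IEEE 754 arithmetic, and how one might scan for the offending zone before the conversion to primitive variables.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Castro or AMReX): return the index of the
// first negative entry in a radiation-energy array, or -1 if none is negative.
long first_negative(const std::vector<double>& erad) {
    for (std::size_t i = 0; i < erad.size(); ++i) {
        if (erad[i] < 0.0) {
            return static_cast<long>(i);
        }
    }
    return -1;
}

// The unguarded operation: std::sqrt of a negative argument returns NaN
// (and raises the floating-point "invalid" exception flag).
double unguarded_sqrt(double e) {
    return std::sqrt(e);
}

// A guarded variant that fails loudly in debug builds instead of letting a
// NaN propagate silently into the state.
double checked_sqrt(double e) {
    assert(e >= 0.0 && "negative radiation energy passed to sqrt");
    return std::sqrt(e);
}
```

Calling `first_negative` on the radiation energy right after the implicit solve would point at the first bad zone, whereas `unguarded_sqrt(-1.0)` shows how the NaN later appears far from its cause.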

maxpkatz avatar Aug 09 '22 23:08 maxpkatz

This seems to be working now with hypre 2.26. It might be a good idea to avoid 2.25.

BenWibking avatar Nov 04 '22 00:11 BenWibking