
bug: fix bounds fix_bound_violations = .true. seems to be required for ifort

Open hkershaw-brown opened this issue 6 months ago • 3 comments


Describe the bug

  1. List the steps someone needs to take to reproduce the bug.

Working directory: /glade/derecho/scratch/hkershaw/DART/Bugs/bgunn_qceff/DART/models/lorenz_96_tracer_advection/work

Following https://github.com/NCAR/DART/blob/l96_tracer_tests/models/lorenz_96_tracer_advection/work/TESTS/TEST_DRIVER.csh, as reported by Ben Gunn (thanks @Benjamin-Gunn!): https://github.com/Benjamin-Gunn/DART/blob/l96_tracer_tests/models/lorenz_96_tracer_advection/work/TESTS/TEST_DRIVER.csh

qceff_table_filename = 'one_below_qceff_table.csv'

&filter_nml inf_flavor = 5, 5,

&model_nml
   model_size = 120,
   forcing = 8.0,
   delta_t = 0.05,
   mean_velocity = 0.0,
   pert_velocity_multiplier = 5.0,
   diffusion_coef = 0.0,
   e_folding = 0.25,
   sink_rate = 0.1,
   source_rate = 100.0,
   point_tracer_source_rate = 5.0,
   positive_tracer = .false.,
   bound_above_is_one = .true.,
   time_step_days = 0,
   time_step_seconds = 3600,
/

  2. What was the expected outcome? I did not expect fix_bound_violations = .true. to be required so often.

  3. What actually happened?
    Failures with "Ensemble member greater than upper bound first check" at various PE counts.

You can set:

&probit_transform_nml fix_bound_violations = .true. /

However, you still get different answers across MPI task counts.

#!/bin/bash

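module load nco

# Start from a clean temporary file
rm -f one_var_temp.nc
# Extract a single location from filter_output.nc
ncrcat -d location,1,1 filter_output.nc one_var_temp.nc
# Append the third-from-last line of the state_variable_mean dump to test_output
ncks -V -C -v state_variable_mean one_var_temp.nc | tail -3 | head -1 >> test_output
rm -f one_var_temp.nc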

Varying PE count: 7.95979093017264 ; 8.02126025256388 ; 8.55748257662756 ;

Varying PE count with -fp-model precise: 8.62082489125036 ; 8.62082489125036 ; 8.62082489125036 ;

I am not sure how much difference is acceptable as the PE count varies. Note: I cannot reproduce the bounds violations with -fp-model precise.
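One common reason answers change with task count is that floating-point addition is not associative, so partial sums combined in a different order can differ in the last bits, and a chaotic model such as Lorenz 96 can amplify that difference over time. This is an assumption, not a diagnosis of this bug. A minimal Python sketch, with made-up values and hypothetical helpers (`naive_sum`, `chunked_sum`) standing in for a per-task reduction:

```python
import math

def naive_sum(values):
    """Left-to-right floating-point sum with no compensation."""
    total = 0.0
    for v in values:
        total += v
    return total

def chunked_sum(values, n_chunks):
    """Sum values as n_chunks partial sums, then combine the partials.

    Hypothetical stand-in for a reduction split across n_chunks tasks.
    """
    size = math.ceil(len(values) / n_chunks)
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)

# Made-up values chosen to expose rounding; not taken from the model output.
values = [1.0e16, 1.0, -1.0e16, 1.0]

one_task = chunked_sum(values, 1)   # all values summed in a single pass
two_tasks = chunked_sum(values, 2)  # two partial sums, then combined
exact = math.fsum(values)           # correctly rounded reference sum

print(one_task, two_tasks, exact)
```

The three results differ even though the inputs are identical, which is the kind of order dependence a stricter floating-point model is intended to limit.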

Todo @hkershaw: compare intel/2024.0.2, ifx, and gfortran.

Error Message

3 MPI tasks (also happens with 8, 7 (without post_inf), and 40 (without post_inf)):

 PE 0: comp_cov_factor: Standard Gaspari Cohn localization selected
 ERROR FROM:
  source : bnrh_distribution_mod.f90
  routine: bnrh_cdf_initialized
  message:  Ensemble member greater than upper bound first check(see code)   1.00000000000000        1.00000000000000
 
MPICH ERROR [Rank 0] [job id e35a8d7d-258f-45c5-8d80-ba05433b0be5] [Tue Aug  6 12:24:05 2024] [dec0508] - Abort(99) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000002, 99) - process 0

 ERROR FROM:
  source : bnrh_distribution_mod.f90
  routine: bnrh_cdf_initialized
  message:  Ensemble member greater than upper bound first check(see code)   1.00000000000000        1.00000000000000
 
MPICH ERROR [Rank 1] [job id e35a8d7d-258f-45c5-8d80-ba05433b0be5] [Tue Aug  6 12:24:05 2024] [dec0508] - Abort(99) (rank 1 in comm 496): application called MPI_Abort(comm=0x84000001, 99) - process 1

 ERROR FROM:
  source : bnrh_distribution_mod.f90
  routine: bnrh_cdf_initialized
  message:  Ensemble member greater than upper bound first check(see code)   1.00000000000000        1.00000000000000
 
MPICH ERROR [Rank 2] [job id e35a8d7d-258f-45c5-8d80-ba05433b0be5] [Tue Aug  6 12:24:05 2024] [dec0508] - Abort(99) (rank 2 in comm 496): application called MPI_Abort(comm=0x84000001, 99) - process 2

Here is the code: https://github.com/NCAR/DART/blob/75cf8dc9c566221f624ffd4d5eeba9fde5a1757c/assimilation_code/modules/assimilation/bnrh_distribution_mod.f90#L292-L300
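The error message prints both the ensemble member and the bound as 1.00000000000000, which suggests (an assumption, not confirmed from the linked code) that the member exceeds the bound by less than the printed precision. The Python sketch below shows how a value one unit in the last place above 1.0 fails a strict greater-than check yet prints identically at 14 decimal places. `clamp_to_upper` is a hypothetical helper illustrating the kind of correction that fix_bound_violations = .true. presumably applies; it is not DART's implementation.

```python
def clamp_to_upper(x, upper):
    """Hypothetical fix: pull a value that overshoots the bound back onto it."""
    return min(x, upper)

upper_bound = 1.0
# Smallest double greater than 1.0 (1 + machine epsilon); an illustrative value.
member = 1.0 + 2.0 ** -52

exceeds = member > upper_bound          # strict comparison sees the overshoot
printed_member = f"{member:.14f}"       # 14 decimals, as in the error message
printed_bound = f"{upper_bound:.14f}"
fixed = clamp_to_upper(member, upper_bound)

print(exceeds, printed_member, printed_bound, fixed)
```

If this is what happens here, printing the difference between the member and the bound, or using more digits in the message, would show the size of the overshoot.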

Which model(s) are you working with?

lorenz_96_tracer_advection

/glade/derecho/scratch/hkershaw/DART/Bugs/bgunn_qceff/DART/models/lorenz_96_tracer_advection/work

Version of DART

v11.5.1

Have you modified the DART code?

No

Build information

Please describe:

  1. Derecho
  2. ifort (IFORT) 2021.10.0 20230609

hkershaw-brown, Aug 06 '24 19:08