
horiz_interp_conserve_mod:no latitude index found

Open eocene opened this issue 5 years ago • 14 comments

Hi all,

I have been running Isca on a machine that recently had new nodes installed. Before the new nodes, all was fine. Now, when running, I get a fatal error from all PEs like the one below. It's hard to report this to the sysadmins without a specific request (I suspect something wasn't done when the new nodes were installed, but I could easily be wrong). Before I go digging into the interpolation module that triggers the error, I just wanted to check with you whether you had seen this before and/or had an idea what the trigger might be.

Thank you very much in advance for any help,

2019-01-28 10:23:49,905 - isca - DEBUG - FATAL from PE 0: horiz_interp_conserve_mod:no latitude index found: n,sph= 1 NaN
2019-01-28 10:23:49,905 - isca - DEBUG -
2019-01-28 10:23:49,905 - isca - DEBUG - application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

eocene avatar Jan 28 '19 07:01 eocene

Hi @eocene. I would imagine that the new nodes are using a different version of MPI, or something like that. So that I can help more, could you tell us what kind of model you are running? I.e. are you using grey radiation / RRTM / Held-Suarez etc? I would say that your error here is almost certainly not a problem with the interpolation routine, but is a symptom of some other problem. (I would also advise against doing too much digging in the interpolation routine. It's very long and not that easy to read!)

sit23 avatar Jan 28 '19 17:01 sit23

Hi Stephen, thank you very much indeed for the (quick) response!

I've checked the MPI versions and apparently they are all the same. I absolutely agree with the symptom-not-cause diagnosis. For completeness, I have run various test cases, and mysteriously they all fail differently. All of these were recompiled and tried with various PE counts etc. (Another thing I've tried is increasing "num_iters" in horiz_interp_conserve.F90, based on reading the code/docs, but to no avail.)

I'm sure that there is something amiss/stupid/negligent on my end, but since the root cause seems to be a bit obscure at the moment any hints really would be very much appreciated! Thank you very much indeed.

axisymmetric fails exactly like my personal configuration:

2019-01-29 12:34:06,399 - isca - DEBUG - NOTE from PE 0: MPP_IO_SET_STACK_SIZE: stack size set to 131072.
2019-01-29 12:34:06,402 - isca - DEBUG - NOTE from PE 0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to 600000.
2019-01-29 12:34:06,410 - isca - DEBUG - starting 1 OpenMP threads per MPI-task
2019-01-29 12:34:06,410 - isca - DEBUG - ATMOS MODEL DOMAIN DECOMPOSITION
2019-01-29 12:34:06,410 - isca - DEBUG - X-AXIS = 128
2019-01-29 12:34:06,411 - isca - DEBUG - Y-AXIS = 8 8 8 8 8 8 8 8
2019-01-29 12:34:06,425 - isca - DEBUG - mean surface pressure= NaN mb
2019-01-29 12:34:06,437 - isca - DEBUG - NOTE from PE 0: idealized_moist_phys: Using Frierson Quasi-Equilibrium convection scheme.
2019-01-29 12:34:06,445 - isca - DEBUG - NOTE from PE 0: interpolator_mod :sn_1.000_sst.nc is a year-independent climatology file
2019-01-29 12:34:06,446 - isca - DEBUG -
2019-01-29 12:34:06,446 - isca - DEBUG - FATAL from PE 1: horiz_interp_conserve_mod:no latitude index found: n,sph= 1 NaN
2019-01-29 12:34:06,446 - isca - DEBUG -

Held-Suarez fails on a segmentation fault:

2019-01-29 12:07:35,307 - isca - DEBUG - /
2019-01-29 12:07:35,308 - isca - DEBUG - NOTE: MPP_IO_SET_STACK_SIZE: stack size set to 131072.
2019-01-29 12:07:35,310 - isca - DEBUG - NOTE: MPP_DOMAINS_SET_STACK_SIZE: stack size set to 600000.
2019-01-29 12:07:35,316 - isca - DEBUG - starting 1 OpenMP threads per MPI-task
2019-01-29 12:07:35,316 - isca - DEBUG - ATMOS MODEL DOMAIN DECOMPOSITION
2019-01-29 12:07:35,316 - isca - DEBUG - X-AXIS = 128
2019-01-29 12:07:35,316 - isca - DEBUG - Y-AXIS = 64
2019-01-29 12:07:35,376 - isca - DEBUG - mean surface pressure= NaN mb
2019-01-29 12:07:35,528 - isca - DEBUG - forrtl: severe (174): SIGSEGV, segmentation fault occurred
2019-01-29 12:07:35,528 - isca - DEBUG - Image PC Routine Line Source
2019-01-29 12:07:35,528 - isca - DEBUG - libintlc.so.5 00002AB0523DABF1 tbk_trace_stack_i Unknown Unknown
2019-01-29 12:07:35,528 - isca - DEBUG - libintlc.so.5 00002AB0523D8D2B tbk_string_stack_ Unknown Unknown
2019-01-29 12:07:35,528 - isca - DEBUG - libifcoremt.so.5 00002AB050A22AC2 Unknown Unknown Unknown
2019-01-29 12:07:35,528 - isca - DEBUG - libifcoremt.so.5 00002AB050A22916 tbk_stack_trace Unknown Unknown
2019-01-29 12:07:35,528 - isca - DEBUG - libifcoremt.so.5 00002AB05097BAB0 for__issue_diagno Unknown Unknown
2019-01-29 12:07:35,528 - isca - DEBUG - libifcoremt.so.5 00002AB05098D658 for__signal_handl Unknown Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - libpthread-2.17.s 00002AB0505005E0 Unknown Unknown Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x 00000000006C4EEC Unknown Unknown Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x 00000000006BFBA7 Unknown Unknown Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x 00000000006BD426 Unknown Unknown Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x 000000000045197C Unknown Unknown Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x 0000000000411B40 Unknown Unknown Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x 0000000000468D75 Unknown Unknown Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x 0000000000907BEF Unknown Unknown Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x 000000000040520E Unknown Unknown Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - libc-2.17.so 00002AB05264FC05 __libc_start_main Unknown Unknown
2019-01-29 12:07:35,529 - isca - DEBUG - held_suarez.x 0000000000405119 Unknown Unknown Unknown

and Realistic-Continents fails on 'regularize: Failure to converge':

2019-01-29 12:24:46,273 - isca - DEBUG - NOTE from PE 0: MPP_IO_SET_STACK_SIZE: stack size set to 131072.
2019-01-29 12:24:46,277 - isca - DEBUG - NOTE from PE 0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to 600000.
2019-01-29 12:24:46,286 - isca - DEBUG - starting 1 OpenMP threads per MPI-task
2019-01-29 12:24:46,286 - isca - DEBUG - ATMOS MODEL DOMAIN DECOMPOSITION
2019-01-29 12:24:46,286 - isca - DEBUG - X-AXIS = 128
2019-01-29 12:24:46,287 - isca - DEBUG - Y-AXIS = 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
2019-01-29 12:24:46,459 - isca - DEBUG -
2019-01-29 12:24:46,460 - isca - DEBUG - FATAL from PE 1: regularize: Failure to converge
2019-01-29 12:24:46,460 - isca - DEBUG -
2019-01-29 12:24:46,460 - isca - DEBUG - application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
2019-01-29 12:24:46,460 - isca - DEBUG -
2019-01-29 12:24:46,460 - isca - DEBUG - FATAL from PE 2: regularize: Failure to converge
2019-01-29 12:24:46,460 - isca - DEBUG -
2019-01-29 12:24:46,460 - isca - DEBUG - application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
2019-01-29 12:24:46,460 - isca - DEBUG -
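Aside: the `mean surface pressure= NaN mb` lines in the logs above suggest the model may be reading bad data at start-up. One quick, generic sanity check is to scan input/restart NetCDF variables for NaNs before launching. The sketch below is not part of Isca: `check_for_nans` is a hypothetical helper, the scan logic is demonstrated on plain NumPy arrays, and the NetCDF loading is left as a comment since it depends on what is installed on your system.

```python
import numpy as np

def check_for_nans(name, arr):
    """Report whether an array contains NaNs; return the NaN count."""
    n_bad = int(np.isnan(arr).sum())
    if n_bad:
        print(f"{name}: {n_bad} NaN value(s) out of {arr.size}")
    return n_bad

# In practice, loop over each boundary/restart file, e.g. with netCDF4:
#   from netCDF4 import Dataset
#   with Dataset("sn_1.000_sst.nc") as ds:          # file name from the log above
#       for name, var in ds.variables.items():
#           check_for_nans(name, np.asarray(var[:], dtype=float))

# Demonstration on synthetic fields:
sst = np.full((64, 128), 285.0)   # healthy field, no NaNs
ps = np.full((64, 128), 1000.0)
ps[10, 20] = np.nan               # one corrupted value
print(check_for_nans("sst", sst), check_for_nans("ps", ps))
```

If a boundary file does contain NaNs, that would point at a data-transfer or filesystem problem on the new nodes rather than at the interpolation code.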

eocene avatar Jan 29 '19 09:01 eocene

I obtained a similar error using the realistic-continents case, resulting in a crash with the message: "regularize: Failure to converge".

AlexAudette avatar Mar 10 '20 02:03 AlexAudette

Strange - @eocene did you end up getting a handle on this problem?

sit23 avatar Mar 10 '20 09:03 sit23

@AlexAudette - do you also find other test cases to be failing, or is it just the realistic continents one?

sit23 avatar Mar 10 '20 09:03 sit23

@sit23 So far it is only the realistic-continents case. I am able to run my simulation at T42 using the era_land_T42.nc land-mask file, but when I create my own at T85, I get the same error as eocene.
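When building a land-mask file at a new truncation, one thing worth ruling out is a grid-size mismatch. As a rough sketch (the power-of-two rounding rule below is an assumption that matches the standard T21/T42/T85 grids, not a guarantee for every configuration), the alias-free grid for triangular truncation T needs at least 3T+1 longitudes:

```python
import math

def expected_grid(truncation):
    """Rule of thumb: nlon is the smallest power of two with
    nlon >= 3*T + 1 (alias-free condition); nlat = nlon / 2."""
    nlon = 2 ** math.ceil(math.log2(3 * truncation + 1))
    return nlon, nlon // 2

print(expected_grid(42))   # T42 -> (128, 64)
print(expected_grid(85))   # T85 -> (256, 128)
```

Comparing these sizes against `ncdump -h` on the new land-mask file is a quick way to confirm the file really is on the T85 grid.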

AlexAudette avatar Mar 10 '20 13:03 AlexAudette

@AlexAudette OK - that's a slightly different problem, which we have encountered ourselves. The background is that when you put data like topography into the spectral dynamical core, the spikiness of the data and the finite number of Fourier modes mean that you get Gibbs ripples etc. in the topography. To help counter this, the model automatically smooths the incoming topography, which reduces the size of the ripples. The degree of smoothing is controlled by the parameter ocean_topog_smoothing in the spectral_dynamics_nml. The parameter is a measure of the smoothness of the topography, with higher values meaning smoother topography, and a smoothing method is applied recursively until the incoming topography is as smooth as the parameter dictates. When you change resolution, though, it's possible that the smoothing algorithm cannot make the topography as smooth as the parameter dictates, so it fails to converge, as per the error message you are seeing. To sort this out, you can reduce the ocean_topog_smoothing parameter. That way you should find that the regularisation converges, and the model will stop giving you that error message.
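As a sketch, the option lives in the spectral_dynamics_nml namelist; the value 0.4 below is just an illustrative reduction from the default direction of travel, not a recommendation:

```
&spectral_dynamics_nml
    ocean_topog_smoothing = 0.4
/
```

If you are driving Isca from its Python front-end, the equivalent change is typically made on the experiment's namelist object, e.g. something like `exp.namelist['spectral_dynamics_nml']['ocean_topog_smoothing'] = 0.4` (exact attribute names may differ in your version).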

sit23 avatar Mar 10 '20 13:03 sit23

@sit23 Thanks for your answer. I tried reducing the ocean_topog_smoothing parameter from 0.8 down to 0.05 in increments of 0.2, but with no success; I still get the same error:

2020-03-10 09:28:42,216 - isca - INFO - process running as 110162
2020-03-10 09:28:42,386 - isca - DEBUG - loadmodules for niagara machines
2020-03-10 09:28:42,470 - isca - DEBUG - The following modules were not unloaded:
2020-03-10 09:28:42,470 - isca - DEBUG - (Use "module --force purge" to unload all):
2020-03-10 09:28:42,470 - isca - DEBUG -
2020-03-10 09:28:42,470 - isca - DEBUG - 1) NiaEnv/2018a
2020-03-10 09:28:43,401 - isca - DEBUG - NOTE from PE 0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to 32768.
2020-03-10 09:28:43,401 - isca - DEBUG - &MPP_IO_NML
2020-03-10 09:28:43,401 - isca - DEBUG - HEADER_BUFFER_VAL = 16384,
2020-03-10 09:28:43,401 - isca - DEBUG - GLOBAL_FIELD_ON_ROOT_PE = T,
2020-03-10 09:28:43,401 - isca - DEBUG - IO_CLOCKS_ON = F,
2020-03-10 09:28:43,401 - isca - DEBUG - SHUFFLE = 0,
2020-03-10 09:28:43,401 - isca - DEBUG - DEFLATE_LEVEL = -1
2020-03-10 09:28:43,401 - isca - DEBUG - /
2020-03-10 09:28:43,405 - isca - DEBUG - NOTE from PE 0: MPP_IO_SET_STACK_SIZE: stack size set to 131072.
2020-03-10 09:28:43,407 - isca - DEBUG - NOTE from PE 0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to 600000.
2020-03-10 09:28:43,411 - isca - DEBUG - starting 1 OpenMP threads per MPI-task
2020-03-10 09:28:43,412 - isca - DEBUG - ATMOS MODEL DOMAIN DECOMPOSITION
2020-03-10 09:28:43,412 - isca - DEBUG - X-AXIS = 256
2020-03-10 09:28:43,412 - isca - DEBUG - Y-AXIS = 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
2020-03-10 09:28:43,910 - isca - DEBUG -
2020-03-10 09:28:43,911 - isca - DEBUG - FATAL from PE 15: regularize: Failure to converge
2020-03-10 09:28:43,911 - isca - DEBUG - ...
2020-03-10 09:28:43,912 - isca - DEBUG -
2020-03-10 09:28:43,912 - isca - DEBUG - FATAL from PE 0: regularize: Failure to converge
2020-03-10 09:28:43,912 - isca - DEBUG -
2020-03-10 09:28:43,912 - isca - DEBUG - --------------------------------------------------------------------------
2020-03-10 09:28:43,912 - isca - DEBUG - MPI_ABORT was invoked on rank 14 in communicator MPI_COMM_WORLD
2020-03-10 09:28:43,912 - isca - DEBUG - with errorcode 1.
2020-03-10 09:28:43,912 - isca - DEBUG -
2020-03-10 09:28:43,912 - isca - DEBUG - NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
2020-03-10 09:28:43,912 - isca - DEBUG - You may or may not see output from other processes, depending on
2020-03-10 09:28:43,913 - isca - DEBUG - exactly when Open MPI kills them.
2020-03-10 09:28:43,913 - isca - DEBUG - --------------------------------------------------------------------------

AlexAudette avatar Mar 10 '20 13:03 AlexAudette

OK - could you try setting it to 0? That should turn off the regularisation, and we can see if it runs then or not. You could also try increasing the parameter, just in case I've mis-remembered the way you need to go!

sit23 avatar Mar 10 '20 13:03 sit23

So it runs now with the parameter set to 0, thank you very much. I also tried increasing the parameter to 0.96, and it still crashed at the same place. I will keep an eye out for truncation effects. Thanks again!

AlexAudette avatar Mar 10 '20 13:03 AlexAudette

OK - you will probably find that the Gibbs ripples are significant without any smoothing. You'll see them particularly in the vertical velocity and the precipitation. When we've run with topography at T85 we have managed to use the smoothing, but I can't quite lay my hands on the smoothing parameter we used. I'll let you know if I find it. We are working on alternatives to this smoothing algorithm, which should be available soon.
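The Gibbs effect discussed above is easy to see in one dimension: truncating the Fourier series of a step function leaves an overshoot of roughly 9% of the jump near the discontinuity, no matter how many modes you keep. A minimal self-contained illustration (unrelated to Isca's actual smoothing code):

```python
import numpy as np

# Partial Fourier sum of a square wave: the ripples near the jump
# narrow as modes are added, but the overshoot does not shrink.
x = np.linspace(-np.pi, np.pi, 20001)

def partial_sum(n_modes):
    s = np.zeros_like(x)
    for k in range(1, n_modes + 1, 2):      # odd harmonics of a square wave
        s += (4 / np.pi) * np.sin(k * x) / k
    return s

for n in (21, 85, 341):
    overshoot = partial_sum(n).max() - 1.0  # square wave has amplitude 1
    # overshoot stays near 0.18 (about 9% of the jump of 2) for any n
    print(f"{n:4d} modes: overshoot = {overshoot:.3f}")
```

This is why simply truncating spiky topography at T85 leaves ripples, and why some form of smoothing (or an alternative regularisation) is needed.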

sit23 avatar Mar 10 '20 13:03 sit23

Just found it - looks like I tried 0.85 for the smoothing parameter and it worked with T85 topography.

sit23 avatar Mar 10 '20 13:03 sit23

Interesting, I just tried with this same value and it still fails to regularize. Did you do anything special with your topography file?

AlexAudette avatar Mar 10 '20 14:03 AlexAudette

Well, you're welcome to try the T85 topography file that I used and see if it works for you. You can find it here: https://drive.google.com/file/d/1lsYsVE1pIDxOC_CV4SDJu8oxUxmgQ0za/view?usp=sharing

sit23 avatar Mar 10 '20 14:03 sit23