NorESM2 integrity checks
Potentially linked to issue #55
This issue attempts to pull together various problems experienced so far with NorESM2 integrations:
- occasional crashes due to NaNs generated during execution: machine dependent, difficult to reproduce, cause uncertain; possibly more frequent with the NF2000climo compset; rare or absent in CMIP6 compsets; possible origin in the Oslo chemistry solver; more generally, uninitialised non-local arrays are also a possible cause
- crashes and bit-for-bit (bfb) non-reproducibility in CMIP6 compsets when not using "frc2" emission files; possible race condition in open/close statements using the same unit number when reading forcing files
- uncertainty on bfb reproducibility of CAM-Nor runs under different PE counts
Proposed tests (all to be run without "airbag" mods):
- Thomas: follow up on NF2000climo problems from tetralith users, collect case reports; test bfb reproducibility of NFHISTfsstfrc2 case with different PE counts on tetralith
- Øyvind: test F2000climo (CESM2) and NF2000climo (NorESM2) runs with different PE counts on vilje for crashes and for bfb reproducibility
- Alok: run long (~100 years) NF2000climo integrations on vilje and fram checking for crashes
- Ada: test NF2000climo run on Nebula
- Dirk: test two parallel, identical N1850 cases on fram to check for divergence (~200 years); using older, non-"frc2" input file set (it is assumed that "frc2" N1850 cases are reproducible bfb based on previous tests)
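For the bfb checks in the list above, a minimal sketch of a byte-level comparison between two run directories could look like this. It assumes that both runs write files with the same names (in practice the case name is usually part of the file name, so a renaming step would be needed) and that the files carry no run-specific metadata; otherwise a field-by-field netCDF comparison is needed instead. The function names and the `*.nc` pattern are illustrative, not part of NorESM or CIME.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_runs(dir_a, dir_b, pattern="*.nc"):
    """Compare same-named files in two directories.

    Returns a dict mapping file name -> True (identical bytes),
    False (different bytes) or None (file missing in one directory).
    """
    names_a = {p.name for p in Path(dir_a).glob(pattern)}
    names_b = {p.name for p in Path(dir_b).glob(pattern)}
    result = {}
    for name in sorted(names_a | names_b):
        if name in names_a and name in names_b:
            same = file_digest(Path(dir_a) / name) == file_digest(Path(dir_b) / name)
            result[name] = same
        else:
            result[name] = None
    return result
```

Any `False` entry means the two runs are not bfb for that file; `None` entries usually mean that one run stopped earlier than the other.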
Dirk will also set up an explicit "zero forcing" volcanic case for Jean Iaquinta hoping that this will be useful for his current experiment.
Dear Ada, et alii
Attached are the only two necessary "airbag" sourcemods.
Ada: can you please send me the Macros.make you use on Nebula? I would like to compare it with mine on tetralith.
Cheers Thomas
On 2020-05-14 09:53, Ada Gjermundsen wrote:
Dear Thomas, My nebula test crashed after 3 minutes. Please see the GitHub report: https://github.com/NorESMhub/NorESM/issues/79
I would like to test a NF2000climo case on nebula with your airbag. Can you please send me the source mods? I can probably find a branch where you have submitted the code, but then I'm worried there are other changes as well, so to be on the safe side I would like to add them as source mods or user namelist settings. I hope it's not too much trouble for you.
Best, Ada
Thanks Thomas! Please find the Macros.make file attached.
Best, Ada
Thank you, Ada.
Tetralith's been put offline due to a security scare, so I can't check my own Macros.make; from memory I can't see a difference in flags, except perhaps a -check uninit that is also set under ifeq ($(DEBUG),FALSE), but I can't be sure.
However I've got a Makefile from an e-mail exchange with a user on tetralith. She was using these additional flags both for CFLAGS and for FFLAGS: -xCORE-AVX2 -fPIC -mcmodel=large -no-fma
Could you try these flags, without airbag, and see if they make a difference?
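To compare two Macros.make files for flag differences like the ones discussed here, a rough sketch is given below. It assumes simple `VAR := ...` / `VAR += ...` assignments and deliberately ignores conditionals, so a flag set only inside an `ifeq` branch is still listed. The function names are hypothetical helpers, not part of CIME.

```python
import re

# Matches make assignments such as "FFLAGS := -O2" or "CFLAGS += -fPIC".
ASSIGN_RE = re.compile(r"^\s*(\w+)\s*(\+=|:=|=)\s*(.*)$")


def collect_flags(text, variables=("FFLAGS", "CFLAGS")):
    """Collect every flag token assigned to the given make variables.

    Conditionals (ifeq/endif) are ignored; self-references such as
    $(FFLAGS) are dropped. Multi-word options are split into tokens.
    """
    flags = {name: set() for name in variables}
    for line in text.splitlines():
        match = ASSIGN_RE.match(line.split("#", 1)[0])  # strip comments
        if not match or match.group(1) not in flags:
            continue
        for token in match.group(3).split():
            if not token.startswith("$("):
                flags[match.group(1)].add(token)
    return flags


def diff_flags(text_a, text_b, variables=("FFLAGS", "CFLAGS")):
    """Return {variable: (only_in_a, only_in_b)} as sorted lists."""
    a = collect_flags(text_a, variables)
    b = collect_flags(text_b, variables)
    return {v: (sorted(a[v] - b[v]), sorted(b[v] - a[v])) for v in variables}
```

Called as `diff_flags(open("Macros.make.nebula").read(), open("Macros.make.tetralith").read())` (file names are placeholders), it lists the flags present on only one of the two machines.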
Ok, I will check.
Ada
Hi Ada. Thanks for all this, but I find it a bit hard to follow: lots of details, not all of them useful, while some important ones seem to be missing, e.g. where you are using the airbag mods and where not, which machine and which PE count. It's not even clear to me at the moment how many experiments you ran.
So: can you please summarise, listing the tests you ran one by one: compset, compiler options if non-standard, sourcemods, machine, crash (not due to time limit) or no crash (or time limit), with or without error message and, if so, which error (e.g. NaN), and the walltime and model time before the stop or crash.
For tests that crashed (not due to time limit), it might be better to send us the entire relevant logs as e-mail attachments instead of quoting them here. (Shame GitHub does not allow attachments, apparently.) Thanks.
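One possible way to keep such a per-test summary consistent between people is a small record type, sketched below. The field names simply mirror the items requested above and are not a CIME or NorESM convention; unknown entries are printed as "?".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    """One test run; field names are illustrative only."""
    compset: str
    machine: str
    pe_layout: str
    airbag: bool
    crashed: bool
    extra_flags: str = ""
    error: Optional[str] = None
    model_time: Optional[str] = None
    walltime: Optional[str] = None


def summarise(runs):
    """Return one plain-text line per run, with '?' for unknown entries."""
    lines = []
    for r in runs:
        lines.append(" | ".join([
            r.compset,
            r.machine,
            r.pe_layout,
            "airbag" if r.airbag else "no airbag",
            r.extra_flags or "default flags",
            ("crash: " + (r.error or "?")) if r.crashed else "no crash",
            "model time " + (r.model_time or "?"),
            "walltime " + (r.walltime or "?"),
        ]))
    return "\n".join(lines)
```

For example, `summarise([RunRecord("NF2000climo", "nebula", "8 nodes", airbag=False, crashed=True, error="NaN")])` gives a single line that can be pasted into the issue.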
Dear Ada
thanks. Did, or can you try this with airbag? (the NFHISTnorbc_f19_f19_20200514 test)
On 2020-05-14 15:47, adagj wrote:
Test of NFHISTnorbc
casename: NFHISTnorbc_f19_f19_20200514
compset: NFHISTnorbc
compset longname: HIST_CAM60%NORESM%NORBC_CLM50%BGC-CROP_CICE%PRES_DOCN%DOM_MOSART_SGLC_SWAV
git commit: git branch:* (detached from cime5.6.10_cesm2_1_rel_06-Nor_v1.0.1) 3e5838e
error in cesm.log.1625350.200514-131944:
Reading setup_nml
Reading grid_nml
Reading tracer_nml
Reading thermo_nml
Reading dynamics_nml
Reading shortwave_nml
Reading ponds_nml
Reading snowphys_nml
Reading forcing_nml
Reading zbgc_nml
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
ERROR: component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet 1d global index: 3009
ERROR: component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry 1d global index: 3879
ERROR:
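To pull the NaN reports out of a cesm.log without reading it by eye, a small sketch follows. The regular expression is written against the message format quoted in this log only; other model versions may word the message differently, so treat the pattern as an assumption.

```python
import re

# Pattern derived from the check_fields messages quoted in this thread.
NAN_RE = re.compile(
    r"NaN found in (\w+) instance:\s*(\d+)\s+field\s+(\w+)"
    r"\s+1d global index:\s*(\d+)"
)


def find_nan_reports(log_text):
    """Return a list of dicts, one per 'NaN found' message in the log text."""
    return [
        {
            "component": m.group(1),
            "instance": int(m.group(2)),
            "field": m.group(3),
            "index": int(m.group(4)),
        }
        for m in NAN_RE.finditer(log_text)
    ]
```

Applied to the excerpt above, it returns two reports, for the fields Faxa_bcphiwet and Faxa_bcphidry, both in the ATM component.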
Sure, I can try with airbag. The only simulation I ran with airbag was a NF2000climo case and that was running fine. The NF2000climo case without airbag crashed with NaNs. I'll try to make a new summary, but maybe it is easier if I'll wait for some of you to report your findings and I can make a similar summary.
Hi Ada. I'm not sure I understand what you want to wait for. Your summary would be useful for others who may or may not be trying the same experiments. At the moment it's just too hard to make head or tails of what you've put on github so far.
Hi, I have conducted several tests using NF compsets, all on nebula, all at f19 resolution, all using 8 nodes. Generally the simulations run fine when I use airbag or when I use frc2 emission files. When I run without airbag or frc2, they crash. I have copied all the case folders and all the run folders (it may take some time before the copy completes) to vilje so you can look for whatever useful information you need.
The experiments are located here: /home/ntnu/adagj/nebula_test_exp/ I hope you all have access.
Good luck on further testing.
Hi, Here is update from me:
- Compset: NF2000climo; grid: f19_f19_mg17; no airbag; machines: Vilje and Fram
It is running well so far on both Fram and Vilje. On Fram I ran 1-3 month tests several times without a single crash. I then submitted long runs on both machines; they have completed around 10 years and are still running without a crash. Both are 100-year runs, so they will take time to complete.
- I have checked the code and am continuing to do so with Dirk's help. So far I have not found where these emission files are closed. I also need to understand the interpolation schemes; I will check these.
Should I post updates on GitHub or create a Google Doc?
Also, does anyone remember when the model used to crash? Just after a restart, in the middle of a month, or is it completely random?
sincerely, Alok
Hi Alok, thanks for this. I'm also looking at my tests (identical cases) on tetralith and on fram. All short runs seem to have finished without errors there too. Reproducibility is sensitive to compiler options (perhaps unsurprisingly) and not sensitive to the airbag. So far it seems that only Ada has succeeded in reproducing the earlier crashes. Could you check for reproducibility among your shorter runs? I do wonder how efficient it is to run long integrations.
Your work on the code with Dirk is clearly very important; we should try to understand it. Perhaps a Google Doc would be clearer than the slight mess that comes out of GitHub... I'll update you (all) later. Thanks again & best regards, Thomas
@monsieuralok It seems to me that results are already different after a single time step, which could be linked to an initialization problem.
Ah, sorry, to your question: completely random, in my experience.
Hi Jan (@j34ni ), I will check for reproducibility on both machines and update. It should be related to initialization of some variables for sure.
A little summary from me up to this point. My tests on tetralith, NFPD1.8a and NFPD1.8d, so far indicate that, irrespective of compiler flags or airbag sourcemods, Alok's set of sourcemods, in which variables are consistently initialised, prevents crashes.
I would suggest to everyone to run their tests with these sourcemods from now on. I attach my own version below. (Alok might have additions/updates.) SourceMods_Alok_Init.tar.gz
If with these we can collectively run 50 years or so with NF2000climo, reproducibly and without crashes, I think we can then assume that this issue is resolved. For these tests we should build, on all machines, with (intel) compiler flags -xAVX -no-fma (for reproducibility), and without using the flag -init=zero,arrays (to test for possible further initialisation problems).
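On why `-no-fma` is suggested for reproducibility: one likely reason (an assumption here, the thread does not spell it out) is that a fused multiply-add rounds once, whereas a separate multiply and add round twice, so the last bits of a result can depend on whether the compiler fuses a given expression. The sketch below emulates this in Python with exact rational arithmetic; the values of `a`, `b` and `c` are chosen purely for illustration.

```python
from fractions import Fraction


def unfused(a, b, c):
    """a*b is rounded to a double first, then c is added and rounded again."""
    return a * b + c


def fused(a, b, c):
    """Emulated FMA: a*b + c evaluated exactly, rounded once at the end."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

# The exact product is 1 - 2**-54, which rounds to 1.0 as a double,
# so the unfused result is 0.0 while the fused result is -2**-54.
print(unfused(a, b, c), fused(a, b, c))
```

A difference of this size in a single operation is enough to break bfb agreement between two builds, which fits the observation in this thread that reproducibility is sensitive to compiler options.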
Irrespective of further tests, Alok will soon send a pull request with the sourcemods and an updated config_compilers.xml to the NorESMhub CAM branch cam_cesm2_1_rel_05-Nor. Dirk and Øyvind, could you please review this request? I basically already have, I think. Once this is pushed and the stability/reproducibility tests are done, any remaining airbag residues should finally be removed.
Some information from test with older model version (summer 2019) with N1850 compset on fram :
- continued piControl (N1850_f19_tn14_20190802) after 2100: it ran for 114 years and then stopped in 2215. The crash was not a mid-month crash.
- started a branch run from the simulation above at 2101-01-01 (N1850_f19_tn14_20200513). The results were bit-identical for almost 6 years, but diverged on 2106-11-16, so it was a mid-month divergence.
Hi @DirkOlivie , do you know at what time your model run crashed? I experienced that all my simulations suddenly disappeared last Thursday at 1pm. I resubmitted and they crashed again later on due to I/O problems. It turned out that they were replacing hardware on Fram (it was not mentioned in the ops log):
"We did some hardware replacement on the Fram DDN equipment yesterday, starting around 13:00. This was not supposed to disturb disk operations other than small hang a couple of minutes, but apparently some users have noticed. So that was the explanation. Current status on Fram DDN is that HW was replaced, but we are currently not back in full production with all file servers. We are missing one of eight. We continue to investigate this and further disturbance might occur when we try to put this online again."
Ada
Hi @adagj it crashed also on Thursday (June 4th) at 1pm - so must be related to the issue you mention. Dirk
Hi @DirkOlivie, did you run these simulations with the SourceMods that Thomas mentioned, or without them? We need to think about which tests should be performed for crashes and for reproducibility.
Hi @monsieuralok, these tests were done without Thomas's SourceMods. I think the aim was to see whether the changes to the filesystem on fram (done this spring) would have changed the model behaviour. Apparently they haven't: I find the same type of divergence as before.
As a test, I can use the new code structure + sourceMods and start two N1850 simulations from 2101-01-01 onwards. The aim would be to see whether :
- the new code initially behaves as the old code (same results for the few first months);
- the two new simulations will or will not diverge.
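Both checks in the list above amount to comparing time-ordered outputs of two runs. A minimal sketch, assuming each run has been reduced to a list of (label, checksum) pairs, for instance one pair per restart or history file; how the checksums are produced is left open, and the function names are hypothetical.

```python
def first_divergence(run_a, run_b):
    """Return the label of the first record whose checksums differ.

    run_a, run_b: sequences of (label, checksum) pairs in time order.
    Returns None if all compared records match.
    """
    for (label_a, sum_a), (label_b, sum_b) in zip(run_a, run_b):
        if label_a != label_b:
            raise ValueError(f"records out of step: {label_a!r} vs {label_b!r}")
        if sum_a != sum_b:
            return label_a
    return None


def agrees_with_reference(new_run, reference, n_first):
    """True if the first n_first records of new_run match the reference."""
    return list(new_run[:n_first]) == list(reference[:n_first])
```

`first_divergence` covers the second bullet (do the two new simulations diverge, and when), and `agrees_with_reference` covers the first (does the new code match the old code for the first few months).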
Hi Dirk, I would suggest adding the flag FFLAGS := $(FFLAGS) -init=zero,arrays to Macros.make, as I have never tested a coupled simulation in DEBUG mode; I have only tested NF2000climo. At first I had a mid-month crash at year 30. After these fixes I did not get any crash in two 100-year simulations. I will also try the test you mention in DEBUG mode.
Since it may be harder to find NO divergence and NO crashes than to find some, I would suggest to use this block in Macros.make
ifeq ($(MODEL),cice)
  FFLAGS := $(FFLAGS) -init=zero,arrays
endif
i.e. limit init to CICE only.
I doubt there are uninitialised arrays in BLOM (Mats has checked this extensively in stand-alone cases).
The divergence might not be linked with initialisation, although at least some of the crashes do seem to be.
@tto061 @monsieuralok @DirkOlivie Can we close this issue? Or should we leave it as it is for now?