Memory leak in N1850 case
Describe the bug
There is a slow-growing memory leak when running the NorESM2.5 N1850 case. The med.log file shows the following:
memory_write: model date = 0001-01-02T00:00:00 memory = 1015.75 MB (usage)
memory_write: model date = 0012-04-30T00:00:00 memory = 1802.26 MB (usage)
The 1802.26 MB is the high-water usage per task, and there are 128 tasks per node, so the model is approaching a required memory of roughly 1802.26 MB x 128 ≈ 230 GB per node.
- NorESM version: NorESM2.5 (https://github.com/mvertens/NorESM/tree/feature/noresm2_5_alpha04_v3)
- HPC platform: betzy
- Compiler (if applicable): intel
- Compset/Resolution : see https://github.com/NorESMhub/noresm_simulations/issues/30
- Error message (if applicable):
To Reproduce
Steps to reproduce the behavior: see https://github.com/NorESMhub/noresm_simulations/issues/30
I believe the memory leak is still an issue. Transferring the milestone to NorESM3.0 so as not to lose sight of this issue.
So, sorry if this is a total tangent, but I discovered for unrelated reasons that ESMF regridding can cause memory leaks if regridding objects are not manually deleted wherever they are created. This is a known problem in the xESMF application, see e.g. https://github.com/JiaweiZhuang/xESMF/issues/53 (which provides a fix for it in xESMF). Could some regridding memory build-up that is not being cleaned up be a possible culprit for the periodic memory leak?
See also: https://github.com/esmf-org/esmf/issues/277
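To make the pattern concrete, here is a minimal Fortran sketch of the failure mode described in those issues. This is illustration only, not code from NorESM; the field/grid setup is omitted, and names like srcField, dstField and nsteps are assumptions.
! Sketch only: regrid weights recomputed every step; memory grows unless the
! routehandle is explicitly released. srcField/dstField are assumed to have
! been created elsewhere (grid and field setup elided).
use ESMF
type(ESMF_Field)       :: srcField, dstField
type(ESMF_RouteHandle) :: rh
integer                :: rc, step, nsteps
do step = 1, nsteps
   ! Each store call allocates new internal routing state ...
   call ESMF_FieldRegridStore(srcField=srcField, dstField=dstField, &
        routehandle=rh, regridmethod=ESMF_REGRIDMETHOD_BILINEAR, rc=rc)
   call ESMF_FieldRegrid(srcField=srcField, dstField=dstField, routehandle=rh, rc=rc)
   ! ... so the handle must be released each time, otherwise memory builds up:
   call ESMF_FieldRegridRelease(routehandle=rh, rc=rc)
end do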
@adagj @maritsandstad @MichaelSchulzMETNO @oyvindseland @gold2718 @TomasTorsvik @monsieuralok - I have made progress in tracking down the memory leak. I am using the following
--compset 1850_CAM70%LT%NORESM%CAMoslo_CLM60%SP_CICE%PRES_DOCN%DOM_SROF_SGLC_SWAV_SESP
--res ne16pg3_ne16pg3_mtn14
The nice thing about this configuration is that it's fast to run and there is a large memory increase right at day 15.
The memory leak occurs in mo_extfrc.F90, in the routine extfrc_timestep_init, in the call to advance_trcdata for the file
/cluster/shared/noresm/inputdata/atm/cam/chem/emis/elev/H2O_emission_CH4_oxidationx2_elev_3DmonthlyL70_1850climoCMIP6piControl001_y21-50avg_c180802.nc
I think the problem is in the 3d regridding routine. This is NOT the only file that has 3d fields - however, it has roughly twice as many levels as the others. I have found that even with other fields in files containing a vertical dimension there is a memory leak - and it looks like it is coming from the line
ierr = pio_get_var( fid, vid, strt, cnt, wrk3d )
in the routine read_3d_trc in module tracer_data.F90.
This is really perplexing to me, since a pio read should not cause memory to increase once the destination array has already been allocated.
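For reference, this is the pattern I would expect - a minimal sketch, not the actual read_3d_trc code; the dimension names are placeholders and r8 is assumed to come from the usual kind module:
! Sketch of the expected read pattern: the buffer is allocated once and then
! filled in place, so repeated reads should not grow the heap.
real(r8), allocatable :: wrk3d(:,:,:)
if (.not. allocated(wrk3d)) allocate(wrk3d(nlon, nlat, nlev))
ierr = pio_get_var( fid, vid, strt, cnt, wrk3d )   ! fills the pre-allocated buffer in place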
It was straightforward to instrument the code with memory checks - see the following:
do m = 1,n_frc_files
   write(mchar,'(i0)') m
   if (masterproc) then
      ! Report per-task memory usage before advancing the tracer data for this file
      ierr = GPTLprint_memusage('memcheck: before adv_trc for file '//trim(mchar)//' for '//trim(forcings(m)%filename))
   end if
   call advance_trcdata( forcings(m)%fields, forcings(m)%file, state, pbuf2d )
   if (masterproc) then
      ! Report per-task memory usage again after the read/interpolation for this file
      ierr = GPTLprint_memusage('memcheck: after adv_trc for file '//trim(mchar)//' for '//trim(forcings(m)%filename))
   end if
end do
The next step is to instrument advance_trcdata for the 3D forcing input/interpolation (not sure which one is causing the leak) to narrow this down further.
@mvertens you are the best.
@mvdebolskiy that's what I thought :)
Great job.
This 3d field is actually not directly connected to the aerosol parameterisation as such, but is needed to get a good representation of water vapor in the stratosphere. @DirkOlivie, you have explained it, but I cannot remember why we had to read in the fields while MAM4 does not.
Fantastic job @mvertens !
@DirkOlivie can explain it better, but the reason this file is no longer used in CESM3, while it is still used in CAM-Nor, is that CAM7 now has prognostic equations for all main GHGs. CAM7-Nor will likely use the CESM3 formulation. https://github.com/NorESMhub/CAM/issues/188
Still, both CESM2 and NorESM2 used this file, so it is surprising that the memory leak came to light in NorESM3 and not in NorESM2.
@oyvindseland - thanks for the update. It turns out it's not only this file, and the situation is totally perplexing. Here is where I am now:
- I have run my case for 1 year. The memory jump seems to occur in the following call in read_3d_trc in module tracer_data.F90. This is very perplexing, since there is no code in this routine that I would expect to cause a memory leak.
if (masterproc) then
   write(6,*) 'memcheck: cnt = ', cnt(:)
   write(6,*) 'memcheck: strt= ', strt(:)
   call mem_getusage('memcheck: before pio_get_var_3d')
end if
! the 3d read that appears to trigger the memory jump
ierr = pio_get_var( fid, vid, strt, cnt, wrk3d )
if (masterproc) then
   call mem_getusage('memcheck: after pio_get_var_3d')
end if
- Although there are quite a few multi-level datasets, most seem to have 30 levels or less. However, there are two 70-level datasets that seem to result in periodic spikes at mid-month:
- /cluster/shared/noresm/inputdata/atm/cam/chem/emis/elev/H2O_emission_CH4_oxidationx2_elev_3DmonthlyL70_1850climoCMIP6piControl001_y21-50avg_c180802.nc
- /cluster/shared/noresm/inputdata/atm/cam/tracer_cnst/tracer_cnst_WACCM6_halons_3DmonthlyL70_1850climoCMIP6piControl001_y21-50avg_c180802.nc
- Summarized output from the instrumented code is below. These correspond exactly to the memory jumps in the med.log file.
memory_write: model date = 0001-01-15T00:00:00 memory = -0.00 MB (highwater) 768.20 MB (usage)
0: memcheck: before adv_trc for file 17 749.23 MB
0: memcheck: READ_NEXT_TRCDATA CH4_CHML
0: memcheck: before read_next flds 749.23 MB
0: memcheck: cnt = 288 192 70 1
0: memcheck: strt= 1 1 1 1
0: memcheck: before pio_get_var_3d 749.23 MB
0: memcheck: after pio_get_var_3d 768.02 MB
0: memcheck: cnt = 288 192 70 1
0: memcheck: strt= 1 1 1 2
0: memcheck: before pio_get_var_3d 768.02 MB
0: memcheck: after pio_get_var_3d 768.20 MB
0: memcheck: after read_next flds 768.20 MB
0: memcheck: after adv_trc for file 17 768.20 MB
memory_write: model date = 0001-01-17T00:00:00 memory = -0.00 MB (highwater) 770.60 MB (usage)
0: memcheck: READ_NEXT_TRCDATA O3
0: memcheck: READ_NEXT_TRCDATA OH
0: memcheck: READ_NEXT_TRCDATA NO3
0: memcheck: READ_NEXT_TRCDATA HO2
0: memcheck: before read_next flds 768.20 MB
0: memcheck: cnt = 288 192 70 1
0: memcheck: strt= 1 1 1 1
0: memcheck: before pio_get_var_3d 768.20 MB
0: memcheck: after pio_get_var_3d 770.57 MB
memory_write: model date = 0001-06-16T00:00:00 memory = -0.00 MB (highwater) 785.85 MB (usage)
0: memcheck: READ_NEXT_TRCDATA O3
0: memcheck: READ_NEXT_TRCDATA OH
0: memcheck: READ_NEXT_TRCDATA NO3
0: memcheck: READ_NEXT_TRCDATA HO2
0: memcheck: before read_next flds 770.77 MB
0: memcheck: cnt = 288 192 70 1
0: memcheck: strt= 1 1 1 6
0: memcheck: before pio_get_var_3d 770.77 MB
0: memcheck: after pio_get_var_3d 785.85 MB
memory_write: model date = 0001-08-15T00:00:00 memory = -0.00 MB (highwater) 786.36 MB (usage)
0: memcheck: before adv_trc for file 17 785.85 MB
0: memcheck: READ_NEXT_TRCDATA CH4_CHML
0: memcheck: before read_next flds 785.85 MB
0: memcheck: cnt = 288 192 70 1
0: memcheck: strt= 1 1 1 8
0: memcheck: before pio_get_var_3d 785.85 MB
0: memcheck: after pio_get_var_3d 786.20 MB
0: memcheck: cnt = 288 192 70 1
0: memcheck: strt= 1 1 1 9
0: memcheck: before pio_get_var_3d 786.20 MB
0: memcheck: after pio_get_var_3d 786.36 MB
0: memcheck: after read_next flds 786.36 MB
0: memcheck: after adv_trc for file 17 786.36 MB
memory_write: model date = 0001-09-15T00:00:00 memory = -0.00 MB (highwater) 794.36 MB (usage)
0: memcheck: before adv_trc for file 17 786.52 MB
0: memcheck: READ_NEXT_TRCDATA CH4_CHML
0: memcheck: before read_next flds 786.52 MB
0: memcheck: cnt = 288 192 70 1
0: memcheck: strt= 1 1 1 9
0: memcheck: before pio_get_var_3d 786.52 MB
0: memcheck: after pio_get_var_3d 794.17 MB
0: memcheck: cnt = 288 192 70 1
0: memcheck: strt= 1 1 1 10
0: memcheck: before pio_get_var_3d 794.17 MB
0: memcheck: after pio_get_var_3d 794.36 MB
0: memcheck: after read_next flds 794.36 MB
0: memcheck: after adv_trc for file 17 794.36 MB
memory_write: model date = 0001-09-17T00:00:00 memory = -0.00 MB (highwater) 809.85 MB (usage)
0: memcheck: READ_NEXT_TRCDATA O3
0: memcheck: READ_NEXT_TRCDATA OH
0: memcheck: READ_NEXT_TRCDATA NO3
0: memcheck: READ_NEXT_TRCDATA HO2
0: memcheck: before read_next flds 794.36 MB
0: memcheck: cnt = 288 192 70 1
0: memcheck: strt= 1 1 1 9
0: memcheck: before pio_get_var_3d 794.36 MB
0: memcheck: after pio_get_var_3d 809.85 MB
memory_write: model date = 0001-11-16T00:00:00 memory = -0.00 MB (highwater) 826.45 MB (usage)
0: memcheck: READ_NEXT_TRCDATA O3
0: memcheck: READ_NEXT_TRCDATA OH
0: memcheck: READ_NEXT_TRCDATA NO3
0: memcheck: READ_NEXT_TRCDATA HO2
0: memcheck: before read_next flds 811.03 MB
0: memcheck: cnt = 288 192 70 1
0: memcheck: strt= 1 1 1 11
0: memcheck: before pio_get_var_3d 811.03 MB
0: memcheck: after pio_get_var_3d 826.45 MB
- The above just makes no sense to me. Two dates are always read in whenever new data needs to be read - the lower and upper bound bracketing the model date (see the sketch after this list). The large memory jump - when it happens - always occurs upon reading in the lower bound; there is often only a very small memory jump upon reading in the upper bound. I have gone over this with @gold2718 as well, and he suggested that I rerun the one-year run with no optimization to verify that this is indeed where the memory jump is occurring.
- I'm happy to talk about this with anyone who might have ideas about the above.
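For clarity, here is a generic sketch of the two-record bracketing described above. This is illustrative only, not the tracer_data.F90 code; all names are placeholders.
! Illustrative only: data are held for a lower-bound and an upper-bound time
! record and interpolated to the model time. When the model date moves past
! the current bracket, both the lower- and upper-bound records are read from
! the file - these are the reads where the memory jumps appear.
real(r8) :: t_model, t_lb, t_ub, w
real(r8), allocatable :: data_lb(:,:,:), data_ub(:,:,:), data_now(:,:,:)
w = (t_ub - t_model) / (t_ub - t_lb)
data_now = w*data_lb + (1.0_r8 - w)*data_ub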
A few more details to add to the confusion.
- I ran the same test with gnu rather than intel (in non-debug mode) and the results look different. I still see memory spikes with READ_NEXT_TRCDATA CH4_CHML, but the memory also decreases periodically, which did not happen with intel. Also, only tracer_data.F90 was compiled with -O0 optimization.
memory_write: model date = 0001-12-23T00:00:00 memory = 7496.62 MB (highwater) 781.34 MB (usage)
memory_write: model date = 0001-01-14T00:00:00 memory = 7320.89 MB (highwater) 733.63 MB (usage)
memory_write: model date = 0001-01-15T00:00:00 memory = 7329.17 MB (highwater) 742.13 MB (usage) <=== same jump (READ_NEXT_TRCDATA CH4_CHML)
memory_write: model date = 0001-01-26T00:00:00 memory = 7329.17 MB (highwater) 712.72 MB (usage) <=== new drop
memory_write: model date = 0001-02-02T00:00:00 memory = 7334.42 MB (highwater) 719.40 MB (usage)
memory_write: model date = 0001-02-19T00:00:00 memory = 7340.93 MB (highwater) 724.32 MB (usage)
memory_write: model date = 0001-03-15T00:00:00 memory = 7354.96 MB (highwater) 731.57 MB (usage) (READ_NEXT_TRCDATA CH4_CHML)
memory_write: model date = 0001-04-12T00:00:00 memory = 7366.08 MB (highwater) 730.86 MB (usage)
memory_write: model date = 0001-04-19T00:00:00 memory = 7371.15 MB (highwater) 736.27 MB (usage)
memory_write: model date = 0001-05-01T00:00:00 memory = 7384.34 MB (highwater) 737.54 MB (usage)
memory_write: model date = 0001-05-14T00:00:00 memory = 7384.34 MB (highwater) 736.07 MB (usage)
memory_write: model date = 0001-05-15T00:00:00 memory = 7393.54 MB (highwater) 745.31 MB (usage)
memory_write: model date = 0001-06-01T00:00:00 memory = 7404.52 MB (highwater) 747.32 MB (usage)
memory_write: model date = 0001-06-15T00:00:00 memory = 7407.57 MB (highwater) 749.53 MB (usage)
memory_write: model date = 0001-07-01T00:00:00 memory = 7415.89 MB (highwater) 750.83 MB (usage)
memory_write: model date = 0001-07-15T00:00:00 memory = 7421.51 MB (highwater) 756.33 MB (usage) (READ_NEXT_TRCDATA CH4_CHML)
memory_write: model date = 0001-07-23T00:00:00 memory = 7421.51 MB (highwater) 755.03 MB (usage)
memory_write: model date = 0001-08-15T00:00:00 memory = 7442.37 MB (highwater) 764.17 MB (usage)
memory_write: model date = 0001-10-15T00:00:00 memory = 7472.64 MB (highwater) 773.10 MB (usage) (READ_NEXT_TRCDATA CH4_CHML)
memory_write: model date = 0001-11-14T00:00:00 memory = 7480.75 MB (highwater) 773.29 MB (usage)
memory_write: model date = 0001-11-15T00:00:00 memory = 7485.52 MB (highwater) 778.10 MB (usage) (READ_NEXT_TRCDATA CH4_CHML)
memory_write: model date = 0001-11-27T00:00:00 memory = 7485.52 MB (highwater) 777.68 MB (usage)
memory_write: model date = 0001-12-14T00:00:00 memory = 7493.12 MB (highwater) 778.72 MB (usage)
memory_write: model date = 0001-12-15T00:00:00 memory = 7496.62 MB (highwater) 782.16 MB (usage) (READ_NEXT_TRCDATA CH4_CHML)
Compiling in DEBUG mode with intel, memory jumps like the above still happen - just not on the same dates as in non-debug mode.
- [x] try OsloAero on derecho
- [ ] Try betzy without OsloAero
- [ ] metadata comparison for emission files
I managed to repeat the ne16 OsloAero run on derecho.
- the memory jump for CH4_CHML occurs, but behaves differently than on betzy: it goes down again in the mediator log
Model Date: 0001-01-14T00:00:00 wall clock = 2025-05-09T03:39:19 avg dt = 18.44 s/day, dt = 18.36 s/day, rate = 12.89 ypd
memory_write: model date = 0001-01-14T00:00:00 memory = 1671.32 MB (highwater) 679.45 MB (usage)
Model Date: 0001-01-15T00:00:00 wall clock = 2025-05-09T03:39:38 avg dt = 18.45 s/day, dt = 18.61 s/day, rate = 12.72 ypd
memory_write: model date = 0001-01-15T00:00:00 memory = 1701.32 MB (highwater) 709.56 MB (usage)
Model Date: 0001-01-16T00:00:00 wall clock = 2025-05-09T03:39:58 avg dt = 18.57 s/day, dt = 20.25 s/day, rate = 11.69 ypd
memory_write: model date = 0001-01-16T00:00:00 memory = 1687.71 MB (highwater) 697.05 MB (usage)
It also behaves differently in the instrumentation:
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: filename is /glade/campaign/cesm/cesmdata/inputdata/atm/cam/chem/emis/elev/H2O_emission_CH4_oxidationx2_elev_3DmonthlyL70_1850climoCMIP6piControl001_y21-50avg_c180802.nc
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: before adv_trc for file 1 679.70 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: READ_NEXT_TRCDATA CH4_CHML
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: before read_next flds 679.70 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: cnt = 288 192 70 1
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: strt= 1 1 1 1
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: before pio_get_var_3d 681.70 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: after pio_get_var_3d 709.45 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: cnt = 288 192 70 1
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: strt= 1 1 1 2
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: before pio_get_var_3d 681.82 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: after pio_get_var_3d 709.56 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: after read_next flds 709.56 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: after adv_trc for file 1 709.56 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: filename is /glade/derecho/scratch/mvertens/noresm_emissions/emissions_cmip6_noresm2_SO4_PR_volcCONTEXPL_vertical_1850_0.9x1.25_version20180512.nc
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: before adv_trc for file 17 709.56 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: after adv_trc for file 17 709.56 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: filename is /glade/campaign/cesm/cesmdata/inputdata/atm/cam/chem/emis/elev/H2O_emission_CH4_oxidationx2_elev_3DmonthlyL70_1850climoCMIP6piControl001_y21-50avg_c180802.nc
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: before adv_trc for file 1 679.82 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: after adv_trc for file 1 679.82 MB
- There is still a persistent, slow-growing memory increase during the model run on derecho - but it is no longer isolated to the mid-month reads.
memory_write: model date = 0001-01-02T00:00:00 memory = 1667.32 MB (highwater) 675.75 MB (usage)
memory_write: model date = 0001-12-31T00:00:00 memory = 1776.96 MB (highwater) 758.41 MB (usage)
- I believe that the above shows there is also a slow-growing memory leak on betzy - but the memcheck behavior is hiding it.
- I have looked at memory output from recent runs on derecho, and there is definitely a memory increase during the CESM model run as well.
Hi @maritsandstad @mvdebolskiy @mvertens @gold2718 @monsieuralok - the simulations have to spend quite some time queuing on Betzy, and it would help if we could run longer simulations before resubmitting. Do we have any ideas or solutions for the memory leaks? It would be great if we could prioritize this issue.
Best, Ada
@adagj - I was not able to track this down despite extensive work. Looking at some issues for CTSM on derecho, the following can be done to help track this down:
Sometimes there is a delay in the runtime system releasing memory back to the OS. You can force this release by adding a call to shr_malloc_trim() before printing out the memory usage. shr_malloc_trim() explicitly releases all memory not needed by the runtime back to the OS.
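A minimal sketch of how that could be wired into the instrumentation used earlier in this issue (hedged: the module providing shr_malloc_trim is an assumption here; masterproc and GPTLprint_memusage are as in the memcheck snippet above):
if (masterproc) then
   call shr_malloc_trim()   ! ask the runtime to return freed heap pages to the OS
   ! then sample memory, so the report reflects real usage rather than allocator caching
   ierr = GPTLprint_memusage('memcheck: after adv_trc for file '//trim(mchar))
end if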
On the memory leak: I tried to reduce the number of cores to get into the queue. For an AMIP type of simulation with NorESM3beta04, the OOM happened in year 5 for a ne30 run. For now I will just recompile with 1920 cores, but what I should have thought about is that if you build the model with fewer cores, you also have to reduce the number of years before resubmitting.
I looked into the memory log, and it is a mid-month problem as before. In addition, for some (random?) reason the memory increase was much larger at mid-December of year 1 than I saw for any later December in the simulation.