
memory leak in n1850 case

Open mvertens opened this issue 1 year ago • 17 comments

Describe the bug
There is a slow-growing memory leak when running the NorESM2.5 N1850 case. The med.log file shows the following:

 memory_write: model date = 0001-01-02T00:00:00 memory = 1015.75 MB (usage)
 memory_write: model date = 0012-04-30T00:00:00 memory = 1802.26 MB (usage)

The 1802.26 MB is the high-water mark per task - and there are 128 tasks per node - so the model is hitting a required memory of around 230 GB per node.


  • NorESM version: NorESM2.5 (https://github.com/mvertens/NorESM/tree/feature/noresm2_5_alpha04_v3)
  • HPC platform: betzy
  • Compiler (if applicable): intel
  • Compset/Resolution : see https://github.com/NorESMhub/noresm_simulations/issues/30
  • Error message (if applicable):

To Reproduce Steps to reproduce the behavior: See https://github.com/NorESMhub/noresm_simulations/issues/30

mvertens avatar Aug 23 '24 18:08 mvertens

I believe the memory leak is still an issue. Transferring the milestone to NorESM3.0 so as not to lose sight of this issue.

TomasTorsvik avatar Feb 06 '25 22:02 TomasTorsvik

So, sorry if this is a total tangent, but I discovered for unrelated reasons that ESMF regridding can cause memory leaks if regridding objects are not explicitly deleted by the caller. This is a known problem in the xESMF application (see e.g. https://github.com/JiaweiZhuang/xESMF/issues/53, which provides a fix for it in xESMF). Could some regridding memory build-up that is never released be a possible culprit for the periodic memory leak?

maritsandstad avatar Apr 25 '25 11:04 maritsandstad

See also: https://github.com/esmf-org/esmf/issues/277

maritsandstad avatar Apr 25 '25 11:04 maritsandstad

@adagj @maritsandstad @MichaelSchulzMETNO @oyvindseland @gold2718 @TomasTorsvik @monsieuralok - I have made progress in tracking down the memory leak. I am using the following

--compset 1850_CAM70%LT%NORESM%CAMoslo_CLM60%SP_CICE%PRES_DOCN%DOM_SROF_SGLC_SWAV_SESP
--res ne16pg3_ne16pg3_mtn14

The nice thing about this configuration is that it's fast to run, and there is a large memory increase right at day 15.

The memory leak occurs in mo_extfrc.F90 in the routine extfrc_timestep_init and in the call to advance_trcdata for the file /cluster/shared/noresm/inputdata/atm/cam/chem/emis/elev/H2O_emission_CH4_oxidationx2_elev_3DmonthlyL70_1850climoCMIP6piControl001_y21-50avg_c180802.nc

I think the problem is in the 3D regridding routine. This is NOT the only file that has 3D fields - however, it has around twice as many levels as the other files. I have found that even with other fields in files containing a vertical dimension there is a memory leak - and it looks like it's coming from the line

ierr = pio_get_var( fid, vid, strt, cnt, wrk3d )

in the routine read_3d_trc in module tracer_data.F90

This is really perplexing to me, since a pio read should not cause the memory to increase once the buffer has already been allocated.

It was straightforward to instrument the code with memory-usage checks - see the following

    do m = 1,n_frc_files
       write(mchar,'(i0)') m
       if (masterproc) then
          ierr = GPTLprint_memusage('memcheck: before adv_trc for file '//trim(mchar)//' for '//trim(forcings(m)%filename))
       end if
       call advance_trcdata( forcings(m)%fields, forcings(m)%file, state, pbuf2d  )
       if (masterproc) then
          ierr = GPTLprint_memusage('memcheck: after adv_trc for file '//trim(mchar)//' for '//trim(forcings(m)%filename))
       end if
    end do

The next step is to instrument advance_trcdata for the 3D forcing input/interpolation (I am not sure which one is causing the leak) to narrow this down further.

mvertens avatar May 01 '25 19:05 mvertens

@mvertens you are the best.

mvdebolskiy avatar May 02 '25 06:05 mvdebolskiy

@mvdebolskiy that's what I thought :)

rosiealice avatar May 02 '25 07:05 rosiealice

Great job.

This 3D field is actually not directly connected to the aerosol parameterisation as such, but is needed to get a good representation of water vapor in the stratosphere. @DirkOlivie You have explained it, but I cannot remember why we had to read in the fields while MAM4 does not.

oyvindseland avatar May 02 '25 07:05 oyvindseland

Fantastic job @mvertens !

maritsandstad avatar May 02 '25 08:05 maritsandstad

@DirkOlivie Dirk can explain it better, but the reason why this file is no longer used in CESM3 yet is still used in CAM-Nor is that CAM7 now has prognostic equations for all main GHGs. CAM7-Nor will likely use the CESM3 formulation. https://github.com/NorESMhub/CAM/issues/188

Still, both CESM2 and NorESM2 used this file, so it is surprising that the memory leak came to light in NorESM3 and not in NorESM2.

oyvindseland avatar May 05 '25 10:05 oyvindseland

@oyvindseland - thanks for the update. It turns out it's not only this file, and the situation is totally perplexing. Here is where I am now.

  • I have run my case for 1 year. The memory jump seems to occur in the following call in read_3d_trc in module tracer_data.F90. This is very perplexing, since there is no actual code in this routine that would cause a memory leak.
    if (masterproc) then
       write(6,*)'memcheck: cnt = ',cnt(:)
       write(6,*)'memcheck: strt= ',strt(:)
       call mem_getusage('memcheck: before pio_get_var_3d')
    end if
    ierr = pio_get_var( fid, vid, strt, cnt, wrk3d )
    if (masterproc) then
       call mem_getusage('memcheck: after  pio_get_var_3d')
    end if

  • Although there are quite a few multi-level datasets, most seem to have 30 levels or fewer. However, there are two 70-level datasets that seem to result in periodic spikes at mid-month:
    /cluster/shared/noresm/inputdata/atm/cam/chem/emis/elev/H2O_emission_CH4_oxidationx2_elev_3DmonthlyL70_1850climoCMIP6piControl001_y21-50avg_c180802.nc
    /cluster/shared/noresm/inputdata/atm/cam/tracer_cnst/tracer_cnst_WACCM6_halons_3DmonthlyL70_1850climoCMIP6piControl001_y21-50avg_c180802.nc
  • A summarized output of the instrumented code is below. These entries correspond exactly to the memory jumps in the med.log file.
 memory_write: model date = 0001-01-15T00:00:00 memory =      -0.00 MB (highwater)        768.20 MB (usage)
  0: memcheck: before adv_trc for file 17     749.23 MB
  0: memcheck: READ_NEXT_TRCDATA CH4_CHML
  0: memcheck: before read_next flds     749.23 MB
  0:  memcheck: cnt =          288         192          70           1
  0:  memcheck: strt=            1           1           1           1
  0: memcheck: before pio_get_var_3d     749.23 MB
  0: memcheck: after  pio_get_var_3d     768.02 MB
  0:  memcheck: cnt =          288         192          70           1
  0:  memcheck: strt=            1           1           1           2
  0: memcheck: before pio_get_var_3d     768.02 MB
  0: memcheck: after  pio_get_var_3d     768.20 MB
  0: memcheck: after  read_next flds     768.20 MB
  0: memcheck: after adv_trc for file 17     768.20 MB

 memory_write: model date = 0001-01-17T00:00:00 memory =      -0.00 MB (highwater)        770.60 MB (usage)
  0: memcheck: READ_NEXT_TRCDATA O3
  0: memcheck: READ_NEXT_TRCDATA OH
  0: memcheck: READ_NEXT_TRCDATA NO3
  0: memcheck: READ_NEXT_TRCDATA HO2
  0: memcheck: before read_next flds     768.20 MB
  0:  memcheck: cnt =          288         192          70           1
  0:  memcheck: strt=            1           1           1           1
  0: memcheck: before pio_get_var_3d     768.20 MB
  0: memcheck: after  pio_get_var_3d     770.57 MB

memory_write: model date = 0001-06-16T00:00:00 memory =      -0.00 MB (highwater)        785.85 MB (usage)
  0: memcheck: READ_NEXT_TRCDATA O3
  0: memcheck: READ_NEXT_TRCDATA OH
  0: memcheck: READ_NEXT_TRCDATA NO3
  0: memcheck: READ_NEXT_TRCDATA HO2
  0: memcheck: before read_next flds     770.77 MB
  0:  memcheck: cnt =          288         192          70           1
  0:  memcheck: strt=            1           1           1           6
  0: memcheck: before pio_get_var_3d     770.77 MB
  0: memcheck: after  pio_get_var_3d     785.85 MB

 memory_write: model date = 0001-08-15T00:00:00 memory =      -0.00 MB (highwater)        786.36 MB (usage)
  0: memcheck: before adv_trc for file 17     785.85 MB
  0: memcheck: READ_NEXT_TRCDATA CH4_CHML
  0: memcheck: before read_next flds     785.85 MB
  0:  memcheck: cnt =          288         192          70           1
  0:  memcheck: strt=            1           1           1           8
  0: memcheck: before pio_get_var_3d     785.85 MB
  0: memcheck: after  pio_get_var_3d     786.20 MB
  0:  memcheck: cnt =          288         192          70           1
  0:  memcheck: strt=            1           1           1           9
  0: memcheck: before pio_get_var_3d     786.20 MB
  0: memcheck: after  pio_get_var_3d     786.36 MB
  0: memcheck: after  read_next flds     786.36 MB
  0: memcheck: after adv_trc for file 17     786.36 MB

 memory_write: model date = 0001-09-15T00:00:00 memory =      -0.00 MB (highwater)        794.36 MB (usage)
  0: memcheck: before adv_trc for file 17     786.52 MB
  0: memcheck: READ_NEXT_TRCDATA CH4_CHML
  0: memcheck: before read_next flds     786.52 MB
  0:  memcheck: cnt =          288         192          70           1
  0:  memcheck: strt=            1           1           1           9
  0: memcheck: before pio_get_var_3d     786.52 MB
  0: memcheck: after  pio_get_var_3d     794.17 MB
  0:  memcheck: cnt =          288         192          70           1
  0:  memcheck: strt=            1           1           1          10
  0: memcheck: before pio_get_var_3d     794.17 MB
  0: memcheck: after  pio_get_var_3d     794.36 MB
  0: memcheck: after  read_next flds     794.36 MB
  0: memcheck: after adv_trc for file 17     794.36 MB

 memory_write: model date = 0001-09-17T00:00:00 memory =      -0.00 MB (highwater)        809.85 MB (usage)
  0: memcheck: READ_NEXT_TRCDATA O3
  0: memcheck: READ_NEXT_TRCDATA OH
  0: memcheck: READ_NEXT_TRCDATA NO3
  0: memcheck: READ_NEXT_TRCDATA HO2
  0: memcheck: before read_next flds     794.36 MB
  0:  memcheck: cnt =          288         192          70           1
  0:  memcheck: strt=            1           1           1           9
  0: memcheck: before pio_get_var_3d     794.36 MB
  0: memcheck: after  pio_get_var_3d     809.85 MB

 memory_write: model date = 0001-11-16T00:00:00 memory =      -0.00 MB (highwater)        826.45 MB (usage)
  0: memcheck: READ_NEXT_TRCDATA O3
  0: memcheck: READ_NEXT_TRCDATA OH
  0: memcheck: READ_NEXT_TRCDATA NO3
  0: memcheck: READ_NEXT_TRCDATA HO2
  0: memcheck: before read_next flds     811.03 MB
  0:  memcheck: cnt =          288         192          70           1
  0:  memcheck: strt=            1           1           1          11
  0: memcheck: before pio_get_var_3d     811.03 MB
  0: memcheck: after  pio_get_var_3d     826.45 MB
  • The above just makes no sense to me. Two dates are always read in whenever new data needs to be read - the lower and upper bounds bracketing the model date. The large memory jump - when it happens - always occurs upon reading in the lower bound. There is often a very small memory jump upon reading in the upper bound. I have gone over this with @gold2718 as well - and he suggested that I rerun the one-year run with no optimization, to verify that this is indeed where the memory jump is occurring.
  • I'm happy to talk about this with anyone that might have any ideas as to the above.

mvertens avatar May 05 '25 11:05 mvertens

A few more details to add to the confusion.

  • I ran the same test with gnu rather than intel (in non-debug mode) and the results look different. I see memory spikes with READ_NEXT_TRCDATA CH4_CHML - but the memory also decreases periodically, which did not happen with intel. Also, only tracer_data.F90 is compiled with -O0 optimization.
 memory_write: model date = 0001-12-23T00:00:00 memory =    7496.62 MB (highwater)        781.34 MB (usage)
 memory_write: model date = 0001-01-14T00:00:00 memory =    7320.89 MB (highwater)        733.63 MB (usage)
 memory_write: model date = 0001-01-15T00:00:00 memory =    7329.17 MB (highwater)        742.13 MB (usage) <=== same jump (READ_NEXT_TRCDATA CH4_CHML)
 memory_write: model date = 0001-01-26T00:00:00 memory =    7329.17 MB (highwater)        712.72 MB (usage) <=== new drop
 memory_write: model date = 0001-02-02T00:00:00 memory =    7334.42 MB (highwater)        719.40 MB (usage)
 memory_write: model date = 0001-02-19T00:00:00 memory =    7340.93 MB (highwater)        724.32 MB (usage)
 memory_write: model date = 0001-03-15T00:00:00 memory =    7354.96 MB (highwater)        731.57 MB (usage) (READ_NEXT_TRCDATA CH4_CHML)
 memory_write: model date = 0001-04-12T00:00:00 memory =    7366.08 MB (highwater)        730.86 MB (usage)
 memory_write: model date = 0001-04-19T00:00:00 memory =    7371.15 MB (highwater)        736.27 MB (usage)
 memory_write: model date = 0001-05-01T00:00:00 memory =    7384.34 MB (highwater)        737.54 MB (usage)
 memory_write: model date = 0001-05-14T00:00:00 memory =    7384.34 MB (highwater)        736.07 MB (usage)
 memory_write: model date = 0001-05-15T00:00:00 memory =    7393.54 MB (highwater)        745.31 MB (usage)
 memory_write: model date = 0001-06-01T00:00:00 memory =    7404.52 MB (highwater)        747.32 MB (usage)
 memory_write: model date = 0001-06-15T00:00:00 memory =    7407.57 MB (highwater)        749.53 MB (usage)
 memory_write: model date = 0001-07-01T00:00:00 memory =    7415.89 MB (highwater)        750.83 MB (usage)
 memory_write: model date = 0001-07-15T00:00:00 memory =    7421.51 MB (highwater)        756.33 MB (usage) (READ_NEXT_TRCDATA CH4_CHML)
 memory_write: model date = 0001-07-23T00:00:00 memory =    7421.51 MB (highwater)        755.03 MB (usage)
 memory_write: model date = 0001-08-15T00:00:00 memory =    7442.37 MB (highwater)        764.17 MB (usage)
 memory_write: model date = 0001-10-15T00:00:00 memory =    7472.64 MB (highwater)        773.10 MB (usage) (READ_NEXT_TRCDATA CH4_CHML)
 memory_write: model date = 0001-11-14T00:00:00 memory =    7480.75 MB (highwater)        773.29 MB (usage)
 memory_write: model date = 0001-11-15T00:00:00 memory =    7485.52 MB (highwater)        778.10 MB (usage) (READ_NEXT_TRCDATA CH4_CHML)
 memory_write: model date = 0001-11-27T00:00:00 memory =    7485.52 MB (highwater)        777.68 MB (usage)
 memory_write: model date = 0001-12-14T00:00:00 memory =    7493.12 MB (highwater)        778.72 MB (usage)
 memory_write: model date = 0001-12-15T00:00:00 memory =    7496.62 MB (highwater)        782.16 MB (usage) (READ_NEXT_TRCDATA CH4_CHML)

mvertens avatar May 07 '25 15:05 mvertens

Compiling in DEBUG mode with intel, memory jumps like the above still happen - just not at the same dates as in non-debug mode.

mvertens avatar May 07 '25 19:05 mvertens

  • [x] try OsloAero on derecho
  • [ ] Try betzy without OsloAero
  • [ ] metadata comparison for emission files

mvdebolskiy avatar May 08 '25 14:05 mvdebolskiy

I managed to repeat the ne16 OsloAero run on derecho.

  • the memory jump for CH4_CHML occurs - but behaves differently than on betzy: it goes down again in the mediator log
Model Date: 0001-01-14T00:00:00 wall clock = 2025-05-09T03:39:19 avg dt =    18.44 s/day, dt =    18.36 s/day, rate =    12.89 ypd
 memory_write: model date = 0001-01-14T00:00:00 memory =    1671.32 MB (highwater)        679.45 MB (usage)
Model Date: 0001-01-15T00:00:00 wall clock = 2025-05-09T03:39:38 avg dt =    18.45 s/day, dt =    18.61 s/day, rate =    12.72 ypd
 memory_write: model date = 0001-01-15T00:00:00 memory =    1701.32 MB (highwater)        709.56 MB (usage)
Model Date: 0001-01-16T00:00:00 wall clock = 2025-05-09T03:39:58 avg dt =    18.57 s/day, dt =    20.25 s/day, rate =    11.69 ypd
 memory_write: model date = 0001-01-16T00:00:00 memory =    1687.71 MB (highwater)        697.05 MB (usage)

It also behaves differently in the instrumentation

dec0945.hsn.de.hpc.ucar.edu 0: memcheck: filename is /glade/campaign/cesm/cesmdata/inputdata/atm/cam/chem/emis/elev/H2O_emission_CH4_oxidationx2_elev_3DmonthlyL70_1850climoCMIP6piControl001_y21-50avg_c180802.nc
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: before adv_trc for file 1     679.70 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: READ_NEXT_TRCDATA CH4_CHML
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: before read_next flds     679.70 MB
dec0945.hsn.de.hpc.ucar.edu 0:  memcheck: cnt =          288         192          70           1
dec0945.hsn.de.hpc.ucar.edu 0:  memcheck: strt=            1           1           1           1
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: before pio_get_var_3d     681.70 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: after  pio_get_var_3d     709.45 MB
dec0945.hsn.de.hpc.ucar.edu 0:  memcheck: cnt =          288         192          70           1
dec0945.hsn.de.hpc.ucar.edu 0:  memcheck: strt=            1           1           1           2
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: before pio_get_var_3d     681.82 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: after  pio_get_var_3d     709.56 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: after  read_next flds     709.56 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: after adv_trc for file 1     709.56 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: filename is /glade/derecho/scratch/mvertens/noresm_emissions/emissions_cmip6_noresm2_SO4_PR_volcCONTEXPL_vertical_1850_0.9x1.25_version20180512.nc
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: before adv_trc for file 17     709.56 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: after adv_trc for file 17     709.56 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: filename is /glade/campaign/cesm/cesmdata/inputdata/atm/cam/chem/emis/elev/H2O_emission_CH4_oxidationx2_elev_3DmonthlyL70_1850climoCMIP6piControl001_y21-50avg_c180802.nc
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: before adv_trc for file 1     679.82 MB
dec0945.hsn.de.hpc.ucar.edu 0: memcheck: after adv_trc for file 1     679.82 MB
  • There is still persistent, slow-growing memory use during the model run on derecho - but it is no longer isolated to the mid-month reads.
memory_write: model date = 0001-01-02T00:00:00 memory =    1667.32 MB (highwater)        675.75 MB (usage)
memory_write: model date = 0001-12-31T00:00:00 memory =    1776.96 MB (highwater)        758.41 MB (usage)
  • I believe the above shows that there is also a slow-growing memory leak on betzy - but the memcheck behavior is hiding it.
  • I have looked at some memory output from recent runs on derecho - and there is definitely a memory increase during the CESM model run as well.

mvertens avatar May 12 '25 08:05 mvertens

Hi @maritsandstad @mvdebolskiy @mvertens @gold2718 @monsieuralok - The simulations have to spend quite some time queuing on Betzy, and it would help if we could run longer simulations before resubmitting. Do we have any ideas or solutions for the memory leaks? It would be great if we could prioritize this issue.

Best, Ada

adagj avatar Aug 29 '25 14:08 adagj

@adagj - I was not able to track this down after extensive work. Looking at some issues for CTSM on derecho, the following can be done to help track this down: sometimes there is a delay in the runtime system releasing memory back to the OS. You can force this release by adding a call to shr_malloc_trim() before printing the memory usage. shr_malloc_trim() explicitly releases all memory not needed by the runtime back to the OS.

mvertens avatar Aug 29 '25 14:08 mvertens

On the memory leak: I tried to reduce the number of cores to get into the queue. For an AMIP-type simulation with NorESM3beta04, the OOM happened in year 5 for an ne30 run. For now I will just recompile with 1920 cores, but as I should have realized, if you run the model with fewer cores you also have to reduce the number of years before a resubmit.

I looked into the memory log, and it is a mid-month problem as before. In addition, for some (random?) reason the memory increase was much larger at mid-December of year 1 than for any later December in the simulation.

oyvindseland avatar Nov 10 '25 10:11 oyvindseland