MOM5 using a lot of memory
We're finding that MOM5 is sometimes killed by the OS on startup. We assume this is from excessive memory use.
It would be nice to review memory consumption and reduce it where possible. For example, FMS contains a lot of hard-coded maximums.
FWIW, the FMS upgrade may have inadvertently caused this.
Agree that most of those structures should be replaced with dynamic arrays ("lists"), but I don't know of any particularly big ones. Maybe the MPI message buffers?
Yeah the message buffers are big. These can be reduced a bit by using:
    &ocean_domains_nml
        max_tracers = 5
    /
The default is 10.
I just emailed this, but probably this is a better location ...
Marshall, do you recall saying you would move FMS to a submodule of MOM?
As Nic has started hacking on FMS a little, and there are some things I might like to change, it would likely be cleaner and easier to do this and then rebase any changes on top of FMS updates.
Do you still have time to do this?
I can do this, but if we are just retuning parameters and GFDL doesn't adopt the changes, then we are back where we started.
That's why I think it would be easier to have the changes in a separate FMS submodule. If GFDL doesn't like them, we can just rebase our changes on top of their releases. Otherwise we have to manually find all the changes to the FMS code and either create patches or hand-edit the source. Or is there an easier way?
Yeah, I don't know why I thought we'd have to use the NOAA-GFDL version. This makes sense.
I'll try to work on it tomorrow. (We are two versions behind FMS anyway.)
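For reference, the submodule workflow discussed above would look roughly like this. This is a sketch only: the target path src/shared, the branch name, and the release tag placeholder are assumptions, not a tested recipe.

    # replace the vendored FMS sources with a submodule (target path is an assumption)
    git rm -r src/shared
    git submodule add https://github.com/NOAA-GFDL/FMS.git src/shared
    git commit -m "Track FMS as a submodule instead of vendored source"

    # carry local FMS changes on a branch inside the submodule and
    # rebase them onto each new upstream release
    cd src/shared
    git checkout -b local-changes          # hypothetical branch name
    git fetch origin --tags
    git rebase <upstream-release-tag> local-changes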
Also, does anyone have any memory profile output (massif or otherwise) showing where the memory is going?
We have known for a year or so that MOM5 is a bit of a memory hog, using about 2x the memory of MOM6 and 4x that of NEMO 3.4. Would be very nice to sort this out.
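In case it helps anyone gather that data, a minimal massif run might look like the following. Sketch only: the executable name is a placeholder, and because valgrind slows execution considerably, a small test configuration on a few ranks is assumed.

    # run a few ranks under massif; executable name is hypothetical
    mpirun -np 4 valgrind --tool=massif --pages-as-heap=yes ./fms_MOM_SIS.x

    # summarise the heap snapshots for one rank afterwards
    ms_print massif.out.<pid>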
I'll start a new issue for the FMS submodule task
Adele has had 2 crashes in the last dozen or so runs, and both have the same symptomatology, which is dying in the atmosphere initialisation with no stack trace:
Entering coupler_init at 20170625 085036.991
Starting initializing ensemble_manager at 20170625 085037.167
Finished initializing ensemble_manager at 20170625 085037.450
Starting to initialize diag_manager at 20170625 085038.499
Finished initializing diag_manager at 20170625 085038.888
Starting to initialize tracer_manager at 20170625 085039.179
Finished initializing tracer_manager at 20170625 085039.186
Starting to initialize coupler_types at 20170625 085039.186
Finished initializing coupler_types at 20170625 085039.187
Beginning to initialize component models at 20170625 085039.187
Starting to initialize atmospheric model at 20170625 085039.227
MXM: Got signal 15 (Terminated)
MXM: Got signal 15 (Terminated)
MXM: Got signal 15 (Terminated)
MXM: Got signal 15 (Terminated)
MXM: Got signal 15 (Terminated)
MXM: Got signal 15 (Terminated)
MXM: Got signal 15 (Terminated)
MXM: Got signal 15 (Terminated)
This is the last bit of mom.out:
NOTE from PE 0: grid_mod/get_grid_cell_vertices: domain is not present, global data will be read
NOTE from PE 0: MPP_IO_SET_STACK_SIZE: stack size set to 38892601.
--------------------------------------------------------------------------
mpirun noticed that process rank 2176 with PID 0 on node r719 exited on signal 9 (Killed).
--------------------------------------------------------------------------
Hi @aidanheerdegen, was there any conclusion regarding your comment above? Is this on the 0.1 deg? I reported an issue on upstream FMS that may be relevant: https://github.com/NOAA-GFDL/FMS/issues/25
Yes, it is the 0.1 deg models. I haven't heard any more, but I assume it is still an issue. This is an example where having FMS as a submodule (https://github.com/mom-ocean/MOM5/issues/179) would make it simpler to check whether we have this change in our code branch. I'm assuming we don't, judging from this:
https://github.com/mom-ocean/MOM5/blob/master/src/shared/mpp/include/mpp_io_write.inc#L1207
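One way to check would be to diff the vendored file against upstream directly. Sketch only: the remote name and the upstream path mpp/include/mpp_io_write.inc are assumptions about the FMS repository layout.

    # fetch upstream FMS into a MOM5 clone and diff the vendored copy against it
    git remote add fms-upstream https://github.com/NOAA-GFDL/FMS.git
    git fetch fms-upstream
    git diff fms-upstream/master:mpp/include/mpp_io_write.inc master:src/shared/mpp/include/mpp_io_write.inc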