Running Delft3DFM model with multiple workers and ERA5 external forcing crashes

Open JaroCamphuijsen opened this issue 1 year ago • 2 comments

When running the Delft3DFM model with multiple workers while reading ERA5 NetCDF files for global forcing through Delft3DFM's own forcing interface (instead of the ERA5 interface in OMUSE), the run crashes with the following error message:

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[53422,2],9]
  Exit code:    245

The problem seems to be that each Delft3DFM worker tries to load the same 10 GB NetCDF file (as specified in the ExtForceFile, the file with the .ext extension) into memory at the same time. With 32 workers, this would require 320 GB of memory.
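As a rough check of that figure, here is a minimal sketch of the arithmetic. It assumes every worker holds its own complete copy of the forcing file and ignores any other memory the workers use; the function name is hypothetical and not part of OMUSE or Delft3DFM.

```python
def total_forcing_memory_gb(file_size_gb, n_workers):
    """Memory needed if every worker keeps its own full copy of the forcing file.

    Hypothetical helper for illustration only; it ignores all other memory use.
    """
    if file_size_gb < 0 or n_workers < 1:
        raise ValueError("file size must be >= 0 and worker count >= 1")
    return file_size_gb * n_workers


# Numbers from this issue: a 10 GB file and 32 workers.
print(total_forcing_memory_gb(10, 32))  # 320
```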

This problem came up because we want to use different custom forcing files and not just the default ERA5 forcing.

JaroCamphuijsen, Sep 12 '23 10:09

A short-term solution is to split the large forcing files into small pieces so that the workers together do not exceed the available memory. However, all workers then still load the same files into memory. A better solution would be to prepare the external forcing before starting the actual workers. I suppose this problem also occurs in other model codes in environmental science because they use large input datasets, so a common approach might be worth thinking about.
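To get a feel for how small the pieces would need to be, here is a minimal sketch of the sizing. It assumes each worker holds only one piece in memory at a time, which has not been verified for Delft3DFM, and that the forcing data is the only significant memory use. The function name and the 64 GB memory figure are hypothetical.

```python
import math


def plan_split(file_size_gb, n_workers, available_memory_gb):
    """Return (max_piece_gb, n_pieces) so that one piece per worker fits in memory.

    Assumption: each worker loads a single piece at a time.
    """
    if file_size_gb <= 0 or available_memory_gb <= 0 or n_workers < 1:
        raise ValueError("sizes must be positive and worker count >= 1")
    # Every worker loads the same piece, so a piece may use at most
    # an equal share of the available memory.
    max_piece_gb = available_memory_gb / n_workers
    # Number of pieces needed to cover the whole file at that piece size.
    n_pieces = math.ceil(file_size_gb / max_piece_gb)
    return max_piece_gb, n_pieces


# 10 GB file and 32 workers from this issue; 64 GB of memory is a made-up example.
print(plan_split(10, 32, 64))  # (2.0, 5)
```

The actual splitting, for example along the time dimension of the NetCDF file, is not shown here and would still need a matching update of the .ext file.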

JaroCamphuijsen, Sep 12 '23 10:09

A better solution to this problem is in fact proposed in #82. It will not solve this specific problem (the crash upon specifying the external forcing in the .ext file), but this issue came up because we wanted to use slightly preprocessed ERA5 forcing files in combination with the Delft3D model, which is not possible using the ERA5 interface. However, if we create an interface for forcing data in general, we no longer have to use the external forcing capabilities of the Delft3D model itself and can do it all via OMUSE.

For other use cases this issue might still be relevant, and it should at least be documented that this use of Delft3DFM in OMUSE is not supported.

JaroCamphuijsen, Oct 11 '23 09:10