
Projecting 1024^3 Enzo runs out of memory.

Open dcollins4096 opened this issue 6 years ago • 7 comments

Bug report

Bug summary

Trying to project in parallel causes an out-of-memory error. The test case is a 1024^3 unigrid Enzo simulation, run on 4096 cores/tiles: adiabatic MHD with Dedner cleaning and random forcing. The dataset has 16 fields, so one snapshot is 129 GB.
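(As a sanity check, the quoted snapshot size is consistent with simple arithmetic, assuming 64-bit fields:)

```python
import numpy as np

# Back-of-the-envelope check of the snapshot size quoted above:
# 1024^3 cells x 16 fields x 8 bytes (float64) per value.
cells = 1024 ** 3
fields = 16
bytes_per_value = np.dtype("float64").itemsize

total_bytes = cells * fields * bytes_per_value
print(total_bytes / 2**30)  # ~128 GiB, consistent with the ~129 GB figure
```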

Code for reproduction

% ibrun -np 2 python parallel_test.py

parallel_test.py:
from mpi4py import MPI
import yt
yt.enable_parallelism()

ds = yt.load("/scratch/00369/tg456484/Paper49d_moresims/ze01_M10_MA1_1024_quan/DD0095/data0095")
p = yt.ProjectionPlot(ds, "x", "density")
p.save()

The parameter file is equivalent to run/MHD/3D/StochasticForcing/StochasticForcing.enzo with DualEnergy turned off.

Actual outcome

A representative sample of the output is below; I'll edit this to add a paste of the whole stack trace once I've properly captured it.


  File "/home1/00369/tg456484/local-yt-2019-02-22-py3/yt-conda/lib/python3.7/site-packages/yt/visualization/plot_window.py", line 1480, in __init__
    max_level=max_level)
  File "/home1/00369/tg456484/local-yt-2019-02-22-py3/yt-conda/lib/python3.7/site-packages/yt/data_objects/construction_data_containers.py", line 270, in __init__
    self.get_data(field)
  File "/home1/00369/tg456484/local-yt-2019-02-22-py3/yt-conda/lib/python3.7/site-packages/yt/data_objects/construction_data_containers.py", line 334, in get_data
    self._initialize_chunk(chunk, tree)
  File "/home1/00369/tg456484/local-yt-2019-02-22-py3/yt-conda/lib/python3.7/site-packages/yt/data_objects/construction_data_containers.py", line 401, in _initialize_chunk
    icoords = chunk.icoords
  File "/home1/00369/tg456484/local-yt-2019-02-22-py3/yt-conda/lib/python3.7/site-packages/yt/data_objects/data_containers.py", line 1555, in icoords
    return self._current_chunk.icoords
  File "/home1/00369/tg456484/local-yt-2019-02-22-py3/yt-conda/lib/python3.7/site-packages/yt/geometry/geometry_handler.py", line 271, in cached_func
    tr = func(self)
  File "/home1/00369/tg456484/local-yt-2019-02-22-py3/yt-conda/lib/python3.7/site-packages/yt/geometry/geometry_handler.py", line 332, in icoords
    ci = np.empty((self.data_size, 3), dtype='int64')
P006 yt : [ERROR    ] 2019-02-22 13:03:36,145 MemoryError:
  File "parallel_test.py", line 13, in <module>
    p = yt.ProjectionPlot(ds, "x", "density")
  File "/home1/00369/tg456484/local-yt-2019-02-22-py3/yt-conda/lib/python3.7/site-packages/yt/visualization/plot_window.py", line 1480, in __init__
    max_level=max_level)
  File "/home1/00369/tg456484/local-yt-2019-02-22-py3/yt-conda/lib/python3.7/site-packages/yt/data_objects/construction_data_containers.py", line 270, in __init__
    self.get_data(field)
  File "/home1/00369/tg456484/local-yt-2019-02-22-py3/yt-conda/lib/python3.7/site-packages/yt/data_objects/construction_data_containers.py", line 334, in get_data
    self._initialize_chunk(chunk, tree)
  File "/home1/00369/tg456484/local-yt-2019-02-22-py3/yt-conda/lib/python3.7/site-packages/yt/data_objects/construction_data_containers.py", line 401, in _initialize_chunk
    icoords = chunk.icoords
  File "/home1/00369/tg456484/local-yt-2019-02-22-py3/yt-conda/lib/python3.7/site-packages/yt/data_objects/data_containers.py", line 1555, in icoords
    return self._current_chunk.icoords
  File "/home1/00369/tg456484/local-yt-2019-02-22-py3/yt-conda/lib/python3.7/site-packages/yt/geometry/geometry_handler.py", line 271, in cached_func
    tr = func(self)
  File "/home1/00369/tg456484/local-yt-2019-02-22-py3/yt-conda/lib/python3.7/site-packages/yt/geometry/geometry_handler.py", line 332, in icoords
    ci = np.empty((self.data_size, 3), dtype='int64')
P007 yt : [ERROR    ] 2019-02-22 13:03:36,145 MemoryError:
P002 yt : [ERROR    ] 2019-02-22 13:03:36,200 Error occured on rank 2.
P004 yt : [ERROR    ] 2019-02-22 13:03:36,200 Error occured on rank 4.
P000 yt : [ERROR    ] 2019-02-22 13:03:36,200 Error occured on rank 0.
P001 yt : [ERROR    ] 2019-02-22 13:03:36,200 Error occured on rank 1.
P003 yt : [ERROR    ] 2019-02-22 13:03:36,200 Error occured on rank 3.
P005 yt : [ERROR    ] 2019-02-22 13:03:36,200 Error occured on rank 5.
P006 yt : [ERROR    ] 2019-02-22 13:03:36,200 Error occured on rank 6.
P007 yt : [ERROR    ] 2019-02-22 13:03:36,200 Error occured on rank 7.
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

Expected outcome

The expected outcome is a projection plot, produced faster than a serial run.

Version Information

This was run on Stampede2, on one or two idev nodes: Python 3.7.2, gcc 7.3.0, yt 3.5.0. The full install was done on 2019-02-22.

dcollins4096 avatar Feb 22 '19 22:02 dcollins4096

I believe @atmyers ran into a similar issue at one point and might have thoughts.

ngoldbaum avatar Feb 22 '19 22:02 ngoldbaum

I found that turning off mask caching helped with memory usage:

diff --git a/yt/data_objects/grid_patch.py b/yt/data_objects/grid_patch.py
index 966f0a068..e826e4f78 100644
--- a/yt/data_objects/grid_patch.py
+++ b/yt/data_objects/grid_patch.py
@@ -42,7 +42,7 @@ class AMRGridPatch(YTSelectionContainer):
     _num_ghost_zones = 0
     _grids = None
     _id_offset = 1
-    _cache_mask = True
+    _cache_mask = False
 
     _type_name = 'grid'
     _skip_add = True

atmyers avatar Feb 22 '19 22:02 atmyers

That was successful, thanks! d.


dcollins4096 avatar Feb 25 '19 14:02 dcollins4096

If this has to do with boolean masks, it might be useful to consider numpy.packbits to reduce their size -- I came across this function semi-recently.

AshKelly avatar Mar 07 '19 17:03 AshKelly
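(For illustration, a minimal sketch of the packbits suggestion: packing a boolean selection mask into bits gives roughly an 8x memory reduction, at the cost of unpacking when the mask is used. The mask here is synthetic, not yt's.)

```python
import numpy as np

# Pack a boolean selection mask into bits with numpy.packbits:
# one byte per cell becomes one bit per cell (~8x smaller).
rng = np.random.default_rng(0)
mask = rng.random(64 ** 3) < 0.5      # one bool (1 byte) per cell
packed = np.packbits(mask)            # one bit per cell

print(mask.nbytes, packed.nbytes)     # 262144 vs 32768 bytes

# Round-trip: unpack and compare against the original mask.
restored = np.unpackbits(packed).astype(bool)
assert np.array_equal(mask, restored)
```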

@AshKelly Ah, yeah, that's a good idea -- that would indeed reduce it. In yt-4.0 we have EWAH bitmaps, which do some really neat encoding and packing of bitmaps to speed things up, but either approach might take a bit of work.

I suspect, but am not 100% sure, that the biggest memory cost is the ires and icoords information. One way to reduce it would be to change how the projection manages chunks, allowing a special case where the coordinates are generated by some functional form, so they would not all have to be materialized up front.

I think we should leave this issue open, or possibly open one explicitly about reducing memory overhead, as it is a real problem -- and one that only showed up when we made things more general!

matthewturk avatar Mar 07 '19 18:03 matthewturk
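(A hypothetical sketch of the "functional form" idea above: for a unigrid chunk, the integer cell coordinates are fully determined by the chunk's start index and shape, so they could be generated on demand instead of stored. The function name and signature here are illustrative, not yt API.)

```python
import numpy as np

def chunk_icoords(start, shape):
    """Generate (N, 3) int64 cell indices for a rectangular chunk,
    rather than caching them -- the chunk is described by two tuples."""
    axes = [np.arange(s, s + n, dtype="int64") for s, n in zip(start, shape)]
    grid = np.meshgrid(*axes, indexing="ij")
    return np.stack([g.ravel() for g in grid], axis=-1)

# A 2x3x2 chunk starting at cell (4, 0, 2):
ic = chunk_icoords((4, 0, 2), (2, 3, 2))
print(ic.shape)  # (12, 3)
```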

IIRC, the mask was getting allocated to cover the entire domain on every process, not just for the subset of the domain that the process owned. Even with packing, this will blow out memory eventually for large, unigrid datasets.

atmyers avatar Mar 07 '19 18:03 atmyers
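(A rough illustration of why a full-domain mask per rank is fatal here: the cost per node scales with ranks per node, independent of how the domain is decomposed. The 48 ranks/node figure is an assumption, e.g. one Stampede2 Skylake node.)

```python
# One boolean mask over the full 1024^3 domain, one byte per cell.
cells = 1024 ** 3
mask_bytes = cells           # ~1 GiB per rank
ranks_per_node = 48          # assumption: ranks packed one per core

per_node = mask_bytes * ranks_per_node
print(per_node / 2**30)      # ~48 GiB of masks per node, before any field data
```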

@atmyers Ahhh, I see. OK. I think this is something that will need to be addressed -- but we need to figure out how desperate the need is, and if it is blocked by anything else.

matthewturk avatar Mar 07 '19 18:03 matthewturk