
AlltoAll for large problem

Open mloubout opened this issue 4 years ago • 10 comments

The AlltoAll calls for MPI make Devito crash for large problems.

mloubout avatar Jan 25 '20 01:01 mloubout

where (what python line) and reproducer

FabioLuporini avatar Jan 25 '20 14:01 FabioLuporini

error trace too would be nice

FabioLuporini avatar Jan 25 '20 14:01 FabioLuporini

  File "/usr/local/lib/python3.6/dist-packages/devito/operator/operator.py", line 520, in arguments
    args = self._prepare_arguments(**kwargs)
  File "/usr/local/lib/python3.6/dist-packages/devito/operator/operator.py", line 419, in _prepare_arguments
    args.update(p._arg_values(**kwargs))
  File "/usr/local/lib/python3.6/dist-packages/devito/types/sparse.py", line 287, in _arg_values
    values = new._arg_defaults(alias=self).reduce_all()
  File "/usr/local/lib/python3.6/dist-packages/devito/tools/memoization.py", line 91, in __call__
    res = cache[key] = self.func(*args, **kw)
  File "/usr/local/lib/python3.6/dist-packages/devito/types/sparse.py", line 267, in _arg_defaults
    for k, v in self._dist_scatter().items():
  File "/usr/local/lib/python3.6/dist-packages/devito/types/sparse.py", line 821, in _dist_scatter
    [scattered, rcount, rdisp, mpitype])
  File "mpi4py/MPI/Comm.pyx", line 676, in mpi4py.MPI.Comm.Alltoallv
  File "mpi4py/MPI/msgbuffer.pxi", line 592, in mpi4py.MPI._p_msg_cco.for_alltoall
  File "mpi4py/MPI/msgbuffer.pxi", line 456, in mpi4py.MPI._p_msg_cco.for_cco_recv
  File "mpi4py/MPI/msgbuffer.pxi", line 300, in mpi4py.MPI.message_vector
  File "mpi4py/MPI/asarray.pxi", line 22, in mpi4py.MPI.chkarray
  File "mpi4py/MPI/asarray.pxi", line 15, in mpi4py.MPI.getarray
OverflowError: value too large to convert to int
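The last frames of the trace suggest that the count/displacement lists passed to `Alltoallv` are being converted to an array of C `int`. A minimal sketch of that failure mode, using only the standard library (this is an illustration, not mpi4py's actual code path; `fits_in_c_int` is a hypothetical helper):

```python
import array
import ctypes

# Largest value a C int can hold on this platform (2**31 - 1 on common systems).
INT_MAX = 2 ** (8 * ctypes.sizeof(ctypes.c_int) - 1) - 1


def fits_in_c_int(counts):
    """Return True if every count can be stored in an array of C ints."""
    try:
        array.array("i", counts)  # typecode "i" is a signed C int
        return True
    except OverflowError:
        return False


small_ok = fits_in_c_int([1_000, 2_000])
large_ok = fits_in_c_int([INT_MAX + 1])
print(small_ok, large_ok)
```

Any per-rank count or displacement above `INT_MAX` would hit the same kind of error before MPI is even called.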

mloubout avatar Jan 26 '20 00:01 mloubout

command line to reproduce ? can you write an MFE? this seems to be due to the data distribution of SparseFunctions.

FabioLuporini avatar Jan 26 '20 15:01 FabioLuporini

command line to reproduce ? can you write an MFE?

Not really, just add a massive number of receivers in any example and at some point it will crash like that. All examples are set up for a tiny number of receivers, so it wouldn't pop up.

mloubout avatar Jan 26 '20 15:01 mloubout

Not really, just add a massive number of receivers in any example

so we should be able to write a 5-6 line MFE. I'll try to reproduce

FabioLuporini avatar Jan 26 '20 18:01 FabioLuporini

can we close this? @mloubout

FabioLuporini avatar Feb 06 '20 07:02 FabioLuporini

No, the PRs improved the set-up time for larger numbers of receivers (there are still issues with full-size 3D that I'm trying to track down), but this error is not related. It is due to the message size, so it will still happen. I'm trying to find a fix for that too.
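One possible direction, sketched here as an assumption rather than the fix adopted in Devito: MPI expresses counts and displacements in units of the datatype, so describing one receiver trace as a single contiguous block (for example via a derived datatype) keeps the numbers small. The helpers and the four-rank layout below are hypothetical:

```python
from itertools import accumulate

INT32_MAX = 2**31 - 1  # typical upper bound for a C int count


def element_counts(points_per_rank, samples_per_point):
    """Counts and displacements measured in individual samples."""
    counts = [n * samples_per_point for n in points_per_rank]
    displs = [0] + list(accumulate(counts))[:-1]
    return counts, displs


def block_counts(points_per_rank):
    """Counts and displacements measured in whole traces (one block per point)."""
    counts = list(points_per_rank)
    displs = [0] + list(accumulate(counts))[:-1]
    return counts, displs


# Hypothetical layout: 2M receivers spread evenly over 4 ranks, 10k time steps.
points_per_rank = [500_000] * 4
nt = 10_000

elem_counts, elem_displs = element_counts(points_per_rank, nt)
blk_counts, blk_displs = block_counts(points_per_rank)

print(max(elem_counts + elem_displs) > INT32_MAX)   # per-sample counts overflow
print(max(blk_counts + blk_displs) <= INT32_MAX)    # per-trace counts fit
```

This only works if each trace is contiguous in the send and receive buffers; otherwise the message would have to be split into several smaller exchanges.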

mloubout avatar Feb 06 '20 11:02 mloubout

@mloubout - you should not be expecting to see integer OverflowError unless you are running in the order of a couple of billion dof's. How large is your problem? If it really is that big then we have to ensure our indexing supports int64.

ggorman avatar Feb 07 '20 09:02 ggorman

unless you are running in the order of a couple of billion dof's

You don't need to go that big to be way over that. 3D receivers, OBN setup with reciprocity:

  • ~2M rec positions × ~10-20k time steps

And you have a couple tens of billions of samples.
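As a quick check of those numbers against the 32-bit integer limit (a sketch only; this is the global sample count, and the per-rank share depends on the decomposition):

```python
INT32_MAX = 2**31 - 1  # about 2.1 billion


def total_samples(n_receivers, n_time_steps):
    """Total number of receiver samples across all time steps."""
    return n_receivers * n_time_steps


low = total_samples(2_000_000, 10_000)
high = total_samples(2_000_000, 20_000)

print(f"{low:,} to {high:,} samples; int32 max is {INT32_MAX:,}")
print(f"smallest case exceeds int32 by a factor of about {low / INT32_MAX:.1f}")
```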

mloubout avatar Feb 07 '20 11:02 mloubout