Datatype tiling for large communication
Description Heat provides various wrapped MPI calls to transmit data between processors (e.g. in resplit()). If the buffer of such a transmission is too large, i.e. its element count exceeds the int32 value range, MPI will quit with an error.
Could perhaps be fixed using the tiling implementation of the QR branch.
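One way to read "datatype tiling" is to group elements into blocks so that the count handed to MPI stays within int32. Below is a minimal sketch of only the arithmetic, under that assumption. `tile_count` is a hypothetical helper, not part of Heat or mpi4py, and this is not necessarily what the QR branch does. With mpi4py, a block of `block_len` elements could then be described by a contiguous derived datatype (e.g. via `Datatype.Create_contiguous`), with the remainder sent separately.

```python
INT32_MAX = 2**31 - 1  # largest value of a 32-bit signed count


def tile_count(n_elements, max_count=INT32_MAX):
    """Split n_elements into (n_blocks, block_len, remainder).

    Guarantees n_blocks * block_len + remainder == n_elements and that each
    of the three values is at most max_count.
    Hypothetical helper for illustration only.
    """
    if n_elements < 0:
        raise ValueError("n_elements must be non-negative")
    if n_elements <= max_count:
        # Small message: one element per block, no tiling needed.
        return n_elements, 1, 0
    # Smallest block length for which the number of full blocks fits.
    block_len = -(-n_elements // max_count)  # ceiling division
    if block_len > max_count:
        raise ValueError("message too large for a single level of tiling")
    n_blocks, remainder = divmod(n_elements, block_len)
    return n_blocks, block_len, remainder


# A message one element beyond the int32 limit becomes 2**30 blocks of 2.
print(tile_count(INT32_MAX + 1))
```

A single level of tiling covers up to roughly `INT32_MAX ** 2` elements, which should be ample for buffers that fit in memory.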
To Reproduce Steps to reproduce the behavior:
- Which module/class/function is affected? communications.py
- What are the circumstances under which the bug appears? Several, e.g.:
a = ht.zeros(((INT32_MAX + 1) * processors, processors), split=0).resplit(1)
- What is the exact error message/erroneous behaviour? Depends on the MPI implementation
Expected behavior No MPI error
Version Info any
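To see why the reproducer above hits the limit, here is a rough sketch of the message size per process pair. It assumes that going from split=0 to split=1 is a plain all-to-all exchange and that both dimensions divide evenly by the number of processes; Heat's actual resplit implementation may differ. `resplit_block_elements` is a hypothetical helper.

```python
INT32_MAX = 2**31 - 1


def resplit_block_elements(global_shape, n_procs):
    """Elements one process sends to one peer for split=0 -> split=1.

    Assumes a 2-D array, an all-to-all exchange, and dimensions that are
    divisible by n_procs. Hypothetical helper for illustration only.
    """
    rows, cols = global_shape
    if rows % n_procs or cols % n_procs:
        raise ValueError("shape must be divisible by n_procs in this sketch")
    # Each process owns rows/n_procs rows and hands cols/n_procs columns
    # of them to every peer.
    return (rows // n_procs) * (cols // n_procs)


for processors in (2, 4):
    shape = ((INT32_MAX + 1) * processors, processors)
    block = resplit_block_elements(shape, processors)
    print(processors, block, block > INT32_MAX)
```

Under these assumptions every pairwise block holds exactly `INT32_MAX + 1` elements, one more than a 32-bit count can express. For a fixed global shape the block shrinks as the number of processes grows, but in this reproducer the shape scales with `processors`, so the overflow persists.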
Small update on this one: PR #520 has a new tiling class. Theoretically, it could be modified to cope with this by sending only partial tiles, although that may require a fair number of changes.
Relevant in principle, although not of the highest priority, because this problem can usually be solved by increasing the number of processes. (Reviewed within #1109)
My question @Markus-Goetz: should this issue address the wrappers for the MPI operations (i.e. heat.comm.Send() performs several mpi4py.MPI.comm.Send() calls if the data to send is too large), or shall we rather adapt the usage of heat.comm.Send() in those algorithms where potentially large data are sent? The first idea sounds more elegant; however, w.r.t. #383 the second option may allow better refactoring of algorithms, including overlap of communication and computation.
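For illustration, the first option (the wrapper splits large sends itself) could look roughly like the sketch below. `chunked_send` and `RecordingComm` are hypothetical names, not Heat API. The communicator is duck-typed: anything with a `Send(buf, dest, tag)` method works, so an mpi4py communicator should fit as long as `buf` is a flat, contiguous buffer. The receiving side would need a matching sequence of receives, which is not shown.

```python
INT32_MAX = 2**31 - 1


def chunked_send(comm, buf, dest, tag=0, max_count=INT32_MAX):
    """Send a flat buffer in pieces of at most max_count elements.

    comm: any object with a Send(buf, dest=..., tag=...) method.
    buf:  supports len() and slicing (e.g. bytearray, 1-D NumPy array).
    Returns the number of Send calls issued.
    Hypothetical helper for illustration only.
    """
    n = len(buf)
    n_calls = 0
    # max(n, 1) makes an empty buffer still produce one (empty) message,
    # so a receiver waiting for it is not left hanging.
    for start in range(0, max(n, 1), max_count):
        comm.Send(buf[start:start + max_count], dest=dest, tag=tag)
        n_calls += 1
    return n_calls


class RecordingComm:
    """Stand-in communicator that records message sizes instead of sending."""

    def __init__(self):
        self.sent = []

    def Send(self, buf, dest, tag=0):
        self.sent.append((len(buf), dest, tag))


comm = RecordingComm()
calls = chunked_send(comm, bytearray(10), dest=1, max_count=4)
print(calls, [size for size, _, _ in comm.sent])  # 3 [4, 4, 2]
```

The trade-off raised in the question remains: this keeps callers unchanged, but the chunks are sent back to back with blocking calls, so it offers no handle for overlapping communication with computation.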