[Bug]: QR decomposition fails under certain circumstances
What happened?
QR decomposition fails for a sufficiently large matrix or a sufficiently small number of processes if split=1. For split=0 this behaviour does not occur.
Example: The code snippet below produces an error when executed on 2 MPI processes, but not on 3 processes.
Observation: For a fixed matrix size the error can be avoided by increasing the number of processes sufficiently. Nevertheless, for a fixed number of processes the error occurs again as soon as the matrix is chosen sufficiently large. Moreover, if, as in the example, the matrix size is just large enough to produce the error for a given number of processes, the error can be avoided by changing the datatype from float64 to float32.
Rough first guess: some problem with the size of the MPI messages (?), but I was not able to figure out the details.
Code snippet triggering the error
"""
Execute on 2 MPI processes
"""
import heat as ht
from mpi4py import MPI

splitdim = 1  # split=1 triggers the error; split=0 does not
A = ht.random.randn(100, 33, dtype=ht.float64, split=splitdim)
Q, R = ht.linalg.qr(A)
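As noted in the observation above, keeping everything else fixed but switching the datatype avoids the error. A minimal variant of the snippet, still on 2 processes (the variable names A32, Q32, R32 are mine, for illustration only):

A32 = ht.random.randn(100, 33, dtype=ht.float32, split=splitdim)  # float32 instead of float64
Q32, R32 = ht.linalg.qr(A32)  # completes without error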
Error message or erroneous outcome
File "/home/***/heat/heat/core/linalg/qr.py", line 172, in qr
__split1_qr_loop(dcol=dcol, r_tiles=r_tiles, q0_tiles=q_tiles, calc_q=calc_q)
File "/home/***/heat/heat/core/linalg/qr.py", line 929, in __split1_qr_loop
r_tiles.arr.comm.Bcast(q1, root=diag_process)
File "/home/***/heat/heat/core/communication.py", line 727, in Bcast
ret, sbuf, rbuf, buf = self.__broadcast_like(self.handle.Bcast, buf, root)
File "/home/***/heat/heat/core/communication.py", line 714, in __broadcast_like
return func(self.as_buffer(srbuf), root), srbuf, srbuf, buf
File "mpi4py/MPI/Comm.pyx", line 691, in mpi4py.MPI.Comm.Bcast
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
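For context, MPI_ERR_TRUNCATE is typically raised when a receiving rank posts a buffer smaller than the incoming message, which would fit the guess about mismatched message sizes above. A hypothetical minimal mpi4py example outside of heat that produces the same error (run on 2 processes; buffer sizes and names are made up for illustration):

"""
Hypothetical illustration, not heat code: rank 1 posts a receive buffer
smaller than the message broadcast by rank 0, which typically raises
MPI_ERR_TRUNCATE in OpenMPI. Execute on 2 MPI processes.
"""
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    buf = np.zeros(10, dtype=np.float64)  # root broadcasts 10 elements
else:
    buf = np.zeros(5, dtype=np.float64)   # receiver expects only 5
comm.Bcast(buf, root=0)  # raises mpi4py.MPI.Exception: MPI_ERR_TRUNCATE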
Version
1.2.x
Python version
3.8
PyTorch version
1.11
MPI version
OpenMPI
Hi @mrfh92 ,
thanks for reporting this!
I cannot reproduce the error on my machine. Please let me know what version of OpenMPI and mpi4py you're using, so I can try to recreate the same environment.
Cheers,
Claudia
Hi Claudia,
here are the versions I use:
mpi4py 3.1.3
OpenMPI 4.0.3
heat 1.2.0-dev
torch 1.11.0+cu102
Python 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
Everything is run in a Python venv virtual environment on an Ubuntu 20.04.4 LTS system.
Greetings, Fabian
What branch are you on? Can you try this on the features/436-SVD-block-diag
branch? There are some features there which are not fully functional, but there were some changes to the qr logic; I think one of them might fix this issue.
With the qr implementation from the features/436-SVD-block-diag branch I could not reproduce the error, so it looks like this really fixed the issue 👍
I'm closing this, since we either can't reproduce it anymore or have inadvertently fixed it.