
[Bug]: QR decomposition fails under certain circumstances

Open · mrfh92 opened this issue on Aug 25, 2022

What happened?

QR decomposition fails for a sufficiently large matrix or a sufficiently small number of processes when split=1. For split=0 this behaviour does not occur.

Example: The code snippet below produces an error when executed on 2 MPI processes, but not on 3 processes.

Observation: For a fixed matrix size, the error can be avoided by increasing the number of processes sufficiently. Nevertheless, for a fixed number of processes, the error occurs again as soon as the matrix is chosen sufficiently large. Moreover, if, as in the example, the matrix size is just large enough to produce the error for a given number of processes, the error can be avoided by changing the datatype from float64 to float32 (see the variant after the code snippet below).

Rough first guess: some problem with the size of the MPI messages, but I was not able to figure out the details.
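
To illustrate this guess, here is a minimal standalone mpi4py sketch (hypothetical, not code from heat) that produces the same MPI_ERR_TRUNCATE: the root broadcasts more elements than a receiving rank has posted a buffer for.

"""
Run on 2 MPI processes, e.g. `mpirun -n 2 python truncate_demo.py`.
"""
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Rank 0 broadcasts 10 doubles, but rank 1 only posts room for 5 of them;
# the incoming message does not fit into the receive buffer, so MPI
# truncates it and mpi4py raises MPI_ERR_TRUNCATE, as in the traceback below.
n = 10 if comm.Get_rank() == 0 else 5
buf = np.zeros(n, dtype=np.float64)
comm.Bcast(buf, root=0)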

Code snippet triggering the error

"""
Execute on 2 MPI processes 
"""

import heat as ht
from mpi4py import MPI

splitdim = 1 

A = ht.random.randn(100,33,dtype=ht.float64,split=splitdim)  
Q,R = ht.linalg.qr(A)
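
As a cross-check of the observations above, the following self-contained variant (hypothetical names A32, Q32, R32) avoids the error on 2 processes:

import heat as ht

# Same shape and split as above, but float32 instead of float64:
# no error on 2 processes. (Alternatively, the float64 version
# runs without error on 3 processes.)
A32 = ht.random.randn(100, 33, dtype=ht.float32, split=1)
Q32, R32 = ht.linalg.qr(A32)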

Error message or erroneous outcome

File "/home/***/heat/heat/core/linalg/qr.py", line 172, in qr
    __split1_qr_loop(dcol=dcol, r_tiles=r_tiles, q0_tiles=q_tiles, calc_q=calc_q)
  File "/home/***/heat/heat/core/linalg/qr.py", line 929, in __split1_qr_loop
    r_tiles.arr.comm.Bcast(q1, root=diag_process)
  File "/home/***/heat/heat/core/communication.py", line 727, in Bcast
    ret, sbuf, rbuf, buf = self.__broadcast_like(self.handle.Bcast, buf, root)
  File "/home/***/heat/heat/core/communication.py", line 714, in __broadcast_like
    return func(self.as_buffer(srbuf), root), srbuf, srbuf, buf
  File "mpi4py/MPI/Comm.pyx", line 691, in mpi4py.MPI.Comm.Bcast
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated

Version

1.2.x

Python version

3.8

PyTorch version

1.11

MPI version

OpenMPI

mrfh92 · Aug 25 '22 13:08

Hi @mrfh92 ,

thanks for reporting this!

I cannot reproduce the error on my machine. Please let me know what version of OpenMPI and mpi4py you're using, so I can try to recreate the same environment.

Cheers,

Claudia

ClaudiaComito · Aug 28 '22 02:08

Hi Claudia,

here are the versions I use:

mpi4py 3.1.3
OpenMPI 4.0.3
heat 1.2.0-dev
torch 1.11.0+cu102
Python 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]

Everything runs in a Python venv virtual environment on an Ubuntu 20.04.4 LTS system.
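
In case it helps, a small hypothetical snippet to collect all of these versions in one place:

import sys
import heat, torch, mpi4py
from mpi4py import MPI

print("heat", heat.__version__)
print("torch", torch.__version__)
print("mpi4py", mpi4py.__version__)
print("MPI:", MPI.Get_library_version().splitlines()[0])  # e.g. "Open MPI v4.0.3, ..."
print("Python", sys.version)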

Greetings, Fabian

mrfh92 · Aug 29 '22 07:08

Which branch are you on? Can you try this on the features/436-SVD-block-diag branch? There are some features there that are not fully functional, but there were some changes to the QR logic; I think one of them would fix this issue.

coquelin77 · Oct 26 '22 09:10

With the QR implementation from the features/436-SVD-block-diag branch I could not reproduce the error, so it looks like this really fixed the issue 👍

mrfh92 · Oct 27 '22 07:10

I'm closing this, as we either can't reproduce it or have fixed it inadvertently.

ClaudiaComito · Apr 27 '23 17:04