Collective communication with DTD interface
Original report by Florent Lopez (Bitbucket: mermoz, GitHub: mermoz).
Before I start, I want to mention that I am currently using the Parsec topic/collective branch from Reazul's Parsec code (https://bitbucket.org/rhoque_icl/parsec-dtd-interface), merged with the master branch of Parsec. The Slate code I am working on is available here: https://bitbucket.org/mermoz/slate/.
I am essentially trying to run a Slate Cholesky factorization implemented on top of Parsec, originally based on Damien's code, which you can find here (https://bitbucket.org/icldistcomp/slate/, branch parsec_v4).
Although the code seems to work well in multicore, it fails when run with 2 MPI processes. The issue appears to be in the collective communications, which are used in two places: first, sending the diagonal tile to the sub-diagonal tiles to perform a trsm solve, implemented in the parsec_tileSend routine (in BaseMatrix.hh); second, sending the sub-diagonal tiles to the trailing matrix for a gemm/syrk update, implemented in parsec_async_tileSend (also in BaseMatrix.hh). The two implementations differ slightly in that the first is synchronous and the second is asynchronous. I still do not fully understand these two implementations, as they rely on relatively low-level Parsec routines, and so far I have been trying to use these routines without success. The issue I currently encounter is an assert that is raised with the following message:
test: /home/flopez/Runtime/parsec-dtd-interface/parsec/remote_dep_mpi.c:2053: remote_dep_mpi_recv_activate: Assertion 'length == *position' failed.
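For context, the broadcast pattern these two routines implement is the one of the classic right-looking tiled Cholesky, where the factored diagonal tile fans out to the whole panel, and each panel tile fans out to the trailing matrix. A minimal self-contained sketch of that pattern (the tile handle and kernel names below are placeholders for illustration, not the actual Slate or Parsec calls):

```cpp
#include <cstdio>

// Placeholder tile handle and kernels, for illustration only; the real code
// inserts Parsec DTD tasks operating on Slate tiles.
struct Tile { int i, j; };
static Tile tile(int i, int j) { return {i, j}; }
static void potrf(Tile a)        { std::printf("potrf A(%d,%d)\n", a.i, a.j); }
static void trsm(Tile a, Tile b) { std::printf("trsm  A(%d,%d) -> A(%d,%d)\n", a.i, a.j, b.i, b.j); }
static void syrk(Tile a, Tile c) { std::printf("syrk  A(%d,%d) -> A(%d,%d)\n", a.i, a.j, c.i, c.j); }
static void gemm(Tile a, Tile b, Tile c)
{ std::printf("gemm  A(%d,%d), A(%d,%d) -> A(%d,%d)\n", a.i, a.j, b.i, b.j, c.i, c.j); }

int main() {
    const int nt = 4;                        // number of tile rows/columns
    for (int k = 0; k < nt; ++k) {
        potrf(tile(k, k));                   // factor the diagonal tile

        // One-to-many: A(k,k) goes to every sub-diagonal tile of column k
        // (the role of the synchronous parsec_tileSend).
        for (int m = k + 1; m < nt; ++m)
            trsm(tile(k, k), tile(m, k));

        // One-to-many: each A(m,k) goes to the trailing matrix
        // (the role of the asynchronous parsec_async_tileSend).
        for (int m = k + 1; m < nt; ++m) {
            syrk(tile(m, k), tile(m, m));
            for (int n = k + 1; n < m; ++n)
                gemm(tile(m, k), tile(n, k), tile(m, n));
        }
    }
    return 0;
}
```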
Ideally, what would be nice to have in Parsec is the ability to call a routine that simply takes a source tile and a set of destination tiles (along with some layout information, perhaps) and handles the collective communication for me. I believe that in this case Parsec would have enough information to determine which processes should be involved in the communication. At the moment this information has to be supplied by the user, along with setting up complex data structures that are not documented at all, such as parsec_remote_deps_t, remote_dep_output_param, etc.
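To make the request concrete, here is a rough sketch of the kind of helper I have in mind; the name parsec_dtd_tile_broadcast, its parameter list, and even the exact type names are invented for illustration and do not exist in Parsec today:

```cpp
// Hypothetical helper -- this routine does NOT exist in Parsec; everything
// below is an assumption made to illustrate the request.
//
//   src_tile  : the tile to broadcast (e.g. the diagonal tile A(k,k))
//   dst_tiles : the tiles that should receive it (e.g. A(k+1,k) .. A(nt-1,k))
//   layout    : whatever datatype/layout description the runtime needs
//
// Parsec would derive the set of participating ranks from the owners of the
// source and destination tiles, instead of the user filling in
// parsec_remote_deps_t / remote_dep_output_param by hand.
int parsec_dtd_tile_broadcast(parsec_taskpool_t  *tp,
                              parsec_dtd_tile_t  *src_tile,
                              parsec_dtd_tile_t **dst_tiles,
                              int                 ndst,
                              void               *layout);
```

With something like this, one call after the potrf on A(k,k) would replace the hand-written parsec_tileSend loop over the sub-diagonal tiles of column k, and similarly for the trailing-matrix update.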
Many Thanks,
Florent