Recursive + GPU seems broken

Open therault opened this issue 1 year ago • 4 comments

Describe the bug

An assert triggers when running a PTG test that combines recursive and CUDA bodies.

To Reproduce

Steps to reproduce the behavior:

  1. Check out master (530601533fc80ad)
  2. Compile on a machine with CUDA, with the NOISIER and PARANOID debug options, CUDA support, and RECURSIVE support all enabled
  3. Compile and install dplasma master (https://github.com/ICLDisco/dplasma/commit/8b36497820401a4ceb7739c95d0d806e2c70e7b1) using this install of parsec
  4. Run test testing_spotrf -N 256 -t 128 -x 1 -g 1 on a machine with a GPU
  5. See the assert trigger.

Additional context

I tracked the issue, initially suspecting a versioning problem based on Aurelien's report. Instead, it appears that we first run the CUDA kernel of POTRF(1) and then proceed to run the RECURSIVE kernel of POTRF(1) as well. The second run messes up the status of the CUDA copies, which leads to the unexpected state caught by the assert.
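To make the symptom easier to spot in a long debug log, here is a minimal sketch of a checker that flags any task instance executed by more than one body. The trace format (pairs of task name and body type) and the function name are assumptions made for illustration; they are not PaRSEC's actual debug output or API, and the sample trace is hand-written to mirror the sequence described above.

```python
from collections import defaultdict


def find_double_executions(trace):
    """Return {task: [bodies...]} for every task run by more than one body.

    `trace` is an iterable of (task_name, body_type) pairs in execution
    order. This format is hypothetical, not PaRSEC's real log format.
    """
    bodies_by_task = defaultdict(list)
    for task, body in trace:
        bodies_by_task[task].append(body)
    # A task instance is expected to be executed by exactly one body.
    return {t: b for t, b in bodies_by_task.items() if len(b) > 1}


# Hand-written trace mirroring the reported sequence (not real output).
trace = [
    ("POTRF(0)", "CUDA"),
    ("POTRF(1)", "CUDA"),
    ("POTRF(1)", "RECURSIVE"),
]
print(find_double_executions(trace))
```

With the sample trace, only POTRF(1) is reported, with its CUDA and RECURSIVE bodies listed in execution order. Adapting this to a real run would require extracting the (task, body) pairs from the NOISIER debug output first.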

Followup tasks

  • [ ] When this issue is resolved, this PR needs to be undone: https://github.com/ICLDisco/dplasma/issues/84

therault, May 24 '23 21:05