
Topic/cuda aware communications

Open · bosilca opened this issue · 16 comments

Add support for sending and receiving data directly from and to devices. There are a few caveats (noted in the commit log).

Note: because it includes the span renaming, this PR changes the public API and will need a version bump to 5.x.

  1. The first question is: how is such a device selected?

The allocation of such a copy happens well before the scheduler is invoked for a task, in fact before the task is even ready. Thus, we need to decide on the location of this copy based only on static information, such as the task affinity. Therefore, this approach only works for owner-compute types of tasks, where the task will be executed on the device that owns the data used for the task affinity (a hypothetical sketch of this selection follows the list below).

  2. Pass the correct data copy across the entire system, instead of falling back to the data copy of device 0 (CPU memory).
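
As a rough illustration of the owner-compute selection described in point 1, here is a minimal sketch in C. The types and the helper name are hypothetical stand-ins, not the actual PaRSEC API; the point is only that the device is derived from static affinity information, before the scheduler ever sees the task.

/* Hypothetical sketch: choose the device for a new data copy from the task
 * affinity alone (owner-compute), long before the task becomes ready. */
#include <stddef.h>

typedef struct { int owner_device; } affinity_data_t;    /* stand-in for the affinity data */
typedef struct { affinity_data_t *affinity; } task_t;    /* stand-in for a not-yet-ready task */

int select_device_for_copy(const task_t *task)
{
    if (NULL != task->affinity)
        return task->affinity->owner_device;   /* run (and allocate) where the owner data lives */
    return 0;                                  /* no affinity known: fall back to device 0 (CPU) */
}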

TODOs

  • [x] rebase on c11 atomic fix
  • [x] Add a configure option to enable GPU-aware communications.
  • [x] Add a runtime configuration to turn on/off the gpu-aware comms?
  • [x] Pass -g 2 tests
  • [ ] Failure with ctest get_best_device: scheduling.c:157: int __parsec_execute(parsec_execution_stream_t *, parsec_task_t *): Assertion `NULL != copy->original && NULL != copy->original->device_copies[0]' failed.
  • [ ] Failure with ctest nvlink and stress (segfault); details of why (it's because of using NEW): https://github.com/ICLDisco/parsec/pull/671#issuecomment-2588223287
  • [ ] Failure with ctest stage (presumably identical to the intermittent failure in gemm/potrf): device_gpu.c:2470: int parsec_device_kernel_epilog(parsec_device_gpu_module_t *, parsec_gpu_task_t *): Assertion `PARSEC_DATA_STATUS_UNDER_TRANSFER == cpu_copy->data_transfer_status' failed.
  • [ ] Read-only data between tasks may hit an assert when doing D2D between devices that do not have peer access to each other (relevant only for unusual configurations)
  • [x] POTRF crashes on Frontier (due to using uninitialized/after-free device_copies, see https://github.com/ICLDisco/parsec/pull/671/commits/8976f0b00bdc375171b4d4bd07f39e5744773370)
  • [x] readers values are miscounted when 2 or more GPUs are used per rank (https://github.com/ICLDisco/parsec/pull/671#issuecomment-2408016948)

bosilca • Sep 10 '24

This crashes in dpotrf: an original used in the debug output during RELEASE_DEP_OUTPUT (generated code for T) is not initialized (neither NULL nor valid).

"testing_dpotrf" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffe757fa000 (LWP 1529612)]
parsec_task_snprintf (str=0x7ffe757ef0a0 "potrf_dtrsm(0, 1)[0, 1[, 1][, 5]]<45> keys = {*", size=128, task=0x7ffe757ef200) at /home/bouteill/parsec/dplasma-master/parsec/parsec/parsec.c:1952
1952                index += snprintf(str + index, size - index, "%s%lx", prefix, task->data[i].data_in->original->key);
(gdb) bt
#0  parsec_task_snprintf (str=0x7ffe757ef0a0 "potrf_dtrsm(0, 1)[0, 1[, 1][, 5]]<45> keys = {*", size=128, task=0x7ffe757ef200) at /home/bouteill/parsec/dplasma-master/parsec/parsec/parsec.c:1952
#1  0x00007ffff56e6b03 in iterate_successors_of_dpotrf_U_potrf_dpotrf (es=0x7ffe5c000bf0, this_task=0x107d990, action_mask=989855747, ontask=0x7ffff4218207 <parsec_set_up_reshape_promise>, ontask_arg=0x7ffe757ef6a0)
    at /home/bouteill/parsec/dplasma-master/build.hip/src/dpotrf_U.c:6919
#2  0x00007ffff56e76a5 in release_deps_of_dpotrf_U_potrf_dpotrf (es=0x7ffe5c000bf0, this_task=0x107d990, action_mask=989855747, deps=0x0) at /home/bouteill/parsec/dplasma-master/build.hip/src/dpotrf_U.c:7046
#3  0x00007ffff56e8f84 in complete_hook_of_dpotrf_U_potrf_dpotrf (es=0x7ffe5c000bf0, this_task=0x107d990) at /home/bouteill/parsec/dplasma-master/build.hip/src/dpotrf_U.c:7801

(gdb) p * task->data[i].data_in->original                                                                                                                                                                                                 
Cannot access memory at address 0xeda034a1

abouteiller • Oct 02 '24

On Frontier, the initializer gdata->lock = PARSEC_ATOMIC_UNLOCKED; in zone_malloc_init does not compile.

[  0%] Building C object parsec/parsec/CMakeFiles/parsec-base-obj.dir/utils/zone_malloc.c.o
/ccs/home/bouteilla/parsec/dplasma/parsec/parsec/utils/zone_malloc.c:41:33: error: expected expression
   41 |     gdata->lock               = PARSEC_ATOMIC_UNLOCKED;
      |                                 ^
/ccs/home/bouteilla/parsec/dplasma/parsec/parsec/include/parsec/sys/atomic-c11.h:219:32: note: expanded from macro 'PARSEC_ATOMIC_UNLOCKED'
  219 | #define PARSEC_ATOMIC_UNLOCKED ATOMIC_FLAG_INIT
      |                                ^
/opt/cray/pe/cce/17.0.0/cce-clang/x86_64/lib/clang/17/include/stdatomic.h:171:26: note: expanded from macro 'ATOMIC_FLAG_INIT'
  171 | #define ATOMIC_FLAG_INIT { 0 }
      |                          ^
1 error generated.

I suspect we have a mish-mash of C11/stdatomic types, maybe a problem with include-file order; need to dig.

abouteiller • Oct 02 '24

I suspect we have a mish-mash of C11/stdatomic types, maybe a problem with include-file order; need to dig.

No, this is invalid C11. ATOMIC_FLAG_INIT can only be used for static initialization, not in a dynamic context.
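
A minimal C11 sketch of the rule, with illustrative names (zone and zone_init here are stand-ins, not the PaRSEC code): ATOMIC_FLAG_INIT only works as an initializer, and atomic_flag_clear() is the usual way to put an atomic_flag into a known state at runtime.

#include <stdatomic.h>

/* OK: ATOMIC_FLAG_INIT is a brace-enclosed initializer ({ 0 } with this
 * compiler), so it is valid only where an initializer is allowed. */
static atomic_flag static_lock = ATOMIC_FLAG_INIT;

struct zone { atomic_flag lock; };

void zone_init(struct zone *z)
{
    /* z->lock = ATOMIC_FLAG_INIT;   invalid: { 0 } is not an expression */
    atomic_flag_clear(&z->lock);     /* runtime alternative: clear the flag explicitly */
}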

devreal • Oct 02 '24

Now passing 1 GPU/node, 8-rank PTG POTRF. Sorry, I had to force-push; there were issues with rebasing on master.

abouteiller • Oct 11 '24

UCX_HANDLE_ERRORS=freeze PMIX_MCA_psec='' srun -wleconte -N1 -n1 --cpu-bind=v,ldoms --gpus-per-task=2 --pty gdb --args tests/testing_dpotrf -N $((10*1024)) -t $((8*256)) --nruns 1 -x -v=2 -g 2 -p 1 -c 4 -- --mca bind_threads 1

#4  0x00007fff9e7c9386 in __assert_fail () from /lib64/libc.so.6
#5  0x00007ffff4211481 in parsec_device_kernel_pop (gpu_device=0xee94b0, gpu_task=0x7ff048002ea0, gpu_stream=0xee9968)
    at /home/bouteill/parsec/dplasma-master/parsec/parsec/mca/device/device_gpu.c:2293
#6  0x00007ffff4210259 in parsec_device_progress_stream (gpu_device=0xee94b0, stream=0xee9968, progress_fct=0x7ffff4210f28 <parsec_device_kernel_pop>,
    task=0x7ff048002ea0, out_task=0x7ffffffe86e0) at /home/bouteill/parsec/dplasma-master/parsec/parsec/mca/device/device_gpu.c:2009
#7  0x00007ffff4212b41 in parsec_device_kernel_scheduler (module=0xee94b0, es=0x95f250, _gpu_task=0x3a529a0)
    at /home/bouteill/parsec/dplasma-master/parsec/parsec/mca/device/device_gpu.c:2706
#8  0x00007ffff56c5cac in hook_of_dpotrf_U_potrf_dtrsm_CUDA (es=0x95f250, this_task=0x7ff044006f00)
    at /home/bouteill/parsec/dplasma-master/build.cuda/src/dpotrf_U.c:6018
#9  0x00007ffff41e54ea in __parsec_execute (es=0x95f250, task=0x7ff044006f00) at /home/bouteill/parsec/dplasma-master/parsec/parsec/scheduling.c:182
#10 0x00007ffff41e617d in __parsec_task_progress (es=0x95f250, task=0x7ff044006f00, distance=1) at /home/bouteill/parsec/dplasma-master/parsec/parsec/scheduling.c:503
#11 0x00007ffff41e6cf5 in __parsec_context_wait (es=0x95f250) at /home/bouteill/parsec/dplasma-master/parsec/parsec/scheduling.c:794
#12 0x00007ffff41e72d0 in parsec_context_wait (context=0xea6a30) at /home/bouteill/parsec/dplasma-master/parsec/parsec/scheduling.c:1000
#13 0x0000000000405d53 in main (argc=46, argv=0x7ffffffea1a8) at /home/bouteill/parsec/dplasma-master/build.cuda/tests/testing_dpotrf.c:74
(gdb) list
2288                                         gpu_device->super.device_index, gpu_device->super.name,
2289                                         parsec_task_snprintf(tmp, MAX_TASK_STRLEN, this_task),
2290                                         gpu_copy, gpu_copy->super.super.obj_reference_count,
2291                                         i, original, current_readers);
2292                }
2293                assert(current_readers >= 0);
2294                if( (0 == current_readers) && !(flow->flow_flags & PARSEC_FLOW_ACCESS_WRITE) ) {
2295                     PARSEC_DEBUG_VERBOSE(20, parsec_gpu_output_stream,
2296                                         "GPU[%d:%s]:\tMake read-only copy %p [ref_count %d] available on flow %s",
2297                                         gpu_device->super.device_index, gpu_device->super.name, gpu_copy, gpu_copy->super.super.obj_reference_count, flow->name);

abouteiller • Oct 11 '24

I think we need to create a CI test that targets gpu_nvidia and issues the job to that runner, correct?

G-Ragghianti • Jan 13 '25

Failure in stress (and a similar one in nvlink) due to the code generating a pushout event when transferring the last tile along the GEMM -> DISCARD_C flow (m >= mt+1). This tile has no original->device_copies[0] because it was created directly, without a backing DC (from a NEW in MAKE_C).

See further discussion in https://github.com/ICLDisco/parsec/pull/671#discussion_r1972238352

d@00000 GPU[1:cuda(0)]: Retrieve data (if any) for GEMM(79, 0, 0)[79, 0, 0]<0> keys = {4f, f000000000000001, 4f} {tp: 2} @parsec_device_kernel_scheduler:2719
d@00000 GPU[1:cuda(0)]: Try to Pop GEMM(79, 0, 0)[79, 0, 0]<0> keys = {4f, f000000000000001, 4f} {tp: 2} @parsec_device_kernel_pop:2264
d@00000 GPU[1:cuda(0)]: read copy 0x7ff06462f970 [ref_count 1] on flow A has readers (1) @parsec_device_kernel_pop:2323
d@00000 GPU[1:cuda(0)]: read copy 0x7ff064002c10 [ref_count 2] on flow C has readers (0) @parsec_device_kernel_pop:2323
d@00000 GPU[1:cuda(0)]: OUT Data copy 0x7ff064002c10 [ref_count 2] for flow C @parsec_device_kernel_pop:2330
Process 2891337 stopped
* thread #11, name = 'stress', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x70)
    frame #0: 0x00007ffff7eaa89b libparsec.so.4`parsec_device_kernel_pop(gpu_device=0x0000555555f7b7b0, gpu_task=0x00007ff06462a8c0, gpu_stream=0x0000555555f7bc68) at device_gpu.c:2341:17
   2338             if( gpu_task->pushout & (1 << i) ) {
   2339                 /* TODO: make sure no readers are working on the CPU version */
   2340                 original = gpu_copy->original;
-> 2341                 PARSEC_DEBUG_VERBOSE(10, parsec_gpu_output_stream,
   2342                                     "GPU[%d:%s]:\tMove D2H data <%s:%x> copy %p [ref_count %d] -- D:%p -> H:%p requested",
   2343                                     gpu_device->super.device_index, gpu_device->super.name, flow->name, original->key, gpu_copy, gpu_copy->super.super.obj_reference_count,
   2344                                      (void*)gpu_copy->device_private, original->device_copies[0]->device_private);

~Potential fix: allocate a dev0 copy like is done for the network-received tiles; not sure why that does not happen already.~
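
For reference, a purely defensive variant of the crashing debug statement, assuming only that original->device_copies[0] can legitimately be NULL for NEW-created tiles; this would silence the segfault in the verbose path but not address the spurious pushout itself.

/* Sketch: guard the D2H debug message against tiles without a CPU copy
 * (created from NEW, no backing DC); cpu_copy is a local introduced here. */
parsec_data_copy_t *cpu_copy = original->device_copies[0];
PARSEC_DEBUG_VERBOSE(10, parsec_gpu_output_stream,
                    "GPU[%d:%s]:\tMove D2H data <%s:%x> copy %p [ref_count %d] -- D:%p -> H:%p requested",
                    gpu_device->super.device_index, gpu_device->super.name, flow->name, original->key,
                    gpu_copy, gpu_copy->super.super.obj_reference_count,
                    (void*)gpu_copy->device_private,
                    (NULL == cpu_copy) ? NULL : cpu_copy->device_private);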

abouteiller • Jan 13 '25

I get this error with TTG when constraining the device memory to 10%. The copy is being pushed out when we try to change ownership to the device.

testing_dpotrf_hip-parsec: /ccs/home/jschuchart/src/ttg/ttg/build_frontier/_deps/parsec-src/parsec/data.c:423: parsec_data_start_transfer_ownership_to_copy: Assertion `data->device_copies[i]->data_transfer_status != PARSEC_DATA_STATUS_UNDER_TRANSFER' failed.

devreal • Feb 22 '25

I merged with master, but apparently I missed a defect in the error-case printouts, which causes the CI failures.

abouteiller • Feb 24 '25

I merged with master, but apparently I missed a defect in the error-case printouts, which causes the CI failures.

The CI failure is due to using an uninitialized data_in in the new context nc during ontask, introduced by the changes to task_snprintf in 108b7781b.

abouteiller • Feb 26 '25

I get this error with TTG when constraining the device memory to 10%. The copy is being pushed out when we try to change ownership to the device.

testing_dpotrf_hip-parsec: /ccs/home/jschuchart/src/ttg/ttg/build_frontier/_deps/parsec-src/parsec/data.c:423: parsec_data_start_transfer_ownership_to_copy: Assertion `data->device_copies[i]->data_transfer_status != PARSEC_DATA_STATUS_UNDER_TRANSFER' failed.

Is this resolved with #733?

abouteiller • Mar 14 '25

I see dangling copies on the device. This might just require a fix in the cleanup code (ignoring data that only has the device copy):

[****] TIME(s)      7.35572 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  102044.227028 gflops - ENQ&PROG&DEST      7.75205 :   96827.061209 gflops - ENQ      0.39626 - DEST      0.00008
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe11ca00000) and it is discarding it!
[****] TIME(s)      6.82278 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  110015.002854 gflops - ENQ&PROG&DEST      6.82296 :  110012.229403 gflops - ENQ      0.00009 - DEST      0.00008
[****] TIME(s)      6.70474 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  111951.924658 gflops - ENQ&PROG&DEST      6.70490 :  111949.236353 gflops - ENQ      0.00009 - DEST      0.00007
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fded0a00000) and it is discarding it!
[****] TIME(s)      6.64373 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  112979.994900 gflops - ENQ&PROG&DEST      6.64388 :  112977.490061 gflops - ENQ      0.00009 - DEST      0.00006
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe11aa00000) and it is discarding it!
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe138a00000) and it is discarding it!
[****] TIME(s)      6.71053 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  111855.299517 gflops - ENQ&PROG&DEST      6.71068 :  111852.904738 gflops - ENQ      0.00009 - DEST      0.00005

devreal • Mar 26 '25

I get this error with TTG when constraining the device memory to 10%. The copy is being pushed out when we try to change ownership to the device.

testing_dpotrf_hip-parsec: /ccs/home/jschuchart/src/ttg/ttg/build_frontier/_deps/parsec-src/parsec/data.c:423: parsec_data_start_transfer_ownership_to_copy: Assertion `data->device_copies[i]->data_transfer_status != PARSEC_DATA_STATUS_UNDER_TRANSFER' failed.

Is this resolved with #733?

Yes, this is resolved.

devreal • Mar 26 '25

I see dangling copies on the device. This might just require a fix in the cleanup code (ignoring data that only has the device copy):

[****] TIME(s)      7.35572 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  102044.227028 gflops - ENQ&PROG&DEST      7.75205 :   96827.061209 gflops - ENQ      0.39626 - DEST      0.00008
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe11ca00000) and it is discarding it!
[****] TIME(s)      6.82278 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  110015.002854 gflops - ENQ&PROG&DEST      6.82296 :  110012.229403 gflops - ENQ      0.00009 - DEST      0.00008
[****] TIME(s)      6.70474 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  111951.924658 gflops - ENQ&PROG&DEST      6.70490 :  111949.236353 gflops - ENQ      0.00009 - DEST      0.00007
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fded0a00000) and it is discarding it!
[****] TIME(s)      6.64373 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  112979.994900 gflops - ENQ&PROG&DEST      6.64388 :  112977.490061 gflops - ENQ      0.00009 - DEST      0.00006
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe11aa00000) and it is discarding it!
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe138a00000) and it is discarding it!
[****] TIME(s)      6.71053 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  111855.299517 gflops - ENQ&PROG&DEST      6.71068 :  111852.904738 gflops - ENQ      0.00009 - DEST      0.00005

This warning is overcautious now that we have GPU-only copies created from the network. Ideally we would find a way to discriminate between real leaks in the application and these temporaries being reclaimed.
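
One possible shape for that discrimination, sketched as an assumption (the dc test, the field names, and the parsec_warning call are not verified against the current code): warn only when the dangling copy belongs to a data collection owned by the application, and reclaim the network-created GPU-only temporaries silently.

/* Hypothetical sketch: skip the warning for GPU-only temporaries without a
 * backing data collection; field and function names are assumptions. */
if (NULL != original->dc) {
    parsec_warning("GPU[%d:%s] still OWNS the master memory copy for data %d (%p) and it is discarding it!",
                   gpu_device->super.device_index, gpu_device->super.name,
                   (int)original->key, gpu_copy->device_private);
}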

abouteiller • Mar 28 '25

The original associated with these device-owned copies should not have a valid dc, should it?

bosilca • Mar 28 '25

Here is what I think happens in the stress benchmark:

  • We allocate a new C tile, which has a host and device copy.
  • At the end of the GEMM task, the device copy gets passed through the reshape code and a reference is added to the device copy because that is the copy that is captured in the parsec_dep_data_description_t:
    data.data   = this_task->data._f_A.data_out;
  • We release the host copy at the end of release_deps_of_stress_GEMM. However, the host copy only has a single reference (the data), so the host copy is detached from the parsec_data_t. The next time we use that parsec_data_t (in the next GEMM), the host copy is missing.

I don't understand the reshape code and I was hoping never to have to touch it. I suspect that the reshape code was not designed with GPUs in mind, but I could be wrong. I will need some help digging through this and figuring out (a toy model of the reference-count issue is sketched after the list below):

  1. Whether capturing the device copy in reshape is correct.
  2. How we can make sure that the host copy is not accidentally released.
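
To make the suspected imbalance concrete, here is a small self-contained toy model in plain C (hypothetical names, not PaRSEC code): the reshape path retains the device copy, release_deps drops the host copy's only reference, and the host copy gets detached, so the next consumer no longer finds it.

#include <stdio.h>

typedef struct { int refs; const char *name; } copy_t;
typedef struct { copy_t *host; copy_t *device; } data_t;   /* stand-in for parsec_data_t */

static void retain(copy_t *c) { c->refs++; }

static void release(copy_t **slot)
{
    if (--(*slot)->refs == 0) {          /* last reference: detach the copy from the data */
        printf("detaching %s copy\n", (*slot)->name);
        *slot = NULL;
    }
}

int main(void)
{
    copy_t host = { .refs = 1, .name = "host" }, dev = { .refs = 1, .name = "device" };
    data_t data = { .host = &host, .device = &dev };

    retain(data.device);     /* reshape captures data_out, i.e. the *device* copy */
    release(&data.host);     /* release_deps drops the host copy: refs 1 -> 0, detached */

    printf("host copy still attached: %s\n", data.host ? "yes" : "no");   /* prints "no" */
    return 0;
}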

devreal • Jul 14 '25