Adding dplasma CI as tester for parsec
~~Need this PR for testing, do not merge.~~
This adds a GitHub workflow YAML that tests parsec via dplasma. The YAML comes from the dplasma CI with slight changes. From the checked-out parsec source, we clone dplasma and run the dplasma CI tests from that subdirectory. To use the newest parsec sources instead of the dplasma submodule, we remove the parsec subdirectory in dplasma and replace it with a symlink to the parent directory.
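For illustration, here is the directory manipulation described above, sketched in Python against a throwaway temp tree so it can be run without cloning anything. The workflow itself would presumably do this in shell steps. The layout (`dplasma/` directly under the parsec checkout) and the helper name `link_dplasma_to_parent` are assumptions for this sketch, not taken from the workflow file.

```python
import os
import shutil
import tempfile
from pathlib import Path


def link_dplasma_to_parent(parsec_root: Path) -> Path:
    """Replace <parsec_root>/dplasma/parsec with a symlink to parsec_root.

    Hypothetical helper: assumes dplasma was cloned into parsec_root/dplasma.
    """
    submodule = parsec_root / "dplasma" / "parsec"
    if submodule.is_symlink():
        submodule.unlink()
    elif submodule.exists():
        # Drop the submodule checkout so the build cannot pick it up.
        shutil.rmtree(submodule)
    # Relative target: it is resolved from dplasma/, so ".." is parsec_root.
    os.symlink("..", submodule, target_is_directory=True)
    return submodule


# Demo on a temporary tree that stands in for the real checkout.
demo_root = Path(tempfile.mkdtemp()) / "parsec"
(demo_root / "dplasma" / "parsec").mkdir(parents=True)
(demo_root / "CMakeLists.txt").write_text("# stand-in for the parsec sources\n")
(demo_root / "dplasma" / "parsec" / "stale.txt").write_text("submodule stand-in\n")

link = link_dplasma_to_parent(demo_root)
print(link, "->", os.readlink(link))
```

Note that the link makes the tree self-referential (`dplasma/parsec/dplasma/parsec/...`), so any tool that follows symlinks while walking the tree should be kept away from it.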
@abouteiller Is this what you had in mind?
I'm hitting an error where dplasma can't find a header from parsec. The parsec build isn't installing it:
2025-05-07T20:55:12.6393336Z /tmp/parsec/parsec/dplasma/build/src/dplrnt_wrapper.c:14:10: fatal error: parsec/data_dist/matrix/apply.h: No such file or directory
2025-05-07T20:55:12.6394495Z 14 | #include "parsec/data_dist/matrix/apply.h"
It's an auto-generated header from parsec/data_dist/matrix/apply.jdf, and these don't get installed by default. I need to look at the code a bit to see how we can get at least this one installed.
I had a look at the cmake config, and I think I see where it would need to be added. I can create a patchfile to include in this PR (just for testing), and if that's all that's required, we can make the change in the dplasma repo. Sound good?
Edit: I forgot which repo this was in, so no patch is needed. I just added something that results in the apply.h being installed.
This feature is already finding issues, great :D
I also found a problem in the DPLASMA CMakeLists.txt where MPIEXEC_NUMPROC_FLAGS was used instead of MPIEXEC_NUMPROC_FLAG, so the MPI job failed to launch. Now it looks like the only errors are due to "suspicious" solutions.
Found another problem where dplasma can't find HIP if using an external parsec because the default rocm directory /opt/rocm isn't added to CMAKE_SYSTEM_PREFIX_PATH in PaRSECConfig.cmake. I just added it to verify the source of the problem, but I'm not sure if that is the "correct" way.
All tests are now running, and the only remaining problems are incorrect tester results or tester segfaults.
60: Test command: /apps/spacks/2024-07-19/github_env/var/spack/environments/dplasma/.spack-env/view/bin/mpiexec "-n" "4" "./testing_dgetrf_1d" "-N" "378" "-t" "19" "-P" "1" "-x" "-v=5"
60: Working Directory: /tmp/parsec/build/dplasma/build/tests
60: Environment variables:
60: PARSEC_MCA_device_cuda_enabled=0
60: PARSEC_MCA_device_hip_enabled=0
60: PARSEC_MCA_device_level_zero_enabled=0
60: PARSEC_MCA_device_cuda_memory_use=10
60: PARSEC_MCA_device_hip_memory_use=10
60: PARSEC_MCA_device_level_zero_memory_use=10
60: Test timeout computed to be: 1500
60: #+++++ cores detected : 36
60: #+++++ nodes x cores + gpu : 4 x 36 + 0 (144+0)
60: #+++++ thread mode : THREAD_SERIALIZED
60: #+++++ P x Q : 1 x 4 (4/4)
60: #+++++ M x N x K|NRHS : 378 x 378 x 1
60: #+++++ LDA , LDB : 378 , 378
60: #+++++ MB x NB , IB : 19 x 19 , 40
60: [ 0] TIME(s) 0.03314 : PaRSEC initialized
60: [ 2] TIME(s) 0.03323 : PaRSEC initialized
60: [ 3] TIME(s) 0.03325 : PaRSEC initialized
60: [ 1] TIME(s) 0.03325 : PaRSEC initialized
60: W@00000 /!\ PERFORMANCE MIGHT BE REDUCED /!\: Multiple PaRSEC processes on the same node may share the same physical core(s);
60: This is often unintentional, and will perform poorly.
60: Note that in managed environments (e.g., ALPS, jsrun), the launcher may set `cgroups`
60: and hide the real binding from PaRSEC; if you verified that the binding is correct,
60: this message can be silenced using the MCA argument `runtime_warn_slow_binding`.
60: +++ Generate matrices ... Done
60: +++ Generate matrices ... Done
60: +++ Generate matrices ... Done
60: +++ Generate matrices ... Done
60: +++ Computing getrf ... Done.
60: +++ Computing getrf ... Done.
60: +++ Computing getrf ... Done.
60: +++ Computing getrf ... [****] TIME(s) 0.73037 : dgetrf_1d PxQxg= 1 4 0 NB= 19 N= 378 : 0.049202 gflops - ENQ&PROG&DEST 0.73120 : 0.049146 gflops - ENQ 0.00074 - DEST 0.00008
60: <DartMeasurement name="performance" type="numeric/double"
60: encoding="none" compression="none">
60: 0.0492018
60: </DartMeasurement>
60: Done.
60: ============
60: Checking the Residual of the solution
60: -- ||A||_oo = 1.025373e+02, ||X||_oo = 2.768754e+00, ||B||_oo= 5.000000e-01, ||A X - B||_oo = 4.776786e+00
60: -- ||Ax-B||_oo/((||A||_oo||x||_oo+||B||_oo).N.eps) = 4.002243e+11
60: -- Solution is suspicious !
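As a sanity check on the numbers in that log, the scaled residual can be recomputed from the printed norms. This is a minimal sketch, not the tester's code. It assumes `eps` is the unit roundoff 2^-53 (half of Python's `sys.float_info.epsilon`), which is the value that lands close to the logged 4.002243e+11. The function name `scaled_residual` is made up for this example.

```python
import sys

# Norms copied from the test 60 log output.
norm_a = 1.025373e+02  # ||A||_oo
norm_x = 2.768754e+00  # ||X||_oo
norm_b = 5.000000e-01  # ||B||_oo
norm_r = 4.776786e+00  # ||A X - B||_oo
n = 378

# Assumption: eps is the unit roundoff 2**-53, not Python's 2**-52 epsilon.
eps = sys.float_info.epsilon / 2


def scaled_residual(norm_a, norm_x, norm_b, norm_r, n, eps):
    """||Ax-B||_oo / ((||A||_oo ||x||_oo + ||B||_oo) * N * eps), as in the log."""
    return norm_r / ((norm_a * norm_x + norm_b) * n * eps)


ratio = scaled_residual(norm_a, norm_x, norm_b, norm_r, n, eps)
# Same quantity without the N * eps scaling, for a sense of magnitude.
relative = norm_r / (norm_a * norm_x + norm_b)

print(f"scaled residual   : {ratio:.6e}")
print(f"relative residual : {relative:.3e}")
```

The unscaled relative residual is far above double-precision roundoff, so this looks like a genuinely wrong solution and not a marginal miss of the tester's threshold.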