[Issue] Solver reports wrong vector size in parallel, whereas simple SpMV is correct
Describe the issue
Using a matrix uploaded with the AMGX_matrix_upload_distributed function, I obtain the following message on each MPI rank:
Caught amgx exception: Vector size too small: not enough space for halo elements.
Vector: {tag = 1, size = 139264}
Required size: 140760
at: /home/D43345/src/AMGX/src/distributed/comms_visitors3.cu:23
Stack trace:
/home/D43345/opt/amgx-2.5/arch/nvhpc-byo/lib/libamgxsh.so : ()+0x3bff620
/home/D43345/opt/amgx-2.5/arch/nvhpc-byo/lib/libamgxsh.so : amgx::ExcHalo2Functor<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > >::operator()(amgx::CommsMPIHostBufferStream<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&)+0x310
This happens in the code_saturne CFD code's AmgX wrapper (https://github.com/code-saturne/code_saturne/blob/master/src/alge/cs_sles_amgx.cpp), which, apart from a recent migration from C to C++, has not changed much in the last two years, and did work at one point two or three years ago (though possibly on a different test case). The AmgX call sequence is quite similar to the one in PETSc (I compared the two looking for incorrect code on my side).
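For reference, the wrapper's AmgX call sequence boils down to the usual AMGX C API pattern; the following is a hedged, simplified sketch (not the actual cs_sles_amgx.cpp code), with placeholder names (n_rows, nnz, row_index, col_gid, val, rhs_host, x_host, dist, ...) and without AMGX_initialize(), the distribution-handle setup, or error checking. The dDDI mode matches the TemplateConfig visible in the stack trace above.

#include <mpi.h>
#include <stdint.h>
#include <amgx_c.h>

/* Hedged sketch of the upload/solve sequence (hypothetical helper, not the
 * real wrapper); the distribution handle is assumed to be set up elsewhere. */
static void amgx_solve_sketch(MPI_Comm mpi_comm, int device_id,
                              int n_global, int n_rows, int nnz,
                              const int *row_index, const int64_t *col_gid,
                              const double *val,
                              const double *rhs_host, double *x_host,
                              AMGX_distribution_handle dist)
{
  AMGX_config_handle cfg;
  AMGX_resources_handle rsrc;
  AMGX_matrix_handle A;
  AMGX_vector_handle x, b;
  AMGX_solver_handle solver;

  AMGX_config_create_from_file(&cfg, "PCG_CLASSICAL_V_JACOBI.json");
  AMGX_resources_create(&rsrc, cfg, &mpi_comm, 1, &device_id);

  AMGX_matrix_create(&A, rsrc, AMGX_mode_dDDI);
  AMGX_vector_create(&x, rsrc, AMGX_mode_dDDI);
  AMGX_vector_create(&b, rsrc, AMGX_mode_dDDI);
  AMGX_solver_create(&solver, rsrc, AMGX_mode_dDDI, cfg);

  /* n_rows owned rows on this rank, global (64-bit) column indices. */
  AMGX_matrix_upload_distributed(A, n_global, n_rows, nnz, 1, 1,
                                 row_index, col_gid, val, NULL, dist);

  /* Bind the vectors to the matrix so the distributed layout is known,
   * then upload the local (owned) values. */
  AMGX_vector_bind(b, A);
  AMGX_vector_bind(x, A);
  AMGX_vector_upload(b, n_rows, 1, rhs_host);
  AMGX_vector_upload(x, n_rows, 1, x_host);

  AMGX_solver_setup(solver, A);
  AMGX_solver_solve(solver, b, x);
  AMGX_vector_download(x, x_host);
}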
Testing the SpMV product alone prior to solving the system, everything seems fine, and the result matches that of a reference (built-in) implementation. After stepping through here:
709 template<AMGX_Mode CASE,
710 template<typename> class SolverType,
711 template<typename> class VectorType>
712 inline AMGX_ERROR solve_with(AMGX_solver_handle slv,
713 AMGX_vector_handle rhs,
714 AMGX_vector_handle sol,
715 Resources *resources,
716 bool xIsZero = false)
717 {
718 typedef SolverType<typename TemplateMode<CASE>::Type> SolverLetterT;
719 typedef CWrapHandle<AMGX_solver_handle, SolverLetterT> SolverW;
720 typedef VectorType<typename TemplateMode<CASE>::Type> VectorLetterT;
721 typedef CWrapHandle<AMGX_vector_handle, VectorLetterT> VectorW;
722 SolverW wrapSolver(slv);
723 SolverLetterT &solver = *wrapSolver.wrapped();
724 //AMGX_STATUS& slv_stat = wrapSolver.last_solve_status(slv);
725 VectorW wrapRhs(rhs);
726 VectorLetterT &b = *wrapRhs.wrapped();
727 VectorW wrapSol(sol);
728 VectorLetterT &x = *wrapSol.wrapped();
Checking the vector sizes:
(gdb) p v_rhs.size()
$4 = 139264
(gdb) p v_x.size()
$5 = 140760
The solution vector is larger than the RHS vector, which correctly accounts for the ghost cells and matches the sizes expected by the calling code.
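For completeness, the SpMV cross-check mentioned above roughly amounts to the following on the calling side (hedged sketch with hypothetical names, reusing the handles from the sketch above; y_ref_host holds the built-in reference product):

#include <algorithm>
#include <cmath>
#include <amgx_c.h>

/* Hedged sketch of the SpMV cross-check (hypothetical names). The handles
 * A, x and y are assumed created and bound as in the upload sketch above. */
static double amgx_spmv_check_sketch(AMGX_matrix_handle A,
                                     AMGX_vector_handle x,
                                     AMGX_vector_handle y,
                                     int n_rows, const double *x_host,
                                     double *y_amgx_host,
                                     const double *y_ref_host)
{
  AMGX_vector_upload(x, n_rows, 1, x_host);
  AMGX_matrix_vector_multiply(A, x, y);      /* y = A*x on the device */
  AMGX_vector_download(y, y_amgx_host);

  /* Maximum absolute difference over the owned entries. */
  double max_diff = 0.0;
  for (int i = 0; i < n_rows; i++)
    max_diff = std::max(max_diff, std::fabs(y_amgx_host[i] - y_ref_host[i]));
  return max_diff;
}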
But when solving, after stepping through here:
709 template<AMGX_Mode CASE,
710 template<typename> class SolverType,
711 template<typename> class VectorType>
712 inline AMGX_ERROR solve_with(AMGX_solver_handle slv,
713 AMGX_vector_handle rhs,
714 AMGX_vector_handle sol,
715 Resources *resources,
716 bool xIsZero = false)
717 {
718 typedef SolverType<typename TemplateMode<CASE>::Type> SolverLetterT;
719 typedef CWrapHandle<AMGX_solver_handle, SolverLetterT> SolverW;
720 typedef VectorType<typename TemplateMode<CASE>::Type> VectorLetterT;
721 typedef CWrapHandle<AMGX_vector_handle, VectorLetterT> VectorW;
722 SolverW wrapSolver(slv);
723 SolverLetterT &solver = *wrapSolver.wrapped();
724 //AMGX_STATUS& slv_stat = wrapSolver.last_solve_status(slv);
725 VectorW wrapRhs(rhs);
726 VectorLetterT &b = *wrapRhs.wrapped();
727 VectorW wrapSol(sol);
728 VectorLetterT &x = *wrapSol.wrapped();
Although the code is very similar, when I check the RHS and solution sizes, I get the following:
(gdb) p x.size()
$3 = 139264
(gdb) p b.size()
$4 = 140760
So it seems the two sizes are swapped, leading to the error in the check that precedes the actual ghost cell exchange.
I managed to get this far, but I am not familiar enough with the wrapper mechanisms to understand where the error comes from. I do, however, reproduce it on several systems (laptop and cluster).
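For what it is worth, the failing check appears to be of the following kind (this is only a schematic reconstruction from the error message, not the actual AMGX code in comms_visitors3.cu): the vector handed to the halo exchange must have room for the owned entries plus the halo entries, and with the two sizes swapped the smaller vector is checked against the larger requirement.

#include <cstddef>
#include <stdexcept>
#include <string>

/* Schematic of the pre-exchange size check (reconstructed from the error
 * message above, not the actual AMGX source). */
static void check_halo_space(std::size_t vec_size, std::size_t required_size)
{
  if (vec_size < required_size)
    throw std::runtime_error(
      "Vector size too small: not enough space for halo elements. size = "
      + std::to_string(vec_size) + ", required = "
      + std::to_string(required_size));
}

int main()
{
  // With the sizes seen in gdb: the vector that needs halo space should be
  // the 140760-entry one (rows + ghosts); with the sizes swapped, the
  // 139264-entry vector is checked against 140760 and the exception fires.
  check_halo_space(140760, 140760);  // expected pairing: passes
  check_halo_space(139264, 140760);  // swapped pairing: throws, as observed
}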
Environment information:
- OS: Linux; reproduced on Arch Linux, Debian 12, and RHEL 8.
- CUDA runtime: 12.8 and 12.4 at least
- MPI: native Open MPI 5 (CUDA-aware) or Open MPI 4 (non-CUDA-aware)
- AMGX: commit 666311a52c36d75f6738ecc15897840256d2cf79
- NVIDIA GPU: RTX4600 mobile, A500, V100
AMGX solver configuration
Tested both with PCG_CLASSICAL_V_JACOBI.json and no configuration.
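For reference, a minimal sketch of how the two setups map onto the config API (hedged; exactly what the wrapper passes in the "no configuration" case is an assumption here):

/* Hedged sketch of the two configurations tested (error checking omitted). */
AMGX_config_handle cfg;

/* (a) the shipped JSON configuration file */
AMGX_config_create_from_file(&cfg, "PCG_CLASSICAL_V_JACOBI.json");

/* (b) "no configuration": a plain options string; an empty string
 *     presumably falls back to AMGX's default settings. */
AMGX_config_create(&cfg, "");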
Matrix Data
code_saturne's BUNDLE test case (https://github.com/code-saturne/saturne-open-cases/tree/main/BUNDLE), case BENCH_C16_04 (coarse), run on 2 MPI ranks, can be adapted for this, but it requires building the code and setting the following environment variables:
CS_BENCH_NO_PERIODICITY=1
CS_BENCH_SLES_TYPE=amgx
I do not have an easy/standard mechanism for exporting the matrix data from this system.
Reproduction steps
See the Matrix Data section above: building code_saturne, setting the two environment variables listed there, and running the BENCH_C16_04 case of the BUNDLE test on 2 MPI ranks reproduces the issue.
Additional context
Follow-up: in the calling code, uploading the solution vector with a size matching the number of columns (instead of the number of rows) seems to work around this.
It is strange, though, that copying halo data which will be overwritten anyway when synced should be necessary, so an API function allowing the user to provide a hint for the correct size would be nice.
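For reference, the workaround on the calling side looks roughly like this (hedged sketch with hypothetical names; n_rows is the number of owned rows and n_cols the number of local columns, i.e. owned rows plus ghost/halo cells, both known from the calling code's mesh):

/* Workaround sketch (hypothetical names): size the solution array for
 * n_cols entries (owned rows + ghost cells) and upload it with that size,
 * while the RHS keeps the owned size n_rows. */
AMGX_vector_upload(b, n_rows, 1, rhs_host);        /* owned entries only  */
AMGX_vector_upload(x, n_cols, 1, x_host_padded);   /* includes halo space */

AMGX_solver_solve(solver, b, x);

/* The owned values of the result are in the first n_rows entries. */
AMGX_vector_download(x, x_host_padded);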