How to use AMGX_matrix_upload_distributed with external MPI initialization
Good day. I am trying to apply the distributed matrix loading approach from the example AMGX/examples/amgx_mpi_capi_cla.c in my code. The peculiarity of my code is that MPI is initialized in external code. Each process works with its own piece of the grid, and I use AMGX only to solve the system of linear algebraic equations (SLAE). Could you please tell me how the example should be modified to make it work in this case? I have tried it in different ways. In my current configuration it looks like this:

MPI_Comm_dup(MPI_COMM_WORLD, &amgx_mpi_comm); /* MPI init (with CUDA GPUs) */
// MPI
MPI_Comm_size(amgx_mpi_comm, &nranks);
MPI_Comm_rank(amgx_mpi_comm, &rank);
// CUDA GPUs
CUDA_SAFE_CALL(cudaGetDeviceCount(&gpu_count));
lrank = rank % gpu_count;
CUDA_SAFE_CALL(cudaSetDevice(lrank));
printf("Process %d selecting device %d\n", rank, lrank);
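/* For context (sketch only, not my exact code): as in amgx_mpi_capi_cla.c, the
   library and the config object are created before the resources; the config
   file name below is just a placeholder. */
AMGX_SAFE_CALL(AMGX_initialize());
AMGX_SAFE_CALL(AMGX_initialize_plugins());
AMGX_SAFE_CALL(AMGX_config_create_from_file(&config, "solver_config.json"));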
// app must know how to provide a mapping
AMGX_resources_create(&rsrc, config, &amgx_mpi_comm, 1, &lrank);
if (partition_vector == NULL)
{
    // If no partition vector is given, we assume a partitioning with contiguous blocks (see example above). It is sufficient (and faster/more scalable)
    // to calculate the partition offsets and pass those into the API call instead of creating a full partition vector.
    int64_t* partition_offsets = (int64_t*)malloc((nranks + 1) * sizeof(int64_t));
    // gather the number of rows on each rank, and perform an exclusive scan to get the offsets.
    int64_t n64 = n;
    partition_offsets[0] = 0; // rows of rank 0 always start at index 0
    MPI_Allgather(&n64, 1, MPI_INT64_T, &partition_offsets[1], 1, MPI_INT64_T, amgx_mpi_comm);
    for (int i = 2; i < nranks + 1; ++i) {
        partition_offsets[i] += partition_offsets[i - 1];
    }
    nglobal = partition_offsets[nranks]; // last element always has global number of rows
    AMGX_distribution_handle dist;
    AMGX_distribution_create(&dist, config);
    AMGX_distribution_set_partition_data(dist, AMGX_DIST_PARTITION_OFFSETS, partition_offsets);
    AMGX_matrix_upload_distributed(A, nglobal, n, nnz, block_dimx, block_dimy, row_ptrs, col_indices, values, diag, dist);
    AMGX_distribution_destroy(dist);
    free(partition_offsets);
}
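After the upload I intend to follow the rest of the example for the solve itself. For reference, here is a simplified sketch of that part (based on the example, not my exact code; mode, b_host and x_host are placeholders):

// Sketch based on amgx_mpi_capi_cla.c (not my exact code): create the RHS and
// solution vectors, bind them to the distributed matrix, upload the local
// pieces, and solve.
AMGX_vector_create(&b, rsrc, mode);
AMGX_vector_create(&x, rsrc, mode);
// in the distributed case the vectors must be bound to the matrix so that
// they pick up its row distribution
AMGX_vector_bind(b, A);
AMGX_vector_bind(x, A);
// upload the local part of the RHS and of the initial guess (n local rows)
AMGX_vector_upload(b, n, block_dimx, b_host);
AMGX_vector_upload(x, n, block_dimx, x_host);
// set up and run the solver
AMGX_solver_create(&solver, rsrc, mode, config);
AMGX_solver_setup(solver, A);
AMGX_solver_solve(solver, b, x);
AMGX_solver_get_status(solver, &status);
// copy the local part of the solution back to the host
AMGX_vector_download(x, x_host);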
I run it on 2 processes:

mpiexec -n 2 <application_name>.exe

The log is like this:

Process 0 selecting device 0
...
AMGX version 2.1.0.131-opensource
Built on Aug 17 2020, 11:01:51
Compiled with CUDA Runtime 11.0, using CUDA driver 11.3
Using Normal MPI (Hostbuffer) communicator...
job aborted:
rank: node: exit code[: error message]
0: LAPTOP-L6OI9HRJ: -1073741819: process 0 exited without calling finalize
1: LAPTOP-L6OI9HRJ: -1073741819: process 1 exited without calling finalize
Exit code -1073741819 is 0xC0000005, i.e. an access violation. Maybe someone has come across such a situation; any help would be welcome.
P.S. I have read https://github.com/NVIDIA/AMGX/issues/81 and have taken the specifics of matrix formation for this case into account.