
How to use AMGX_matrix_upload_distributed with external MPI initialization

Open vmasyagin opened this issue 4 years ago • 0 comments

Good day. I am trying to apply the distributed matrix loading approach from the example AMGX/examples/amgx_mpi_capi_cla.c in my code. The peculiarity of my code is that MPI is initialized in external code. Each process works with its own piece of the grid, and I use AMGX only to solve the system of linear equations (SLAE). Can you please tell me how the example should be modified in this case to make it work? I have tried different approaches. In my current configuration it looks like this:

        MPI_Comm_dup(MPI_COMM_WORLD, &amgx_mpi_comm);

        /* MPI init (with CUDA GPUs) */
        // MPI
        MPI_Comm_size(amgx_mpi_comm, &nranks);
        MPI_Comm_rank(amgx_mpi_comm, &rank);

        // CUDA GPUs
        CUDA_SAFE_CALL(cudaGetDeviceCount(&gpu_count));
        lrank = rank % gpu_count;
        CUDA_SAFE_CALL(cudaSetDevice(lrank));
        printf("Process %d selecting device %d\n", rank, lrank);

        // app must know how to provide a mapping
        AMGX_resources_create(&rsrc, config, &amgx_mpi_comm, 1, &lrank);

        if (partition_vector == NULL)
        {
            // If no partition vector is given, we assume a partitioning with contiguous blocks (see example above).
            // It is sufficient (and faster/more scalable) to calculate the partition offsets and pass those into
            // the API call instead of creating a full partition vector.
            int64_t* partition_offsets = (int64_t*)malloc((nranks + 1) * sizeof(int64_t));
            // gather the number of rows on each rank, and perform an exclusive scan to get the offsets.
            int64_t n64 = n;
            partition_offsets[0] = 0; // rows of rank 0 always start at index 0
            MPI_Allgather(&n64, 1, MPI_INT64_T, &partition_offsets[1], 1, MPI_INT64_T, amgx_mpi_comm);
            for (int i = 2; i < nranks + 1; ++i) {
                partition_offsets[i] += partition_offsets[i - 1];
            }
            nglobal = partition_offsets[nranks]; // last element always has global number of rows

            AMGX_distribution_handle dist;
            AMGX_distribution_create(&dist, config);
            AMGX_distribution_set_partition_data(dist, AMGX_DIST_PARTITION_OFFSETS, partition_offsets);
            AMGX_matrix_upload_distributed(A, nglobal, n, nnz, block_dimx, block_dimy, row_ptrs, col_indices, values, diag, dist);
            AMGX_distribution_destroy(dist);
            free(partition_offsets);
        }
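
For context, here is a minimal sketch of the overall call ordering I am assuming when MPI_Init/MPI_Finalize are owned by the host application and AMGX is only used for the solve. The wrapper name solve_with_amgx, the config file solver_config.json and the AMGX_mode_dDDI mode are placeholders for my real setup, and error checking is omitted:

    /* Minimal sketch: solve one system with AMGX while the MPI lifetime is
       owned by the host application. Placeholders: solve_with_amgx,
       solver_config.json, AMGX_mode_dDDI, block size 1. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <amgx_c.h>

    void solve_with_amgx(MPI_Comm external_comm,
                         int n, int nnz,
                         const int *row_ptrs,
                         const int64_t *col_indices,   /* GLOBAL, 64-bit column indices */
                         const double *values,
                         const double *rhs, double *sol)
    {
        int initialized = 0;
        MPI_Initialized(&initialized);
        if (!initialized) return;             /* host code must call MPI_Init first */

        MPI_Comm amgx_mpi_comm;
        MPI_Comm_dup(external_comm, &amgx_mpi_comm);

        int nranks, rank, gpu_count, lrank;
        MPI_Comm_size(amgx_mpi_comm, &nranks);
        MPI_Comm_rank(amgx_mpi_comm, &rank);
        cudaGetDeviceCount(&gpu_count);
        lrank = rank % gpu_count;
        cudaSetDevice(lrank);

        /* AMGX library init: after MPI init, once per process */
        AMGX_initialize();
        AMGX_initialize_plugins();

        AMGX_config_handle cfg;
        AMGX_config_create_from_file(&cfg, "solver_config.json");

        AMGX_resources_handle rsrc;
        AMGX_resources_create(&rsrc, cfg, &amgx_mpi_comm, 1, &lrank);

        AMGX_matrix_handle A;
        AMGX_vector_handle b, x;
        AMGX_solver_handle solver;
        AMGX_matrix_create(&A, rsrc, AMGX_mode_dDDI);
        AMGX_vector_create(&b, rsrc, AMGX_mode_dDDI);
        AMGX_vector_create(&x, rsrc, AMGX_mode_dDDI);
        AMGX_solver_create(&solver, rsrc, AMGX_mode_dDDI, cfg);

        /* contiguous block partitioning: offsets from an allgather of n,
           same as in the snippet above */
        int64_t *offs = (int64_t *)malloc((nranks + 1) * sizeof(int64_t));
        int64_t n64 = n;
        offs[0] = 0;
        MPI_Allgather(&n64, 1, MPI_INT64_T, &offs[1], 1, MPI_INT64_T, amgx_mpi_comm);
        for (int i = 2; i <= nranks; ++i) offs[i] += offs[i - 1];
        int nglobal = (int)offs[nranks];

        AMGX_distribution_handle dist;
        AMGX_distribution_create(&dist, cfg);
        AMGX_distribution_set_partition_data(dist, AMGX_DIST_PARTITION_OFFSETS, offs);
        AMGX_matrix_upload_distributed(A, nglobal, n, nnz, 1, 1,
                                       row_ptrs, col_indices, values, NULL, dist);
        AMGX_distribution_destroy(dist);
        free(offs);

        /* bind vectors to the distributed matrix so they share its distribution,
           as in the MPI examples */
        AMGX_vector_bind(b, A);
        AMGX_vector_bind(x, A);
        AMGX_vector_upload(b, n, 1, rhs);
        AMGX_vector_upload(x, n, 1, sol);

        AMGX_solver_setup(solver, A);
        AMGX_solver_solve(solver, b, x);
        AMGX_vector_download(x, sol);

        /* tear down AMGX, but leave MPI alive for the host code */
        AMGX_solver_destroy(solver);
        AMGX_vector_destroy(x);
        AMGX_vector_destroy(b);
        AMGX_matrix_destroy(A);
        AMGX_resources_destroy(rsrc);
        AMGX_config_destroy(cfg);
        AMGX_finalize_plugins();
        AMGX_finalize();
        MPI_Comm_free(&amgx_mpi_comm);
    }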

I run the application on 2 processes:

    mpiexec -n 2 <application_name>.exe

The log is like this:

    Process 0 selecting device 0
    ...
    AMGX version 2.1.0.131-opensource
    Built on Aug 17 2020, 11:01:51
    Compiled with CUDA Runtime 11.0, using CUDA driver 11.3
    Using Normal MPI (Hostbuffer) communicator...

    job aborted:
    rank: node: exit code[: error message]
    0: LAPTOP-L6OI9HRJ: -1073741819: process 0 exited without calling finalize
    1: LAPTOP-L6OI9HRJ: -1073741819: process 1 exited without calling finalize

Maybe someone has come across a similar situation. Any help would be welcome.

P.S. I have read https://github.com/NVIDIA/AMGX/issues/81 and have taken into account the specifics of matrix formation described there for this case; a small sketch of the layout I am assuming is shown below.
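
To make the matrix formation concrete, this is the per-rank data layout I am assuming for a contiguous block partitioning: a hypothetical 4x4 tridiagonal system split over 2 ranks, block size 1, no external diagonal. Row pointers are local, column indices are global and 64-bit:

    /* hypothetical 4x4 tridiagonal system: rank 0 owns global rows 0-1,
       rank 1 owns global rows 2-3 (requires <stdint.h> for int64_t) */

    /* rank 0 */
    int     n0          = 2, nnz0 = 5;
    int     row_ptrs0[] = { 0, 2, 5 };                       /* local row pointers    */
    int64_t col_idx0[]  = { 0, 1,   0, 1, 2 };               /* GLOBAL column indices */
    double  values0[]   = { 2.0, -1.0,   -1.0, 2.0, -1.0 };

    /* rank 1 */
    int     n1          = 2, nnz1 = 5;
    int     row_ptrs1[] = { 0, 3, 5 };
    int64_t col_idx1[]  = { 1, 2, 3,   2, 3 };
    double  values1[]   = { -1.0, 2.0, -1.0,   -1.0, 2.0 };

    /* on both ranks: partition_offsets = {0, 2, 4}, nglobal = 4,
       block_dimx = block_dimy = 1, diag = NULL */

With this layout the offsets {0, 2, 4} fully describe the ownership, so no full partition vector should be needed.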

vmasyagin · May 21 '21, 12:05