Parallelizing pcgsolve with OpenMP
We are using the PCG solver provided in KokkosKernels for our hydrological model. We found that when we enable OpenMP, the scaling does not look good.
We simply call run_pcg<Kokkos::OpenMP>(domsub.nCellDomain, statesub, domsub, timers); and in run_pcg, we call:
KokkosKernels::Experimental::Example::pcgsolve(kh, matA, vecB, vecX, cg_iteration_limit, cg_iteration_tolerance, &cg_result, true, clusterSize, useSequential);
Kokkos::fence();
pcgsolve takes about 90% of the computational time, and it does not scale well as we add threads. Since it is provided by KokkosKernels, it is difficult for us to pinpoint the reason. Are there additional steps we might be missing that could improve the scaling performance of the PCG solver?
We would also like to ask whether you have any recommendations for choosing a linear system solver (and perhaps a nonlinear solver as well) that is compatible with Kokkos and scales well with OpenMP and MPI?
I have attached a figure that shows the speedup of our model (the blue dots) using pcgsolve. Thank you so much in advance!

@jennloe
Actually, a good start would be to re-run this with finer timers, for instance using the Kokkos Tools simple-kernel-timer. You just need to build the tool and then set: export KOKKOS_PROFILE_LIBRARY=${HOME}/kokkos-tools/profiling/simple-kernel-timer/kp_kernel_timer.so
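For completeness, a sketch of the full workflow with the simple-kernel-timer (the clone location and the binary name `./your_model_executable` are placeholders for your setup, and depending on the kokkos-tools version the tool may build with CMake instead of the per-tool Makefile):

```shell
# Clone and build the simple-kernel-timer tool (paths are assumptions).
git clone https://github.com/kokkos/kokkos-tools.git ${HOME}/kokkos-tools
cd ${HOME}/kokkos-tools/profiling/simple-kernel-timer
make   # produces kp_kernel_timer.so (and the kp_reader utility)

# Point Kokkos at the tool, then run the application as usual.
export KOKKOS_PROFILE_LIBRARY=${HOME}/kokkos-tools/profiling/simple-kernel-timer/kp_kernel_timer.so
./your_model_executable   # placeholder for your binary

# Each run writes a binary <hostname>-<pid>.dat file in the working
# directory; decode it with the kp_reader utility built above.
./kp_reader *.dat
```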
I will start looking at this...
@zLi90 What size problem were you running on to get these timings? How sparse is the matrix? What was the convergence tolerance? How many iterations were you running? And what size of problem do you ultimately want to be able to run?
@jennloe We are running a 3D transient subsurface flow model. This test case has 40x24x32 = 30720 grid cells, resulting in a 7-diagonal matrix. The tolerance is 1e-8, and each time step takes (more or less) 18 PCG iterations (we ran a total of 900 seconds with a time step dt = 60 s). We hope to be able to run large-scale hydrological simulations with 1 million grid cells or more. Thanks!
Hi @jennloe, just wondering if you have found anything yet? Do you have any suggestions for what we could try next? Thanks!
I'm sorry; I have not had more time to look at this, and I will not have time in the next few weeks. Thank you for your patience. @brian-kelley would you be able to look into this scaling?
I like Luc's comment above. @zLi90 can you enable the finer timers please?
@srajama1 @lucbv Thanks for the suggestion! I have tried the simple-kernel-timer, but I am not sure how to decode the output file (or perhaps I did something wrong?)
For example, I got an output file with the first few lines like this:
ÄÜ7@H
Ok, I found the provided reader. This is useful indeed! Here are the top 5 time-consuming sections:
with 1 thread:

| Kernel (type) | Total time (s) | Call count | Avg. per call (s) | % of Kokkos time | % of total time |
|---|---:|---:|---:|---:|---:|
| SubsurfaceModel::computeFaceConductivity(SubsurfaceState&, DomainSubsurface&)::{lambda(int)#1} (ParFor) | 1.248947 | 30 | 0.041632 | 29.454716 | 23.021248 |
| SubsurfaceModel::wcRedistribution(SubsurfaceState&, DomainSubsurface&)::{lambda(int)#1} (ParFor) | 1.213125 | 15 | 0.080875 | 28.609900 | 22.360956 |
| KokkosSparse::spmv<NoTranspose,Static> (ParFor) | 0.987691 | 231 | 0.004276 | 23.293334 | 18.205629 |
| KokkosBlas::Axpby::S15 (ParFor) | 0.214088 | 663 | 0.000323 | 5.048977 | 3.946185 |
| KokkosBlas::dot<1D> (ParRed) | 0.205637 | 678 | 0.000303 | 4.849678 | 3.790417 |

with 2 threads:

| Kernel (type) | Total time (s) | Call count | Avg. per call (s) | % of Kokkos time | % of total time |
|---|---:|---:|---:|---:|---:|
| SubsurfaceModel::computeFaceConductivity(SubsurfaceState&, DomainSubsurface&)::{lambda(int)#1} (ParFor) | 0.626923 | 30 | 0.020897 | 28.184585 | 18.427453 |
| SubsurfaceModel::wcRedistribution(SubsurfaceState&, DomainSubsurface&)::{lambda(int)#1} (ParFor) | 0.585216 | 15 | 0.039014 | 26.309592 | 17.201558 |
| KokkosSparse::spmv<NoTranspose,Static> (ParFor) | 0.498451 | 231 | 0.002158 | 22.408882 | 14.651222 |
| KokkosBlas::Axpby::S15 (ParFor) | 0.142773 | 663 | 0.000215 | 6.418669 | 4.196611 |
| KokkosBlas::dot<1D> (ParRed) | 0.141340 | 678 | 0.000208 | 6.354218 | 4.154472 |
It seems that the last two items (Axpby and dot) don't scale well: they only drop from about 0.21 s to 0.14 s (roughly 1.5x on 2 threads), while spmv and the two model kernels speed up nearly 2x. I believe Axpby and dot belong to the pcgsolve of KokkosKernels.
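One thing worth trying for bandwidth-bound kernels like dot and axpby is explicit thread placement: without binding, the OS may migrate OpenMP threads and hurt scaling. A minimal sketch using the standard OpenMP affinity variables (`./your_model_executable` is a placeholder for the actual binary, and the effect depends on your hardware):

```shell
# Hedged sketch: pin OpenMP threads before launching the model.
export OMP_NUM_THREADS=2        # match the thread count of the run
export OMP_PROC_BIND=spread     # spread threads across cores
export OMP_PLACES=threads       # one place per hardware thread
./your_model_executable         # placeholder binary name
```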
@zLi90 Do you compile with any BLAS enabled? We have native implementations of these routines in Kokkos Kernels, but one should use a vendor BLAS when available. What is the platform and compiler combination?
@srajama1 I am using Mac+gcc. I just use the default settings when building kokkos kernels. Do you mean I should add -DKokkosKernels_ENABLE_TPL_BLAS=ON when building?
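Assuming a CMake build of Kokkos Kernels, enabling the BLAS third-party library could look like the sketch below (all paths are placeholders for your setup; on macOS, CMake's FindBLAS can typically locate the system Accelerate framework):

```shell
# Hedged sketch of a Kokkos Kernels configure that links a system BLAS.
cmake -B build \
  -DKokkos_ROOT=/path/to/kokkos/install \
  -DKokkosKernels_ENABLE_TPL_BLAS=ON \
  -DCMAKE_BUILD_TYPE=Release \
  /path/to/kokkos-kernels
cmake --build build -j
```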