Parallelizing pcgsolve with OpenMP
We are using the PCG solver provided in KokkosKernels for our hydrological model. We found that when we enable OpenMP, the scaling does not look good.
We simply call run_pcg<Kokkos::OpenMP>(domsub.nCellDomain, statesub, domsub, timers); and in run_pcg, we call:
KokkosKernels::Experimental::Example::pcgsolve(kh, matA, vecB, vecX, cg_iteration_limit, cg_iteration_tolerance, &cg_result, true, clusterSize, useSequential);
Kokkos::fence();
pcgsolve takes about 90% of the computational time, and it does not scale well as we add threads. Since it is provided by KokkosKernels, it is difficult for us to pinpoint the reason. Are there additional steps we might be missing that could improve the scaling performance of the PCG solver?
We would also like to ask whether you have any recommendations for choosing a linear system solver (and perhaps a nonlinear solver as well) that is compatible with Kokkos and scales well with OpenMP and MPI?
I have attached a figure that shows the speedup of our model (the blue dots) using pcgsolve. Thank you so much in advance!

@jennloe
Actually, a good start would be to re-run this with finer timers, for instance using the Kokkos Tools simple-kernel-timer. You just need to build the tool and then set: export KOKKOS_PROFILE_LIBRARY=${HOME}/kokkos-tools/profiling/simple-kernel-timer/kp_kernel_timer.so
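For completeness, a sketch of the full workflow with the simple-kernel-timer (the clone location and the binary name `./your_model_executable` are placeholders for your setup, and depending on the kokkos-tools version the tool may build with CMake instead of the per-tool Makefile):

```shell
# Clone and build the simple-kernel-timer tool (paths are assumptions).
git clone https://github.com/kokkos/kokkos-tools.git ${HOME}/kokkos-tools
cd ${HOME}/kokkos-tools/profiling/simple-kernel-timer
make   # produces kp_kernel_timer.so (and the kp_reader utility)

# Point Kokkos at the tool, then run the application as usual.
export KOKKOS_PROFILE_LIBRARY=${HOME}/kokkos-tools/profiling/simple-kernel-timer/kp_kernel_timer.so
./your_model_executable   # placeholder for your binary

# Each run writes a binary <hostname>-<pid>.dat file in the working
# directory; decode it with the kp_reader utility built above.
./kp_reader *.dat
```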
I will start looking at this...
@zLi90 What size problem were you running on to get these timings? How sparse is the matrix? What was the convergence tolerance? How many iterations were you running? And what size of problem do you ultimately want to be able to run?
@jennloe We are running a 3D transient subsurface flow model. This test case has 40x24x32 = 30720 grid cells, resulting in a 7-diagonal matrix. The tolerance is 1e-8, and each time step takes (more or less) 18 PCG iterations (we ran a total of 900 seconds with a time step dt = 60 s). We hope to be able to run large-scale hydrological simulations with 1 million grid cells or more. Thanks!
Hi @jennloe, just wondering if you have found anything yet? Do you have any suggestions for what we could try next? Thanks!
I'm sorry; I have not had more time to look at this, and I will not have time in the next few weeks. Thank you for your patience. @brian-kelley would you be able to look into this scaling?
I like Luc's comment above. @zLi90 can you enable the finer timers please?
@srajama1 @lucbv Thanks for the suggestion! I have tried the simple-kernel-timer, but I am not sure how to decode the output file (or perhaps I did something wrong?)
For example, I got an output file with the first few lines like this:
ÄÜ7@H
Ok, I found the provided reader. This is useful indeed! Here are the top 5 time-consuming sections:
with 1 thread:

| Kernel (type) | Total time (s) | Call count | Avg. per call (s) | % of Kokkos time | % of total time |
|---|---:|---:|---:|---:|---:|
| SubsurfaceModel::computeFaceConductivity(SubsurfaceState&, DomainSubsurface&)::{lambda(int)#1} (ParFor) | 1.248947 | 30 | 0.041632 | 29.454716 | 23.021248 |
| SubsurfaceModel::wcRedistribution(SubsurfaceState&, DomainSubsurface&)::{lambda(int)#1} (ParFor) | 1.213125 | 15 | 0.080875 | 28.609900 | 22.360956 |
| KokkosSparse::spmv<NoTranspose,Static> (ParFor) | 0.987691 | 231 | 0.004276 | 23.293334 | 18.205629 |
| KokkosBlas::Axpby::S15 (ParFor) | 0.214088 | 663 | 0.000323 | 5.048977 | 3.946185 |
| KokkosBlas::dot<1D> (ParRed) | 0.205637 | 678 | 0.000303 | 4.849678 | 3.790417 |

with 2 threads:

| Kernel (type) | Total time (s) | Call count | Avg. per call (s) | % of Kokkos time | % of total time |
|---|---:|---:|---:|---:|---:|
| SubsurfaceModel::computeFaceConductivity(SubsurfaceState&, DomainSubsurface&)::{lambda(int)#1} (ParFor) | 0.626923 | 30 | 0.020897 | 28.184585 | 18.427453 |
| SubsurfaceModel::wcRedistribution(SubsurfaceState&, DomainSubsurface&)::{lambda(int)#1} (ParFor) | 0.585216 | 15 | 0.039014 | 26.309592 | 17.201558 |
| KokkosSparse::spmv<NoTranspose,Static> (ParFor) | 0.498451 | 231 | 0.002158 | 22.408882 | 14.651222 |
| KokkosBlas::Axpby::S15 (ParFor) | 0.142773 | 663 | 0.000215 | 6.418669 | 4.196611 |
| KokkosBlas::dot<1D> (ParRed) | 0.141340 | 678 | 0.000208 | 6.354218 | 4.154472 |
It seems that the last two items (Axpby and dot) don't scale well: they only drop from about 0.21 s to 0.14 s (roughly 1.5x on 2 threads), while spmv and the two model kernels speed up nearly 2x. I believe Axpby and dot belong to the pcgsolve of KokkosKernels.
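One thing worth trying for bandwidth-bound kernels like dot and axpby is explicit thread placement: without binding, the OS may migrate OpenMP threads and hurt scaling. A minimal sketch using the standard OpenMP affinity variables (`./your_model_executable` is a placeholder for the actual binary, and the effect depends on your hardware):

```shell
# Hedged sketch: pin OpenMP threads before launching the model.
export OMP_NUM_THREADS=2        # match the thread count of the run
export OMP_PROC_BIND=spread     # spread threads across cores
export OMP_PLACES=threads       # one place per hardware thread
./your_model_executable         # placeholder binary name
```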
@zLi90 Do you compile with any BLAS enabled? We have native implementations of these routines in Kokkos Kernels, but one should use a vendor BLAS when available. What is the platform and compiler combination?
@srajama1 I am using Mac+gcc. I just use the default settings when building kokkos kernels. Do you mean I should add -DKokkosKernels_ENABLE_TPL_BLAS=ON when building?
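Assuming a CMake build of Kokkos Kernels, enabling the BLAS third-party library could look like the sketch below (all paths are placeholders for your setup; on macOS, CMake's FindBLAS can typically locate the system Accelerate framework):

```shell
# Hedged sketch of a Kokkos Kernels configure that links a system BLAS.
cmake -B build \
  -DKokkos_ROOT=/path/to/kokkos/install \
  -DKokkosKernels_ENABLE_TPL_BLAS=ON \
  -DCMAKE_BUILD_TYPE=Release \
  /path/to/kokkos-kernels
cmake --build build -j
```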