
Batched gemms: available performance knobs

bandokihiro opened this issue Nov 05 '21 · 32 comments

In my application, I have many matrix-matrix multiplications to do. I have a view of N square left matrices (each with O(10-100) rows and columns) and a view of N right matrices. There is a one-to-one match between these. I am using the team batched gemm facility in kokkos-kernels, launching N Kokkos teams. I would like to understand what performance knobs I have available for the CUDA backend. Besides the data layout of the two views mentioned previously (which I am sure is important for performance), I have currently identified the following:

  • the second argument of the TeamPolicy constructor (currently set as Kokkos::Auto)
  • the arguments of the parallel for besides the lambda (I currently have Kokkos::Experimental::WorkItemProperty::HintLightWeight and I am asking the original writer of the code why)
  • I am currently using a per-team scratch buffer sized to hold the result of one matrix-matrix multiply, then copying it back with a TeamCopy to overwrite the view of the right matrices (this works since the left matrices are square). Is that needed?
  • the ArgAlgo template parameter for TeamGemm (currently set as Unblocked, I tried Blocked and the performance was significantly worse)

Is there anything else I should pay attention to? More generally, does this sound like a good use of TeamGemm? I haven't considered, for instance, using vector-level parallelism, or a one-Kokkos-thread-to-one-matrix-multiply mapping (unless Kokkos::Auto happens to give me 1; in fact, for some tests I was looking at, I get 96 threads per team, which seems to be a lot if one assumes that a thread = a warp).

bandokihiro avatar Nov 05 '21 19:11 bandokihiro

One thing to consider with these tuning problems is the use of Kokkos Autotuning. If the TeamPolicy constructor uses Kokkos::AUTO for team size (or vector length), autotuning will allow you to use tools like Apollo (note: use the branch I linked if you want to try this out) to try different valid choices for team size or vector length and use the best one. You would need to compile Kokkos with -DKokkos_ENABLE_TUNING=ON, and run your app with --kokkos-tune-internals (or set the environment variable KOKKOS_TUNE_INTERNALS to 1), and load a tool like Apollo. Let me know if you want to go down that road, we have a lot of training material on the topic.
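As a rough sketch of those steps (the paths and the tool library name below are placeholders; Apollo is loaded the same way as any other Kokkos Tools library):

$ cmake /path/to/kokkos -DKokkos_ENABLE_TUNING=ON <other options...>
$ export KOKKOS_PROFILE_LIBRARY=/path/to/apollo-kokkos-tool.so   # load the tuning tool
$ ./my_app --kokkos-tune-internals                               # or: export KOKKOS_TUNE_INTERNALS=1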

DavidPoliakoff avatar Nov 08 '21 13:11 DavidPoliakoff

Yes, I would be interested in that. I am on 3.3.00 for both kokkos and kokkos-kernels; would this work? Does it also give detailed metrics (kind of like nvprof), or does it just try a bunch of different sizes and report wall time?

bandokihiro avatar Nov 08 '21 16:11 bandokihiro

I think 3.3 is the first release in which this is supported (edit: where "this" is team size tuning. Future releases add other kinds of tuning).

Basically what happens is that you (or Kokkos Kernels) probably have

while (no_heat_death_of_universe) {
  Kokkos::parallel_for("do_math",
      Kokkos::TeamPolicy<>(42, Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type &member) {
    // insert math here
  });
}

A tool will be loaded, and every time you launch that kernel, it will be asked "hey, what are the right team size and vector length?" Because you repeat the process many times, the tool will slowly learn what the best configuration will be. So it tries a bunch, and reports wall time, and then uses the optimal configuration.

Most tools now support reusing an optimal configuration, so if you run your application multiple times, it can reuse the results from previous runs.

DavidPoliakoff avatar Nov 08 '21 16:11 DavidPoliakoff

So the actual code I am writing must repeat the parallel-for? Also, I would be interested in the training material you mentioned earlier.

bandokihiro avatar Nov 08 '21 16:11 bandokihiro

Right, it must run the parallel-for multiple times. Our most recent training material is here under Autotuning, and we have some older stuff above, here

DavidPoliakoff avatar Nov 08 '21 16:11 DavidPoliakoff

Thanks

bandokihiro avatar Nov 08 '21 16:11 bandokihiro

Is it supposed to be a pain to build? I am running through these dependencies: Apollo > callpath > adept-utils > llnl-hires-times.

And adept-utils complains that I don't have Boost on the computer I am trying to install it on.

bandokihiro avatar Nov 08 '21 17:11 bandokihiro

OK, so I ran with Apollo, which output its .yaml files, and then ran a second time under nvprof with the .yaml files in the current directory. It ran with a vector length of 1 (Kokkos' default) and 1024 Kokkos threads, which resulted in block sizes of [1,1024,1]. This answers the question: given a problem size and a TeamPolicy instance using just two arguments for its constructor (namely the number of teams and Kokkos::Auto), what is the best team size? I am still trying to understand whether the TeamPolicy construct as well as the batched gemm usage are the best given my problem statement.

bandokihiro avatar Nov 08 '21 23:11 bandokihiro

@bandokihiro There is no one answer that fits every use case here. It is going to depend on what is inside the parallel-for. Is it only GEMM that is inside this loop? I guess not. What other ops do you do inside this loop?

For small problems, it might be the case that a RangePolicy with SerialGemm could be beneficial. The other factor that might play a role is whether you use any shared memory, which would automatically require a team policy.
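For illustration, a minimal sketch of the RangePolicy + SerialGemm alternative, assuming LayoutRight views A(N,m,m), B(N,m,k), C(N,m,k) and an execution space alias ExecutionSpaceType (the names, shapes, and the separate output view C here are assumptions for the sketch, not taken from the code above):

// One matrix multiply per thread: thread i computes C(i,:,:) = A(i,:,:) * B(i,:,:).
Kokkos::parallel_for(
    "batched_serial_gemm",
    Kokkos::RangePolicy<ExecutionSpaceType>(0, N),
    KOKKOS_LAMBDA(const int i) {
        auto svA = Kokkos::subview(A, i, Kokkos::ALL, Kokkos::ALL);
        auto svB = Kokkos::subview(B, i, Kokkos::ALL, Kokkos::ALL);
        auto svC = Kokkos::subview(C, i, Kokkos::ALL, Kokkos::ALL);
        KokkosBatched::SerialGemm<
            KokkosBatched::Trans::NoTranspose,
            KokkosBatched::Trans::NoTranspose,
            KokkosBatched::Algo::Gemm::Unblocked>::invoke(1.0, svA, svB, 0.0, svC);
    });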

srajama1 avatar Nov 09 '21 01:11 srajama1

So I was hoping for my use case to be precise enough so as to restrict the number of performance knobs. More precisely (see also the original post of this issue), I have N matrix-matrix multiplications to perform. I have a view of the N left square matrices and a view of the N right matrices. I can overwrite the right view. Each left matrix has O(100) rows and columns. N can be large (think of the number of elements in a CFD code that sits on one GPU). My current pattern is as follows:

  • launch N teams, each team works on a single matrix-matrix multiply
  • use a per-team scratch of the same size as a single right matrix to temporarily hold the result of one matrix-matrix multiplication
  • copy the result into the original right view using a TeamCopy

So I am not exploiting any explicit vector-level parallelism. Through auto-tuning, I learned a way to find what works best in place of Kokkos::Auto. Now I would like to list the things I should look into, besides the data layout of the big views, to target optimization avenues.

bandokihiro avatar Nov 09 '21 04:11 bandokihiro

@bandokihiro I would like to understand your use case more:

  • "view of the N left square matrices" means view with LayoutLeft? Kokkos::View<double***, Kokkos::LayoutLeft> A ( "A", m, m, N );
  • and "view of the N right square matrices" means LayoutRight? Kokkos::View<double***, Kokkos::LayoutRight> B ( "B", N, m, k );
  • and you want B=A(LayoutLeft)*B(LayoutRight)?

vqd8a avatar Nov 09 '21 15:11 vqd8a

No my views are imposed by other aspects of the code which makes it sub-optimal for the operation I am describing. I am aware of this and am trying to understand how far I can go in terms of performance while keeping the current layout.

Currently, my view of the N left square matrices is Kokkos::View<double***, Kokkos::LayoutRight> A("A", N, m, m) (this one is flexible). My view of the N right matrices is Kokkos::View<double***, Kokkos::LayoutRight> B("B", m, k, N) (this one is imposed by the rest of the code),

and I need to do B(:,:,i) = A(i,:,:) * B(:,:,i) for all i
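Spelled out as a serial reference loop, just to pin down the indexing (B_old denotes a copy of B taken before the update, since the product overwrites its own input; this is only the semantics, not how it should be computed on the GPU):

for (int i = 0; i < N; ++i)            // batch index is the last dimension of B
    for (int r = 0; r < m; ++r)
        for (int c = 0; c < k; ++c) {
            double acc = 0.0;
            for (int l = 0; l < m; ++l)
                acc += A(i, r, l) * B_old(l, c, i);
            B(r, c, i) = acc;          // B(:,:,i) = A(i,:,:) * B_old(:,:,i)
        }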

bandokihiro avatar Nov 09 '21 16:11 bandokihiro

@bandokihiro Oh, I misunderstood the words "left" and "right". Thanks for clarifying.

In our evaluations, we found that TeamGemm with the Unblocked algorithm gives the best performance on LayoutRight views with the batch size as the first dimension of the views. We want to preserve the locality of each matrix in the batch. In your case, since N is the last dimension of B, performance will drop compared to when A and B both have the same view style. Can you try TeamVectorGemm to see if it helps improve performance or not?

Moreover, per-team scratch memory to hold the entire matrix B(:,:,i) is used to do "in-place" gemms, which also limits the number of active teams on each GPU multiprocessor. TeamGemm does not support "in-place" operations (and if I remember correctly, our other batched gemm algorithms do not support in-place either. @e10harvey, correct me if I am wrong). Is it possible for you to use a temporary view C to hold A*B and copy C to B at the end? How slow is it?
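For reference, switching to TeamVectorGemm should be a drop-in change inside the team-level lambda; a sketch, assuming teamMember is the team handle and sub-views svA, svB, and an output view svC for one batch entry have already been set up:

// Same arguments as TeamGemm, but the implementation also uses the team's vector lanes.
KokkosBatched::TeamVectorGemm<
    MemberType,
    KokkosBatched::Trans::NoTranspose,
    KokkosBatched::Trans::NoTranspose,
    KokkosBatched::Algo::Gemm::Unblocked>::invoke(teamMember, 1.0, svA, svB, 0.0, svC);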

vqd8a avatar Nov 09 '21 18:11 vqd8a

Is it possible for you to use a temporary view C to hold A*B and copy C to B at the end? How slow is it?

How different is that from what I am currently doing with a per-team scratch? Do you mean a view C of the same shape as B?

bandokihiro avatar Nov 09 '21 18:11 bandokihiro

Yes, use a view C of the same shape as B. It costs an extra deep_copy from C to B, but you can use our optimized host-level interface, which might give better performance: https://github.com/kokkos/kokkos-kernels/blob/a327b28711948c361b93e92b41c5e6e9363f9d3b/src/batched/dense/KokkosBatched_Gemm_Decl.hpp#L185-L230

vqd8a avatar Nov 09 '21 18:11 vqd8a

If I allocate space for a C, then I don't need to copy back to B, so that's not a problem.

bandokihiro avatar Nov 09 '21 18:11 bandokihiro

Just to be sure: do you have to use B to store the results or using another view C for results is okay in your code?

vqd8a avatar Nov 09 '21 18:11 vqd8a

Using another view is OK. I would like to avoid allocating it, but currently the performance of this kernel is pretty bad compared to the rest of the code, so if I can get more performance at the cost of higher memory usage, that is fine by me. I tested it and it is even slower than the scratch strategy.

bandokihiro avatar Nov 09 '21 18:11 bandokihiro

@bandokihiro Can you provide me an example code of your problem? I will look at it and see if I can improve anything.

vqd8a avatar Nov 09 '21 19:11 vqd8a

Here is my baseline version


// Headers needed by this snippet (Kokkos core, random fill, batched GEMM and copy kernels).
#include <Kokkos_Core.hpp>
#include <Kokkos_Random.hpp>
#include <KokkosBatched_Gemm_Decl.hpp>
#include <KokkosBatched_Copy_Decl.hpp>
#include <sys/time.h>

using DeviceType = Kokkos::Cuda;
using ExecutionSpaceType = DeviceType::execution_space;

using RangePolicyType = Kokkos::RangePolicy<
    ExecutionSpaceType,
    Kokkos::Schedule<Kokkos::Static> >;

using Kokkos1DView = Kokkos::View<
    double *,
    Kokkos::LayoutRight,
    DeviceType::memory_space>;
using Kokkos3DView = Kokkos::View<
    double ***,
    Kokkos::LayoutRight,
    DeviceType::memory_space>;
// unmanaged view type used to wrap the per-team scratch allocation
using ScratchView2 = Kokkos::View<
    double **,
    Kokkos::LayoutRight,
    DeviceType::memory_space,
    Kokkos::MemoryTraits<Kokkos::Unmanaged> >;

void team_policy_v1(
    const long long unsigned ne,
    const unsigned nb,
    const unsigned ns,
    const unsigned niter) {

    using TeamPolicyType = Kokkos::TeamPolicy<ExecutionSpaceType>;
    using MemberType = TeamPolicyType::member_type;

    // allocate
    Kokkos1DView iMM1D("IMM", ne * nb * nb); // ne left matrices
    Kokkos1DView res1D("res", nb * ns * ne); // ne right matrices
    // derived
    Kokkos3DView iMM3D(iMM1D.data(), ne, nb, nb);
    Kokkos3DView res3D(res1D.data(), nb, ns, ne);

    // random initialization
    uint64_t seed = Kokkos::Impl::clock_tic();
    Kokkos::Random_XorShift64_Pool<ExecutionSpaceType> rand_pool(seed);
    Kokkos::fill_random(iMM3D, rand_pool,
        Kokkos::rand<Kokkos::Random_XorShift64<ExecutionSpaceType>, double>::max());
    Kokkos::fill_random(res3D, rand_pool,
        Kokkos::rand<Kokkos::Random_XorShift64<ExecutionSpaceType>, double>::max());

    // warm-up, then sync before measuring time
    Kokkos::parallel_for(
        "warm_up_add_1",
        RangePolicyType(0, nb * ns * ne),
        KOKKOS_LAMBDA(const int i) {
        res1D(i) += 1.;
    });
    Kokkos::fence();

    struct timeval begin, end;
    gettimeofday(&begin, NULL);
    for (unsigned iiter = 0; iiter < niter; iiter++) {
        const int scratch_type = 0;
        int scratch_size = 0;
        scratch_size += ScratchView2::shmem_size(nb, ns);

        TeamPolicyType policy(ne, Kokkos::AUTO);
        policy.set_scratch_size(scratch_type, Kokkos::PerTeam(scratch_size));

        Kokkos::parallel_for(
            Kokkos::Experimental::require(
                policy,
                Kokkos::Experimental::WorkItemProperty::HintLightWeight),
            // lambda functor below
            KOKKOS_LAMBDA(const MemberType &teamMember) {
                const unsigned ie = teamMember.league_rank();

                // create a view for the scratch space
                // as well as sub-view for the element of interest
                ScratchView2 vtemp(teamMember.team_scratch(scratch_type), nb, ns);
                auto loc_res = Kokkos::subview(res3D, Kokkos::ALL, Kokkos::ALL, ie);
                auto loc_iMM = Kokkos::subview(iMM3D, ie, Kokkos::ALL, Kokkos::ALL);

                // store the result of the multiplication by the inverse mass matrix in
                // the scratch space
                KokkosBatched::TeamGemm<
                    MemberType,
                    KokkosBatched::Trans::NoTranspose,
                    KokkosBatched::Trans::NoTranspose,
                    KokkosBatched::Algo::Gemm::Unblocked>::invoke(
                    teamMember, 1.0, loc_iMM, loc_res, 0.0, vtemp);

                // copy from scratch to device memory
                KokkosBatched::TeamCopy<
                    MemberType,
                    KokkosBatched::Trans::NoTranspose>::invoke(
                    teamMember, vtemp, loc_res);
            });
    } // end for niter
    Kokkos::fence();
    gettimeofday(&end, NULL);

    // user-defined helper (not shown) reporting the achieved GFLOP/s from the timings
    print_achieved_flops(ne, nb, ns, niter, begin, end);
}

bandokihiro avatar Nov 09 '21 21:11 bandokihiro

@bandokihiro Thanks for the code. Can you also tell me the values of ne, nb, and ns, and what your achieved FLOPS is?

vqd8a avatar Nov 09 '21 21:11 vqd8a

For the tests I have been running, ns=5, nb=(p+1)^3 for p=2,3,4 (i.e. nb=27, 64, 125), and ne=64000. The way to interpret this is: in a finite-element-like discretization, I have ns solution variables, nb DOFs per element, and ne elements on 1 GPU.

For the version I posted, I get a throughput of 100-120 GFLOPS depending on p.

bandokihiro avatar Nov 09 '21 21:11 bandokihiro

Thanks, @bandokihiro . I will look at this.

vqd8a avatar Nov 09 '21 22:11 vqd8a

@vqd8a some things I tried. You mentioned the use of TeamVectorGemm. I assume you referred to just changing TeamGemm into TeamVectorGemm. I looked at the examples in the wiki and I think I understand them. But just to make sure, the first one is the same as what I tried, namely TeamVectorGemm inside a TeamPolicy parallel-for. The second and third examples manually expose the parallelism within a single matrix-matrix multiply, either by putting this parallelism across teams in the inner-most loop or across vector lanes.

That is also something I was wondering in my initial post, namely whether you think, given my use case, I should manually manage parallelism within a matrix-matrix multiply, or whether I can let Kokkos do its thing. As a user, of course, it is easier for me to just call TeamGemm or TeamVectorGemm, but I am also open to manually fiddling with the innermost parallelism if it gets me more performance.

Also, I ran both cases through Apollo. For TeamGemm it gave me an optimal number of threads of 1024. Since the default constructor puts 1 vector lane per thread, in such case, I would have 32 warps working together to compute one gemm. For TeamVectorGemm, Apollo gave me an optimal number of 32 threads. Now I don't know if it plays with the vector size as well, but just setting 32 threads gave terrible results. Then I remembered that 1 vector lane was the default. Constructing the TeamPolicy object like TeamPolicyType policy(ne, 32, 32); gave better results, although still inferior to my initial baseline using TeamGemm (the throughput was halved).

bandokihiro avatar Nov 11 '21 05:11 bandokihiro

(quoting the original post above)

Hi @bandokihiro. Sorry for the delayed reply and any redundant information given below.

The only knob that is not listed above is the data access pattern. For A, B, and C of LayoutRight, ensure that the sub-views passed to TeamGemm are contiguous:

svA = subview(A, batchId, b, ALL, ALL);
svB = subview(B, batchId, d, ALL, ALL);
svC = subview(C, batchId, f, ALL, ALL);

For square matrices >= 100 in size, I recommend using TeamGemm with TeamPolicy configured with a vector len of 1 and a team size of 1024. You can try this out via:

$ cd /path/to/kokkos-kernels/build/perf_test/blas/blas3
$ ./KokkosBlas3_perf_test --team_size=1024 --test=batched_team --routines=gemm --loop_type=parallel --batch_size_last_dim=0 --matrix_size_start=100x100,100x100,100x100 --matrix_size_stop=256x256,256x256,256x256 --matrix_size_step=160 --batch_size=$((2048*8)) --warm_up_loop=10 --iter=20 --verify=0
Testing gemm...
SCALAR:d, LAYOUT:N6Kokkos11LayoutRightE, DEVICE:N6Kokkos4CudaE, SPACE:N6Kokkos9CudaSpaceE
algorithm,vector_type,transAtransB,alpha,beta,team_size,vector_len,loop_type,A_dims,B_dims,C_dims,warm_up_n,iter,total_time(s),average_time(s),FLOPS,GFLOP/average_time(s)
batched_team,SIMD,NN,1,1,1024,1,parallel,16384x100x100,16384x100x100,16384x100x100,10,20,0.878623,0.0439311,3.2768e+10,745.895

Beyond this, if you would like to try out our new BatchedGemm interface on the develop branch for better performance, please take a look at: https://github.com/kokkos/kokkos-kernels/blob/5b781a1d8362b264d38aa3a6c431b29540d8496f/src/batched/dense/KokkosBatched_Gemm_Decl.hpp#L259-L315
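As a rough sketch of how that interface is called (the handle construction and template arguments shown here are approximations; please check the linked declaration for the exact signature in your version):

// Hypothetical usage; BatchLayout::Left means the batch index is the leftmost view dimension.
KokkosBatched::BatchedGemmHandle handle;   // default algorithm selection
int rc = KokkosBatched::BatchedGemm<
    KokkosBatched::Trans::NoTranspose,
    KokkosBatched::Trans::NoTranspose,
    KokkosBatched::BatchLayout::Left>(&handle, 1.0, A, B, 0.0, C);  // A(N,m,m), B(N,m,k), C(N,m,k)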

e10harvey avatar Nov 15 '21 15:11 e10harvey

@bandokihiro I did some experiments, and here are my results on a V100 GPU in the attached Excel file: batchedgemm_test.xlsx. There are two sheets in the file (all of A, B, and C have LayoutRight):

  1. The result of A*B is stored in another view C. Two cases were considered here:
     1.1. Same data access pattern: A(BatchSize,M,M), B(BatchSize,M,N), C(BatchSize,M,N)
     1.2. Different data access pattern: A(BatchSize,M,M), B(M,N,BatchSize), C(M,N,BatchSize)
     Three interfaces were used in these tests: (i) TeamGemm with Kokkos::AUTO, (ii) TeamGemm with a team size of M*N, and (iii) our new BatchedGemm interface (BatchedSerialGemm, each thread computes an element of C), as @e10harvey mentioned above.
  2. The result of A*B is stored back in B (using scratch memory as you did). Two cases were considered here:
     2.1. Same data access pattern: A(BatchSize,M,M), B(BatchSize,M,N)
     2.2. Different data access pattern: A(BatchSize,M,M), B(M,N,BatchSize)
     Two interfaces were used in these tests: (i) TeamGemm with Kokkos::AUTO, (ii) TeamGemm with a team size of M*N.

Some main points were observed here:

  • Good performance can be achieved if the matrices have the same data access pattern, such as A(BatchSize,M,M), B(BatchSize,M,N), C(BatchSize,M,N) for LayoutRight
  • A separate view should be used for storing results
  • The new BatchedSerialGemm gives the best performance for the two bullets above
  • If different data access patterns are used, TeamGemm with a team size of M*N and scratch memory is recommended (see the sketch below)
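Concretely, for the baseline code posted earlier, that last recommendation amounts to replacing Kokkos::AUTO with the size of the output matrix when constructing the policy (a sketch; nb*ns has to stay within the hardware limit on team size, 1024 threads on current NVIDIA GPUs):

// One thread per element of the nb x ns output matrix, instead of letting Kokkos pick.
TeamPolicyType policy(ne, nb * ns);
policy.set_scratch_size(scratch_type, Kokkos::PerTeam(ScratchView2::shmem_size(nb, ns)));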

vqd8a avatar Nov 15 '21 18:11 vqd8a

Wow, thanks a lot @vqd8a and @e10harvey!

@e10harvey yes data layout is certainly important. The different layout for the view of right matrices comes from the optimization of other kernels in my code so I would like to keep it that way. Your suggestion of a team size of 1024 and 1 vector lane is indeed what was suggested to me by Apollo.

I have follow up questions on your conclusions @vqd8a:

  • "A separate view should be used for storing results" are you referring to the different-layout-case when you claim this?
  • Have you run the new BatchedSerialGemm for case 2 which uses scratch memory? Your third conclusion would suggest that but I don't see it in the Excel file.
  • Is the use of BatchedSerialGemm within a TeamThreadRange?
  • For case 2, how do you compute the GFLOPS? Do you take into account the copy from scratch into the original B as well in the timing? For me, I timed the entire kernel (including the copy) but used the flop count of only one matrix multiply.
  • How did you come up with M*N? Is each thread in charge of one output of the small gemm inside TeamGemm?
  • What is different in the new interface that can affect performance? Have the internals changed?

You also both didn't mention manual management of vector lanes, should this avenue be ruled out?

Thank you very much!

bandokihiro avatar Nov 16 '21 04:11 bandokihiro

You also both didn't mention manual management of vector lanes, should this avenue be ruled out?

For the TeamGemm interface, this can be ruled out. You may experiment with the TeamVectorGemm interface which would be affected by selecting the number of vector lanes.

e10harvey avatar Nov 16 '21 16:11 e10harvey

@bandokihiro

"A separate view should be used for storing results" are you referring to the different-layout-case when you claim this?

I meant both different-layout case and same-layout case.

Have you run the new BatchedSerialGemm for case 2 which uses scratch memory? Your third conclusion would suggest that but I don't see it in the Excel file. Is the use of BatchedSerialGemm within a TeamThreadRange?

BatchedSerialGemm does not support scratch memory because it calls SerialGemm internally. It is implemented such that a RangePolicy is applied over a range of batchsize*M*N, so each thread is assigned to an element of the output C matrix.
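Conceptually, the mapping looks like the sketch below (an illustration of the idea only, not the actual library implementation; alpha = 1 and beta = 0 are assumed, and A, B, C are LayoutRight with the batch index first):

// One thread per element of C: the flat index is decoded into (batch, row, column).
Kokkos::parallel_for(
    "one_thread_per_C_element",
    Kokkos::RangePolicy<ExecutionSpaceType>(0, batch_size * M * N),
    KOKKOS_LAMBDA(const int idx) {
        const int b = idx / (M * N);
        const int r = (idx % (M * N)) / N;
        const int c = idx % N;
        double acc = 0.0;
        for (int l = 0; l < M; ++l)
            acc += A(b, r, l) * B(b, l, c);
        C(b, r, c) = acc;
    });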

For case 2, how do you compute the GLOPS? Do you take into account the copy from scratch into the original B as well in the timing? For me, I took the entire kernel (including the copy) with the flop count of only 1 matrix multiply.

Yes, that is what I did.

How did you come up with M*N? Is each thread in charge of one output of the small gemm inside TeamGemm?

Each team is in charge of a single matrix in the batch and each thread per team is in charge of a single element in the output matrix. I want to match the number of threads per team with the TeamThreadRange's count. See the code below https://github.com/kokkos/kokkos-kernels/blob/57d1992f3724034033f21e85899fa6cb9b10b238/src/batched/dense/impl/KokkosBatched_Gemm_Team_Internal.hpp#L53-L54

What is different in the new interface that can affect performance? Have the internals changed?

See above for BatchedSerialGemm. In fact, the new BatchedGemm interface currently supports two underlying algorithms: BatchedSerialGemm and BatchedDblBufGemm, where the latter uses scratch memory and a double-buffering technique. BatchedDblBufGemm might give even better performance than BatchedSerialGemm, but it does not support arbitrary matrix sizes yet (@e10harvey, am I correct here?). We are working on it now and can give you an update when it is done. For now, we recommend using the BatchedSerialGemm algorithm.

You also both didn't mention manual management of vector lanes, should this avenue be ruled out?

You can try TeamVectorGemm with your use case. For our test cases in the past, TeamGemm performed better than TeamVectorGemm.

vqd8a avatar Nov 16 '21 18:11 vqd8a

This BatchedDblBufGemm might give even better performance than BatchedSerialGemm but it does not support arbitrary matrix sizes yet (@e10harvey, am I correct here?).

That's correct. Fixed-size batches are supported by BatchedGemm, BatchedDblBufGemm, and BatchedSerialGemm.

e10harvey avatar Nov 16 '21 20:11 e10harvey