
sampling using a single top_k_top_p kernel for all cases


Hi,

We want to be able to pass different top_k/top_p values for each element in the batch at runtime when sampling, and it would be great to have a single sampling kernel that covers all cases.

The main bottleneck for us is that the top_k_top_p sampling kernel doesn't support sampling when k = 0; in all other cases, we can use it:

  - k >= 1, p = 0: treated as k >= 1, p = 1; can use top_k_top_p sampling
  - k >= 1, p > 0: can use top_k_top_p sampling
  - k = 0, p > 0: not supported

Currently, we pass top_k/top_p on a per-element basis by looping over each element of the batch in parallelgpt.cc, creating and calling a new instance of either the top_k_top_p sampler (for k > 0, p in [0, 1]) or the top_p sampler (for k = 0, p in [0, 1]) inside that loop. This is unfortunately really slow: we see slowdowns of ~1.5-2x when testing on a small model with batch sizes in the 16-64 range and sampling a couple hundred tokens.
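For concreteness, the dispatch looks roughly like the sketch below; the sampler classes and run() signature are illustrative stand-ins for what we do, not FT's actual API:

    // Hypothetical sketch of the per-element dispatch (illustrative names,
    // not FT's real classes): one sampler instance and one launch per row.
    for (int b = 0; b < batch_size; ++b) {
        if (top_k[b] > 0) {
            TopKTopPSampler sampler(top_k[b], top_p[b]);   // k > 0, p in [0, 1]
            sampler.run(logits + b * vocab_size, &out_ids[b], stream);
        } else {
            TopPSampler sampler(top_p[b]);                 // k = 0, p in [0, 1]
            sampler.run(logits + b * vocab_size, &out_ids[b], stream);
        }
    }

Constructing a sampler per row and issuing many tiny kernel launches is where we believe the ~1.5-2x slowdown comes from.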

There's another behavior of the top_k_top_p kernel that concerns us: we tried raising the top_k limit in that kernel to a much larger value of 10,000 and found that it runs incredibly slowly.

bilal2vec avatar Apr 12 '22 16:04 bilal2vec

  1. It is hard to use a single kernel to support all cases of top_k and top_p because their requirements and computations are different. That's why we split them into three kernels.
  2. FT doesn't currently have a way to compute different top_k/top_p values within a batch using a single kernel.
  3. FT only focuses on optimizing k <= 128 because we don't see real use cases with k > 128.

byshiue avatar Apr 12 '22 23:04 byshiue

Thanks for the quick reply!!

For 2, do you have any idea if/when support for that will be available? Are there any pointers you can give us if we were to implement it on our own?

bilal2vec avatar Apr 12 '22 23:04 bilal2vec

These are the sampling steps we would like to implement:


    sort logits (descending) -> sorted_logits, sorted_logits_id

    if k = 0, p > 0
    {
        softmax sorted_logits
        sample from distribution -> inverse transform sampling, u ~ U[0, p]
        return id
    }
    if k = 1, p = 0
    {
        return max id
    }
    if k > 1, p = 0
    {
        sorted_logits[i] = -6e4, if i > k
        softmax the new logits
        sample from distribution -> inverse transform sampling, u ~ U[0, 1]
        return id
    }
    if k > 1, p > 0
    {
        sorted_logits[i] = -6e4, if i > k
        softmax the new logits
        sample from distribution -> inverse transform sampling, u ~ U[0, p]
        return id
    }

This would ideally be just one kernel, so we can batch dynamic top_k/top_p together. We're fairly new to CUDA and wondering if this is a good approach to try ourselves; any pointers would help.
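To make the idea concrete, here is a naive single-kernel sketch of the steps above, assuming each row's logits are already sorted in descending order and that per-row top_k/top_p arrays and pre-seeded curand states are available (all names and signatures below are ours, not FT's). It uses one worker thread per batch row for clarity, so a real version would need a parallel softmax/scan:

    #include <curand_kernel.h>

    // One block per batch row; thread 0 does all the work (clarity over speed).
    // sorted_logits / sorted_ids: [batch, vocab], each row sorted descending.
    // top_k[b] == 0 means top-k disabled; top_p[b] == 0 means top-p disabled.
    __global__ void unified_sampling(const float* sorted_logits,
                                     const int*   sorted_ids,
                                     const int*   top_k,
                                     const float* top_p,
                                     curandState* states,  // pre-seeded via curand_init
                                     int*         out_ids,
                                     int          vocab_size)
    {
        const int b = blockIdx.x;
        if (threadIdx.x != 0) return;

        const float* logits = sorted_logits + (size_t)b * vocab_size;
        const int*   ids    = sorted_ids    + (size_t)b * vocab_size;
        const int    k      = top_k[b];
        const float  p      = top_p[b];

        // k = 1, p = 0: greedy; rows are sorted, so the max id is first.
        if (k == 1 && p == 0.0f) { out_ids[b] = ids[0]; return; }

        // The four cases collapse into a cutoff length and a mass threshold:
        // keep the first `limit` tokens, then draw u ~ U[0, threshold].
        const int   limit     = (k > 0) ? k : vocab_size;
        const float threshold = (p > 0.0f) ? p : 1.0f;

        // Softmax normalizer over the kept tokens (logits[0] is the max).
        float sum = 0.0f;
        for (int i = 0; i < limit; ++i) sum += __expf(logits[i] - logits[0]);

        // Inverse transform sampling: return the first token whose cumulative
        // renormalized probability reaches u.
        const float u = curand_uniform(&states[b]) * threshold;
        float cum = 0.0f;
        int chosen = ids[limit - 1];  // fallback against rounding error
        for (int i = 0; i < limit; ++i) {
            cum += __expf(logits[i] - logits[0]) / sum;
            if (cum >= u) { chosen = ids[i]; break; }
        }
        out_ids[b] = chosen;
    }

Folding k and p into a per-row (limit, threshold) pair is what lets all four cases share one code path, which is the property we need for batches that mix top_k and top_p.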

Also, we are looking at https://docs.nvidia.com/cuda/thrust/index.html#sorting to help with the sorting step, along the lines of the helper sketched below. Is this useful, and how easy would it be to use these libraries in FT?
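Something like this minimal sort_by_key helper (our own wrapper, not FT code) is what we had in mind for the per-row sort:

    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <thrust/sequence.h>
    #include <thrust/sort.h>

    // Sort one row of logits in descending order, reordering token ids to match.
    void sort_row_descending(thrust::device_vector<float>& logits,
                             thrust::device_vector<int>&   ids)
    {
        thrust::sequence(ids.begin(), ids.end());          // ids = 0, 1, ..., V-1
        thrust::sort_by_key(logits.begin(), logits.end(),  // keys: logits, descending
                            ids.begin(),                   // values reordered alongside
                            thrust::greater<float>());
    }

Presumably one call per row would reintroduce the launch overhead mentioned above, so a segmented sort (e.g. cub::DeviceSegmentedRadixSort; CUB ships with the CUDA Toolkit, as does Thrust) might be the better fit for whole batches.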

bharatv007 avatar Apr 13 '22 14:04 bharatv007

You can study the top_k and top_p sampling kernels. Their algorithms are really different, and the grid and block size settings may be different too, so it is hard to implement them in one kernel.

byshiue avatar Apr 14 '22 01:04 byshiue

We were able to solve it; this issue can be closed. https://github.com/NVIDIA/FasterTransformer/issues/234#issuecomment-1195532887

bharatv007 avatar Jul 26 '22 14:07 bharatv007