FasterTransformer
sampling using a single top_k_top_p kernel for all cases
Hi,
We want to be able to pass different values of top_k/top_p for each element in the batch at runtime for sampling, and it would be great if we could have a single kernel that covers all cases.
The main bottleneck for us is that the top_k_top_p sampling kernel doesn't support sampling when k = 0; in all the other cases we can use the top_k_top_p kernel:
- k >= 1, p = 0 (treated as k >= 1, p = 1): can use top_k_top_p sampling
- k >= 1, p > 0: can use top_k_top_p sampling
- k = 0, p > 0: not supported
Currently, we're passing top_k/top_p on a per-element basis by looping over each element of the batch in ParallelGpt.cc and, inside that loop, creating and calling a new instance of either the top_k_top_p sampler (for k > 0, p in [0, 1]) or the top_p sampler (for k = 0, p in [0, 1]). This is unfortunately really slow: we're seeing slowdowns of ~1.5-2x when testing on a small model with batch sizes in the 16-64 range and sampling a couple hundred tokens.
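For concreteness, the dispatch we have today looks roughly like the sketch below. The sampler class names and the forward() signature are simplified stand-ins for what we actually call, not FT's exact API; the point is that every batch element gets its own sampler construction and kernel launches.

```cpp
// Hypothetical per-element dispatch (names are illustrative, not FT's API).
// Each iteration constructs a sampler and launches its kernels for a single
// batch element, so launch overhead scales linearly with batch size.
for (int i = 0; i < batch_size; ++i) {
    if (runtime_top_k[i] > 0) {
        // k > 0, p in [0, 1]: the fused top_k_top_p path works.
        TopKTopPSampler sampler(runtime_top_k[i], runtime_top_p[i]);
        sampler.forward(logits + i * vocab_size, output_ids + i, stream);
    } else {
        // k == 0: fall back to the pure top_p sampler.
        TopPSampler sampler(runtime_top_p[i]);
        sampler.forward(logits + i * vocab_size, output_ids + i, stream);
    }
}
```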
There's another behavior of the top_k_top_p kernel that we're concerned about: we tried raising the kernel's top_k limit to a much larger value of 10,000 and found that it runs incredibly slowly.
- It is hard to use a single kernel to support all cases of top_k and top_p because their requirements and computations are different. That's why we split them into three kernels.
- FT doesn't currently have a way to compute different top_k/top_p within a batch using a single kernel.
- FT only focuses on optimizing k <= 128 because we don't see real use cases with k > 128.
Thanks for the quick reply!
For point 2, do you have any idea if/when support for that will be available? Are there any pointers you can give us if we were to implement it on our own?
These are the sampling steps we would like to implement:
sort logits (descending) -> sorted_logits, sorted_logits_id
if k = 0, p > 0
{
    softmax sorted_logits
    sample from distribution -> inverse transform sampling, u ~ U[0, p]
    return id
}
if k = 1, p = 0
{
    return id of max logit
}
if k > 1, p = 0 (treat as p = 1)
{
    sorted_logits[i] = -6e4 if i > k
    softmax the new logits
    sample from distribution -> inverse transform sampling, u ~ U[0, 1]
    return id
}
if k > 1, p > 0
{
    sorted_logits[i] = -6e4 if i > k
    softmax the new logits
    sample from distribution -> inverse transform sampling, u ~ U[0, p]
    return id
}
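Here is a minimal sketch of what those branches could look like fused into one kernel. It assumes the logits have already been sorted in descending order per row (e.g. by a segmented sort, see below), launches one block per batch row, and for clarity lets thread 0 do serial scans; a real kernel would replace those loops with block-wide reductions/scans. All names here are ours, not FT's.

```cuda
#include <cuda_runtime.h>

// Unified per-row sampling over pre-sorted logits. Truncating the loops to
// the first n entries plays the same role as the -6e4 mask above.
__global__ void unified_sampling_kernel(const float* sorted_logits, // [batch, vocab], descending
                                        const int*   sorted_ids,    // [batch, vocab]
                                        const int*   top_k,         // [batch], 0 = no top-k
                                        const float* top_p,         // [batch], 0 treated as 1
                                        const float* uniform_rand,  // [batch], u ~ U[0, 1]
                                        int*         output_ids,    // [batch]
                                        int          vocab_size)
{
    if (threadIdx.x != 0) { return; }
    const int    row    = blockIdx.x;
    const float* logits = sorted_logits + row * vocab_size;
    const int*   ids    = sorted_ids + row * vocab_size;

    const int k = top_k[row];
    // k = 1, p = 0: greedy; the max logit sits at position 0 after sorting.
    if (k == 1 && top_p[row] == 0.f) {
        output_ids[row] = ids[0];
        return;
    }

    const float p = (top_p[row] > 0.f) ? top_p[row] : 1.f; // p = 0 means p = 1
    const int   n = (k > 0) ? min(k, vocab_size) : vocab_size;

    // Softmax normalizer over the n candidates (logits[0] is the max).
    float sum = 0.f;
    for (int i = 0; i < n; ++i) { sum += __expf(logits[i] - logits[0]); }

    // Inverse transform sampling: draw u ~ U[0, p] and return the first id
    // whose cumulative probability reaches u.
    const float u   = uniform_rand[row] * p;
    float       cum = 0.f;
    for (int i = 0; i < n; ++i) {
        cum += __expf(logits[i] - logits[0]) / sum;
        if (cum >= u) {
            output_ids[row] = ids[i];
            return;
        }
    }
    output_ids[row] = ids[n - 1]; // fallback for floating-point round-off
}
```

Drawing u from U[0, p] instead of filtering the distribution is what lets the same loop serve the pure top-p, pure top-k, and combined cases.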
This would ideally be just one kernel, so we can batch dynamic top_k/top_p together (see the sketch above). We're fairly new to CUDA and wondering whether this is a good approach we can try ourselves. Any pointers would help.
Also, we are looking at https://docs.nvidia.com/cuda/thrust/index.html#sorting to help with sorting. Is this useful, and how easy would it be to use these libraries in FT?
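For the sort step, here's roughly what we had in mind using CUB's segmented radix sort rather than raw Thrust; CUB ships with the CUDA toolkit, so it should be straightforward to use inside FT. The function name and buffer layout below are our own sketch.

```cpp
#include <cub/cub.cuh>

// Sketch: sort each row's logits (and their vocabulary ids) in descending
// order. Segment i covers [i * vocab_size, (i + 1) * vocab_size), which
// begin_offsets/end_offsets must encode on the device.
void sort_logits_descending(const float* logits_in, float* logits_out,
                            const int* ids_in, int* ids_out,
                            const int* begin_offsets, const int* end_offsets,
                            int batch_size, int vocab_size, cudaStream_t stream)
{
    void*  temp_storage = nullptr;
    size_t temp_bytes   = 0;

    // First call with a null workspace only computes the scratch size needed.
    cub::DeviceSegmentedRadixSort::SortPairsDescending(
        temp_storage, temp_bytes, logits_in, logits_out, ids_in, ids_out,
        batch_size * vocab_size, batch_size, begin_offsets, end_offsets,
        0, sizeof(float) * 8, stream);

    cudaMalloc(&temp_storage, temp_bytes);
    cub::DeviceSegmentedRadixSort::SortPairsDescending(
        temp_storage, temp_bytes, logits_in, logits_out, ids_in, ids_out,
        batch_size * vocab_size, batch_size, begin_offsets, end_offsets,
        0, sizeof(float) * 8, stream);
    cudaFree(temp_storage);
}
```

A real implementation would allocate the temp workspace once and reuse it across decoding steps rather than calling cudaMalloc/cudaFree every time.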
You can study the top_k and top_p sampling kernels. Their algorithms are really different, and the grid and block size settings may be different too, so it is hard to implement them in one kernel.
We were able to solve it; this issue can be closed. https://github.com/NVIDIA/FasterTransformer/issues/234#issuecomment-1195532887