Wenxuan Tan
`RuntimeError: Stop_waiting response is expected` indicates that the problem is on torch's end. Please ensure your environment is properly set up (PyTorch version, CUDA) and re-run.
Thanks for contributing! To add a new model, we will also need unit tests. Please reference the existing tests and feel free to ping other team members.
A typical embarrassment in the ML community: putting up big words and not delivering on them 😅
You should use `BUILD_EXT=1 pip install .` and see if that compiles.
Any interest in vAttention? https://github.com/vllm-project/vllm/issues/4675
Yeah I think this needs some more adaptive tuning. Triton seems to give reasonable performance with brute-force searching/hardcoding, so folks don't quite care?
It seems the vocab size (32k or 64k at most) will be much larger than any number of threads you have, so it's better to set `num_warps` to the max available on your...
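A rough back-of-envelope sketch of the point above: even at the largest `num_warps` Triton allows, a 32k-64k vocab leaves every thread looping over many elements, so maxing out the warp count just shortens that per-thread loop. This is an illustrative calculation with assumed numbers, not a real kernel launch.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def elems_per_thread(vocab_size, num_warps):
    # How many vocab entries each thread must process when one thread
    # block reduces over the whole vocab axis (ceiling division).
    threads = num_warps * WARP_SIZE
    return -(-vocab_size // threads)

# Even at num_warps=32 (1024 threads), each thread still handles 32
# entries for a 32k vocab, so raising num_warps directly cuts the loop.
print(elems_per_thread(32768, 4))   # 256
print(elems_per_thread(32768, 32))  # 32
```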
cc @duanjunwen @TongLi3701
@yzh119 ~~If my understanding is correct, we should almost always use tensor core for decode as it can only make compute faster? Thanks.~~ Actually, only faster when gqa group size...
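To make the GQA point concrete: with group size g, g query heads share one KV head, so decode attention becomes a small (g x d) GEMM against the KV cache instead of g independent GEMVs, which is what lets tensor cores pay off. A back-of-envelope arithmetic-intensity sketch, with hypothetical head dim and sequence length (all numbers here are illustrative assumptions):

```python
def arithmetic_intensity(g, d=128, n=4096, bytes_per_elem=2):
    # FLOPs per byte for decode-attention scores q @ K^T, where
    # q is (g x d) and K is (n x d), stored in fp16/bf16.
    flops = 2 * g * d * n                                  # one small GEMM
    bytes_moved = (g * d + d * n + g * n) * bytes_per_elem # q + K + scores
    return flops / bytes_moved

# Group size 1 (plain MHA decode) is a pure GEMV and stays memory-bound;
# a larger group size amortizes the KV-cache read across query heads.
print(round(arithmetic_intensity(1), 1))  # ~1.0
print(round(arithmetic_intensity(8), 1))  # ~7.5
```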
When I tested with SGLang and `use_tensor_core=True`, the FA2 prefill template was actually slower than cuda core decode, perhaps due to a difference in the `Plan` function.