Wenxuan Tan

Results 164 comments of Wenxuan Tan

`RuntimeError: Stop_waiting response is expected` indicates that the problem is on PyTorch's end. Please make sure your environment (PyTorch version, CUDA) is set up properly and re-run.

Thanks for contributing! To add a new model, we will also need unit tests. Please reference the existing tests and feel free to ping other team members.

A typical embarrassment in the ML community: putting out big words but not delivering on them 😅

You should use `BUILD_EXT=1 pip install .` and see if that compiles.

Any interest in vAttention? https://github.com/vllm-project/vllm/issues/4675

Yeah, I think this needs some more adaptive tuning. Triton seems to give reasonable performance with brute-force searching/hardcoding, so folks don't quite care?

It seems the vocab size (32k or 64k at most) will be much larger than any number of threads you have, so it's better to set `num_warps` to the max available on your...
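A rough back-of-envelope sketch of the point above (illustrative numbers; `elements_per_thread` is a hypothetical helper, not anything in Triton): even at 32 warps (1024 threads per block), a 32k-vocab row still leaves each thread with many elements to process, so fewer warps only make the per-thread work worse.

```python
# Rough arithmetic for a vocab-sized kernel where one block covers one
# vocab row (illustrative only; actual limits depend on the GPU/kernel).
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def elements_per_thread(vocab_size: int, num_warps: int) -> int:
    """Elements each thread handles if one block covers one vocab row."""
    threads = num_warps * WARP_SIZE
    return -(-vocab_size // threads)  # ceiling division

# Even at num_warps=32 (1024 threads), a 32k vocab needs 32 elements
# per thread; dropping to 4 warps balloons that to 250.
print(elements_per_thread(32_000, num_warps=32))  # -> 32
print(elements_per_thread(32_000, num_warps=4))   # -> 250
```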

@yzh119 ~~If my understanding is correct, we should almost always use tensor cores for decode, since they can only make compute faster? Thanks.~~ Actually, they're only faster when the GQA group size...
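A back-of-envelope sketch of why the group size matters (a simplified model with assumed numbers: fp16 KV cache, only the `q @ K^T` step of decode): each cached key element is reused by every query head in the group, so arithmetic intensity grows linearly with group size, and only at larger groups is there enough compute per byte for tensor cores to pay off.

```python
# Simplified arithmetic-intensity model for GQA decode attention
# (q @ K^T for one new token against one KV head; fp16 = 2 bytes/elem).

def decode_arithmetic_intensity(group_size: int, head_dim: int = 128) -> float:
    """FLOPs per byte of KV traffic for the q @ K^T step.

    Each key element (head_dim per cached token, 2 bytes each) is reused
    by `group_size` query heads; each reuse costs 2 FLOPs (mul + add).
    """
    flops_per_token = 2 * head_dim * group_size
    bytes_per_token = 2 * head_dim
    return flops_per_token / bytes_per_token

# MHA (group size 1) is firmly memory-bound; larger GQA groups raise
# the compute per byte, which is where tensor cores can start to win.
for g in (1, 4, 8):
    print(g, decode_arithmetic_intensity(g))
```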

When I tested with SGLang and `use_tensor_core=True`, the FA2 prefill template was actually slower than CUDA-core decode, perhaps due to differences in the `Plan` function.