Wenxuan Tan
`RuntimeError: Stop_waiting response is expected` indicates that the problem is on torch's end. Please ensure your environment is properly set up (PyTorch version, CUDA) and re-run.
Thanks for contributing! To add a new model, we will also need unit tests. Please reference the existing tests and feel free to ping other team members.
A typical embarrassment in the ML community: putting up big words and not delivering on them 😅
You should use `BUILD_EXT=1 pip install .` and see if that compiles.
Any interest in vAttention? https://github.com/vllm-project/vllm/issues/4675
Yeah I think this needs some more adaptive tuning. Triton seems to give reasonable performance with brute-force searching/hardcoding, so folks don't quite care?
It seems the vocab size (32k or 64k at most) will be much larger than any number of threads you have, so it's better to set `num_warps` to the max available on your...
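A rough back-of-envelope sketch of the point above: even at the largest `num_warps` Triton allows, a 32k-64k vocab leaves every thread looping over many elements, so maxing out the warp count just shortens that per-thread loop. This is an illustrative calculation with assumed numbers, not a real kernel launch.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def elems_per_thread(vocab_size, num_warps):
    # How many vocab entries each thread must process when one thread
    # block reduces over the whole vocab axis (ceiling division).
    threads = num_warps * WARP_SIZE
    return -(-vocab_size // threads)

# Even at num_warps=32 (1024 threads), each thread still handles 32
# entries for a 32k vocab, so raising num_warps directly cuts the loop.
print(elems_per_thread(32768, 4))   # 256
print(elems_per_thread(32768, 32))  # 32
```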
cc @duanjunwen @TongLi3701
@yzh119 ~~If my understanding is correct, we should almost always use tensor core for decode as it can only make compute faster? Thanks.~~ Actually, only faster when gqa group size...
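To make the GQA point concrete: with group size g, g query heads share one KV head, so decode attention becomes a small (g x d) GEMM against the KV cache instead of g independent GEMVs, which is what lets tensor cores pay off. A back-of-envelope arithmetic-intensity sketch, with hypothetical head dim and sequence length (all numbers here are illustrative assumptions):

```python
def arithmetic_intensity(g, d=128, n=4096, bytes_per_elem=2):
    # FLOPs per byte for decode-attention scores q @ K^T, where
    # q is (g x d) and K is (n x d), stored in fp16/bf16.
    flops = 2 * g * d * n                                  # one small GEMM
    bytes_moved = (g * d + d * n + g * n) * bytes_per_elem # q + K + scores
    return flops / bytes_moved

# Group size 1 (plain MHA decode) is a pure GEMV and stays memory-bound;
# a larger group size amortizes the KV-cache read across query heads.
print(round(arithmetic_intensity(1), 1))  # ~1.0
print(round(arithmetic_intensity(8), 1))  # ~7.5
```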
When I tested with SGLang and `use_tensor_core=True`, the FA2 prefill template was actually slower than cuda core decode, perhaps due to a difference in the `Plan` function.