Bowen Wang
> @abmfy Hello, I'm encountering the following error when using multi-GPU parallel processing. Here's my startup command: python -m vllm.entrypoints.openai.api_server --model="/public/models/hf_models/DeepSeek-V2-Lite-Chat-FP8-A16" --trust-remote-code -tp 2 -dp 2 --port 8200 --enforce-eager --enable-eplb...
> do you need to update the dockerfile to install the new flashinfer wheel?

Let me give it a try.
Sure, I'll add some tests soon
> Hi @abmfy do you plan to have test updates soon? We can help make them if you don't have time right now

Hi, sorry for the delayed response. I’ve...
Hi @WoosukKwon, I’ve added a unit test to verify that FlashInfer’s sampling kernels match vLLM’s Python implementation. Currently, I’m comparing only the renormalized probabilities for top-k and top-p...
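For reference, here's a minimal sketch of that comparison, assuming the `flashinfer.sampling.top_k_renorm_probs` / `top_p_renorm_probs` entry points and a pure-PyTorch reference in place of vLLM's actual sampler code; the real test in this PR may use different conventions and tolerances:

```python
import torch
import flashinfer.sampling

def ref_top_k_renorm(probs: torch.Tensor, k: int) -> torch.Tensor:
    # Reference: zero out everything below the k-th largest probability,
    # then renormalize each row. Ties at the k-th value may keep extra
    # tokens; that simplification is fine for a sketch.
    kth = probs.topk(k, dim=-1).values[..., -1:]
    masked = torch.where(probs >= kth, probs, torch.zeros_like(probs))
    return masked / masked.sum(dim=-1, keepdim=True)

def ref_top_p_renorm(probs: torch.Tensor, p: float) -> torch.Tensor:
    # Reference: keep the smallest prefix of the descending-sorted
    # distribution whose cumulative mass reaches p (including the token
    # that crosses the threshold), then renormalize.
    sorted_probs, idx = probs.sort(dim=-1, descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    keep_sorted = ((cum - sorted_probs) < p).to(probs.dtype)
    keep = torch.zeros_like(probs).scatter(-1, idx, keep_sorted)
    masked = probs * keep
    return masked / masked.sum(dim=-1, keepdim=True)

probs = torch.softmax(torch.randn(16, 32000, device="cuda"), dim=-1)

# Compare renormalized probabilities, not sampled token IDs, so the check
# is deterministic and independent of the kernels' RNG.
torch.testing.assert_close(
    flashinfer.sampling.top_k_renorm_probs(probs, 50),
    ref_top_k_renorm(probs, 50), rtol=1e-3, atol=1e-3)
torch.testing.assert_close(
    flashinfer.sampling.top_p_renorm_probs(probs, 0.9),
    ref_top_p_renorm(probs, 0.9), rtol=1e-3, atol=1e-3)
```

Comparing renormalized probabilities rather than sampled tokens sidesteps RNG differences between the kernel and the reference; edge conventions (e.g., whether the token that crosses the top-p threshold is kept) still have to agree for the assertion to pass.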
Yes, this PR should apply to 0.2.5 too, but according to earlier comments in this thread, FlashInfer 0.2.5 has some known accuracy issues. These have been addressed in the main...
```bash
VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --trust_remote_code \
  --tasks gsm8k --num_fewshot 5 --batch_size auto
```

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7885|±  |0.0112|

...
Merged `main`. Currently failing due to #18086. I’ll merge main again once that PR is merged.
Some v0 sampler tests seem to be failing because FlashInfer kernels no longer accept pre-generated uniform samples. The v1 sampler tests all passed. Will...
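For context, a rough before/after sketch of the FlashInfer change those v0 tests trip over; the older signature shown is an assumption based on the description above, not verified against a specific FlashInfer release:

```python
import torch
import flashinfer.sampling

probs = torch.softmax(torch.randn(8, 32000, device="cuda"), dim=-1)

# Before (assumed older signature): the caller supplied pre-generated
# uniform samples, so a test could fix them and reproduce exact tokens.
# u = torch.rand(probs.shape[0], device="cuda")
# token_ids = flashinfer.sampling.sampling_from_probs(probs, u)

# After: the kernel draws its own randomness internally, so tests built
# around injecting pre-generated samples lose that hook.
token_ids = flashinfer.sampling.sampling_from_probs(probs)
```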
> @abmfy Thank you, can you fix this PR? This is part of the release blocker now.

Sure, I'll fix the tests that seem to be related to this PR.