Bowen Wang
> @abmfy Hello, I'm encountering the following error when using multi-GPU parallel processing. Here's my startup command: python -m vllm.entrypoints.openai.api_server --model="/public/models/hf_models/DeepSeek-V2-Lite-Chat-FP8-A16" --trust-remote-code -tp 2 -dp 2 --port 8200 --enforce-eager --enable-eplb...
> do you need to update the dockerfile to install the new flashinfer wheel?

Let me give it a try.
Sure, I'll add some tests soon
> Hi @abmfy do you plan to have test updates soon? We can help make them if you don't have time right now

Hi, sorry for the delayed response. I’ve...
Hi @WoosukKwon, I’ve added a unit test to verify that FlashInfer’s sampling kernels match vLLM’s Python implementation. Currently, I’m comparing only the renormalized probabilities for top-k and top-p...
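For reference, here's a minimal sketch of that comparison, assuming the `flashinfer.sampling.top_k_renorm_probs` / `top_p_renorm_probs` entry points and a pure-PyTorch reference in place of vLLM's actual sampler code; the real test in this PR may use different conventions and tolerances:

```python
import torch
import flashinfer.sampling

def ref_top_k_renorm(probs: torch.Tensor, k: int) -> torch.Tensor:
    # Reference: zero out everything below the k-th largest probability,
    # then renormalize each row. Ties at the k-th value may keep extra
    # tokens; that simplification is fine for a sketch.
    kth = probs.topk(k, dim=-1).values[..., -1:]
    masked = torch.where(probs >= kth, probs, torch.zeros_like(probs))
    return masked / masked.sum(dim=-1, keepdim=True)

def ref_top_p_renorm(probs: torch.Tensor, p: float) -> torch.Tensor:
    # Reference: keep the smallest prefix of the descending-sorted
    # distribution whose cumulative mass reaches p (including the token
    # that crosses the threshold), then renormalize.
    sorted_probs, idx = probs.sort(dim=-1, descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    keep_sorted = ((cum - sorted_probs) < p).to(probs.dtype)
    keep = torch.zeros_like(probs).scatter(-1, idx, keep_sorted)
    masked = probs * keep
    return masked / masked.sum(dim=-1, keepdim=True)

probs = torch.softmax(torch.randn(16, 32000, device="cuda"), dim=-1)

# Compare renormalized probabilities, not sampled token IDs, so the check
# is deterministic and independent of the kernels' RNG.
torch.testing.assert_close(
    flashinfer.sampling.top_k_renorm_probs(probs, 50),
    ref_top_k_renorm(probs, 50), rtol=1e-3, atol=1e-3)
torch.testing.assert_close(
    flashinfer.sampling.top_p_renorm_probs(probs, 0.9),
    ref_top_p_renorm(probs, 0.9), rtol=1e-3, atol=1e-3)
```

Comparing renormalized probabilities rather than sampled tokens sidesteps RNG differences between the kernel and the reference; edge conventions (e.g., whether the token that crosses the top-p threshold is kept) still have to agree for the assertion to pass.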
Yes, this PR should apply to 0.2.5 too, but according to earlier comments in this thread, FlashInfer 0.2.5 has some known accuracy issues. These have been addressed in the main...
```bash
VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --trust_remote_code \
  --tasks gsm8k --num_fewshot 5 --batch_size auto
```

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7885|±  |0.0112|

...
Merged `main`. Currently failing due to #18086. I’ll merge main again once that PR is merged.
Some v0 sampler tests seem to be failing because FlashInfer kernels no longer accept pre-generated uniform samples. The v1 sampler tests all passed. Will...
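For context, a rough before/after sketch of the FlashInfer change those v0 tests trip over; the older signature shown is an assumption based on the description above, not verified against a specific FlashInfer release:

```python
import torch
import flashinfer.sampling

probs = torch.softmax(torch.randn(8, 32000, device="cuda"), dim=-1)

# Before (assumed older signature): the caller supplied pre-generated
# uniform samples, so a test could fix them and reproduce exact tokens.
# u = torch.rand(probs.shape[0], device="cuda")
# token_ids = flashinfer.sampling.sampling_from_probs(probs, u)

# After: the kernel draws its own randomness internally, so tests built
# around injecting pre-generated samples lose that hook.
token_ids = flashinfer.sampling.sampling_from_probs(probs)
```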
> @abmfy Thank you, can you fix this PR? This is part of the release blocker now.

Sure, I'll fix the tests that seem to be related to this PR.