Michael Goin

Results: 271 comments by Michael Goin

There was a small issue with the SamplerOutput import that I fixed in the latest commit. After that, the model looks to be performing as expected! ``` lm_eval --model vllm...
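As a sketch of the kind of import fix being described (the exact module paths here are assumptions, since `SamplerOutput` has moved between vLLM releases and the actual change is in the linked commit):

```
# Hypothetical sketch of a SamplerOutput import-path fix; the paths
# below are assumptions, not the exact change from the commit.

# Before (older location):
# from vllm.sequence import SamplerOutput

# After (newer location):
from vllm.model_executor.layers.sampler import SamplerOutput
```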

Is there a model uploaded to HF that I can reproduce with? I would assume this issue is specific to `group_size=32`, is this accurate? I would not be surprised if...
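For reference, a minimal sketch of what `group_size` controls in group-wise weight quantization; this is illustrative only, not the kernel in question:

```
import torch

def quantize_groupwise_int4(weight: torch.Tensor, group_size: int = 32):
    """Illustrative symmetric group-wise INT4 quantization.

    Each row is split into contiguous groups of `group_size` elements and
    each group gets its own scale. group_size=32 yields 4x more scales per
    row than the more common group_size=128, which some fused kernels only
    support for specific group sizes.
    """
    out_features, in_features = weight.shape
    assert in_features % group_size == 0, "in_features must be divisible by group_size"

    grouped = weight.reshape(out_features, in_features // group_size, group_size)
    scales = grouped.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0  # int4 range [-8, 7]
    q = torch.clamp(torch.round(grouped / scales), -8, 7).to(torch.int8)
    return q.reshape(out_features, in_features), scales.squeeze(-1)

# Example: quantize a toy weight with group_size=32.
q, s = quantize_groupwise_int4(torch.randn(16, 256), group_size=32)
```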

Hi @HaiShaw thanks for pushing up this chunk of work. Is there a reason you haven't tried enabling AMD explicitly through the existing "fp8" quantization backend with the current checkpoint...
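For context, enabling the existing "fp8" quantization backend from the Python API looks roughly like this; the model name below is a placeholder for illustration, not the checkpoint discussed in that PR:

```
from vllm import LLM

# Online FP8 quantization through the existing "fp8" backend.
# The model name is a placeholder, not the checkpoint from the PR.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")
outputs = llm.generate("The capital of France is")
print(outputs[0].outputs[0].text)
```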

Could you run an lm-eval to confirm accuracy before marking this ready? i.e.

```
pip install "lm-eval[api]==0.4.7"
lm_eval --model vllm --model_args pretrained=nvidia/DeepSeek-R1-FP4,tensor_parallel_size=8,max_model_len=2048,gpu_memory_utilization=0.99 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
```

I agree, it seems ready to merge. PTAL @simon-mo @DarkLight1337

Can we revive this? I would like to update flashinfer to the latest version now that we have it integrated with V1 as an attention backend.
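For reference, one way to exercise FlashInfer as the attention backend; the environment variables are the standard vLLM selectors, and whether V1 needs to be opted into explicitly depends on the vLLM version in use:

```
import os

# Select FlashInfer as the attention backend before vLLM is imported.
# Whether VLLM_USE_V1 is needed depends on the vLLM version.
os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
print(llm.generate("Hello")[0].outputs[0].text)
```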

Hi @abmfy, do you plan to have test updates soon? We can help make them if you don't have time right now.

@zhyncs The goal of making this W4A8 optimization "production-ready" is exactly why I also think it is a good idea to land the first step as simply having this only...