hibukipanim

26 comments of hibukipanim

Are logprobs outputs (and specifically prompt logprobs with `echo=True`) expected to work with the current V1 (0.7.0)? Checking here before opening an issue with a reproduction.
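
For reference, a minimal sketch of the kind of request I mean, assuming a local vLLM OpenAI-compatible server on port 8000 and a placeholder model name:

```python
# Minimal sketch: prompt logprobs via the legacy completions API with echo=True.
# Assumes a local vLLM OpenAI-compatible server; "my-model" is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="my-model",
    prompt="The capital of France is",
    max_tokens=1,
    echo=True,   # echo the prompt back in the completion
    logprobs=1,  # request logprobs, including for the echoed prompt tokens
)
print(completion.choices[0].logprobs)
```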

Please consider prioritizing dynamic / just-in-time 8-bit quantization like [EETQ](https://github.com/NetEase-FuXi/EETQ), which doesn't require an offline quantization step. For example, a current advantage of TGI is that you can load an original...

> * Have you tried fp8 marlin? Run with `--quantization fp8` and we will quantize the weights to fp8 in place. This will be faster and more accurate than `eetq`...
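
For context, a hedged sketch of the in-place fp8 path from the quote, using the offline `vllm.LLM` API (the counterpart of serving with `--quantization fp8`); the model name is just an example:

```python
# Weights are loaded in their original precision and quantized to fp8 in
# place, so no offline quantization step is required.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```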

I managed to reproduce the scheduler hangs (with different multimodal inputs) with Mistral-Small-3.1, also on vLLM 0.8.3, where it was likewise preceded by: ``` ValueError: Attempted to assign X + Y = Z multimodal...

> [@hibukipanim](https://github.com/hibukipanim) can you share your engine parameters please? I've tried using the suggested flag but still seeing the same error as TS

@pySilver please note that I didn't try to...

It would be great if partial chunked prefill support (https://github.com/vllm-project/vllm/pull/10235) in V1 is considered for the roadmap 🙏

An important feature from V0 which hasn't yet been implemented in V1 is [concurrent partial prefills](https://github.com/vllm-project/vllm/pull/10235). Please consider prioritizing it 🙏 The issue tracking it: https://github.com/vllm-project/vllm/issues/21674 Thanks!
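
For illustration, a hedged sketch of what the V0 configuration looks like, assuming the engine arguments introduced by that PR (treat the flag names as assumptions; they may differ between versions):

```python
# Sketch of enabling concurrent partial prefills in V0; flag names taken from
# PR #10235 and may differ across versions. "my-model" is a placeholder.
from vllm import LLM

llm = LLM(
    model="my-model",
    enable_chunked_prefill=True,
    # Allow up to 2 requests to run their chunked prefills concurrently,
    # with at most 1 of them being a long prompt.
    max_num_partial_prefills=2,
    max_long_partial_prefills=1,
)
```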

Updating that this issue (resulting in a similar stacktrace) still exists in v0.6.0, also when using the chat endpoint with `prompt_logprobs` (if the server is started with `--enable-prefix-caching=True`). I also tried enabling `VLLM_USE_FLASHINFER_SAMPLER=1`...
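
For a reproduction, a minimal sketch of the failing setup, assuming a local vLLM server started with `--enable-prefix-caching=True`; the model name is a placeholder:

```python
# Chat request asking for prompt logprobs via vLLM's extension parameter.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=8,
    extra_body={"prompt_logprobs": 1},  # vLLM-specific extension parameter
)
print(response.choices[0].message.content)
```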

Hi @drubinstein, thanks for the suggestion. I tested now with 0.6.3.post1, using the exact same snippet from the first message here, and it responds OK on the first request, but running the snippet...

In another related issue thread there was this suggestion, which sounds simple and which I really hope gets implemented: https://github.com/vllm-project/vllm/issues/8268#issuecomment-2611122210 The idea by @mgoin is basically to skip prefix-caching for requests...
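
To make the suggestion concrete, a purely hypothetical sketch (all names invented, not vLLM's actual code) of what skipping the prefix-cache lookup for such requests could look like:

```python
# Hypothetical sketch of @mgoin's idea: when a request asks for prompt
# logprobs, bypass the prefix cache so the full prompt is recomputed and
# logprobs exist for every prompt token.
def get_cached_blocks(request, prefix_cache):
    if request.sampling_params.prompt_logprobs is not None:
        # Cached prefix blocks would skip the forward pass over those prompt
        # tokens, leaving no logprobs to return for them, so bypass the
        # cache for this request only.
        return []
    return prefix_cache.lookup(request.prompt_token_ids)
```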