hibukipanim

26 comments of hibukipanim

Are logprobs outputs (and specifically prompt logprobs with `echo=True`) expected to work with the current V1 (0.7.0)? Checking here before opening an issue with a reproduction.
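
For reference, a minimal sketch of the kind of request I mean, assuming a local vLLM OpenAI-compatible server on port 8000 and a placeholder model name:

```python
# Minimal sketch: prompt logprobs via the legacy completions API with echo=True.
# Assumes a local vLLM OpenAI-compatible server; "my-model" is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="my-model",
    prompt="The capital of France is",
    max_tokens=1,
    echo=True,   # echo the prompt back in the completion
    logprobs=1,  # request logprobs, including for the echoed prompt tokens
)
print(completion.choices[0].logprobs)
```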

Please consider prioritizing dynamic / just-in-time 8-bit quantization like [EETQ](https://github.com/NetEase-FuXi/EETQ), which doesn't require an offline quantization step. For example, a current advantage of TGI is that you can load an original...

> * Have you tried fp8 marlin? Run with `--quantization fp8` and we will quantize the weights to fp8 in place. This will be faster and more accurate than `eetq`...
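
For context, a hedged sketch of the in-place fp8 path from the quote, using the offline `vllm.LLM` API (the counterpart of serving with `--quantization fp8`); the model name is just an example:

```python
# Weights are loaded in their original precision and quantized to fp8 in
# place, so no offline quantization step is required.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```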

I managed to reproduce the scheduler hangs (with different multimodal inputs) with Mistral-Small-3.1, also on vLLM 0.8.3, where it was likewise preceded by: ``` ValueError: Attempted to assign X + Y = Z multimodal...

> [@hibukipanim](https://github.com/hibukipanim) can you share your engine parameters please? I've tried using the suggested flag but still seeing the same error as TS

@pySilver please note that I didn't try to...

It would be great if partial chunked prefill support (https://github.com/vllm-project/vllm/pull/10235) in V1 is considered for the roadmap 🙏

An important feature from V0 which hasn't yet been implemented in V1 is [concurrent partial prefills](https://github.com/vllm-project/vllm/pull/10235). Please consider prioritizing it 🙏 The issue tracking it: https://github.com/vllm-project/vllm/issues/21674 Thanks!
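
For illustration, a hedged sketch of what the V0 configuration looks like, assuming the engine arguments introduced by that PR (treat the flag names as assumptions; they may differ between versions):

```python
# Sketch of enabling concurrent partial prefills in V0; flag names taken from
# PR #10235 and may differ across versions. "my-model" is a placeholder.
from vllm import LLM

llm = LLM(
    model="my-model",
    enable_chunked_prefill=True,
    # Allow up to 2 requests to run their chunked prefills concurrently,
    # with at most 1 of them being a long prompt.
    max_num_partial_prefills=2,
    max_long_partial_prefills=1,
)
```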

Updating that this issue (resulting in a similar stacktrace) still exists in v0.6.0, also when using the chat endpoint with `prompt_logprobs` (if the server is started with `--enable-prefix-caching=True`). I also tried enabling `VLLM_USE_FLASHINFER_SAMPLER=1`...
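
For a reproduction, a minimal sketch of the failing setup, assuming a local vLLM server started with `--enable-prefix-caching=True`; the model name is a placeholder:

```python
# Chat request asking for prompt logprobs via vLLM's extension parameter.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=8,
    extra_body={"prompt_logprobs": 1},  # vLLM-specific extension parameter
)
print(response.choices[0].message.content)
```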

Hi @drubinstein, thanks for the suggestion. I tested now with 0.6.3.post1, using the exact same snippet from the first message here, and it responds OK on the first request, but running the snippet...

In another related issue thread there was this suggestion, which sounds simple and which I really hope gets implemented: https://github.com/vllm-project/vllm/issues/8268#issuecomment-2611122210 The idea by @mgoin is basically to skip prefix-caching for requests...
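
To make the suggestion concrete, a purely hypothetical sketch (all names invented, not vLLM's actual code) of what skipping the prefix-cache lookup for such requests could look like:

```python
# Hypothetical sketch of @mgoin's idea: when a request asks for prompt
# logprobs, bypass the prefix cache so the full prompt is recomputed and
# logprobs exist for every prompt token.
def get_cached_blocks(request, prefix_cache):
    if request.sampling_params.prompt_logprobs is not None:
        # Cached prefix blocks would skip the forward pass over those prompt
        # tokens, leaving no logprobs to return for them, so bypass the
        # cache for this request only.
        return []
    return prefix_cache.lookup(request.prompt_token_ids)
```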