Roger Wang

Results: 132 comments by Roger Wang

> Hello @jdf-prog! Just to confirm, you were able to launch the server, but only this particular image ran into an issue, correct?

Yes, only this particular...

#8428 is not a real release blocker, but could help with main branch CI

> inter-token latency takes TTFT

@Jeffwan This is no longer the case and has been fixed by #7372. The reason why we use separate calculations for TPOT is that sometimes...

IMHO the way we are defining ITL here is not very useful and potentially confusing. I think we should report only TTFT and TPOT (in other cases ITL is...

> which is ITL is reported as smaller than TPOT,

@hyhuang00 yea that's indeed a good point. The only possibility I can think of for this is when the model...
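The TTFT/TPOT/ITL distinction discussed above can be sketched with a small, purely illustrative calculation (the function and variable names below are hypothetical, not vLLM's benchmark code): TPOT averages decode time per output token while excluding the first-token (prefill) latency, whereas ITL is the list of individual gaps between consecutive tokens. When the server streams several tokens per chunk, the observed gaps can differ from the per-token average, which is one way the two metrics diverge.

```python
# Hypothetical sketch of TTFT / TPOT / ITL from streamed token timestamps.
# Names and timings are illustrative only.

def compute_metrics(token_times: list[float], request_start: float):
    """token_times: wall-clock arrival time of each streamed output token."""
    ttft = token_times[0] - request_start
    total_latency = token_times[-1] - request_start
    n = len(token_times)
    # TPOT: average decode time per token, excluding the first (prefill) token.
    tpot = (total_latency - ttft) / (n - 1) if n > 1 else 0.0
    # ITL: the individual gaps between consecutive tokens.
    itls = [t1 - t0 for t0, t1 in zip(token_times, token_times[1:])]
    return ttft, tpot, itls

# Example: first token arrives at 0.5 s (prefill), then one token every 0.1 s.
ttft, tpot, itls = compute_metrics([0.5, 0.6, 0.7, 0.8, 0.9], 0.0)
```

In this toy run TPOT and every ITL sample agree at 0.1 s; real runs diverge when tokens arrive in bursts.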

Thank you all for trying out the Llama 3.2 vision model on vLLM! As you may already know, multimodal Llama 3.2 is quite different from the other LLaVA-style VLMs that we currently...

> I don't think any of the implementations currently have the cross attention projection caches?

But for inference, it looks like the outputs of the cross attention kv projections for...

> I am getting the following error:
>
> ```
> ERROR 09-28 19:27:59 async_llm_engine.py:61] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a,...
> ```

> **Throughput Results, this branch**
>
> sharegpt does not match, will look into this later.
>
> Dataset | Processed Prompts | Total Prompt Tokens | Total Tokens | Total Output Tokens | Requests/s...
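The throughput columns in the (truncated) table above can be derived from per-request results with a short, purely illustrative helper (the dataclass and field names are hypothetical, not vLLM's benchmark script):

```python
# Hypothetical sketch: aggregate per-request benchmark results into the
# summary columns shown in the table (names are illustrative only).
from dataclasses import dataclass

@dataclass
class RequestResult:
    prompt_tokens: int
    output_tokens: int

def summarize(results: list[RequestResult], elapsed_s: float) -> dict:
    total_prompt = sum(r.prompt_tokens for r in results)
    total_output = sum(r.output_tokens for r in results)
    return {
        "processed_prompts": len(results),
        "total_prompt_tokens": total_prompt,
        "total_output_tokens": total_output,
        "total_tokens": total_prompt + total_output,
        "requests_per_s": len(results) / elapsed_s,
    }

# Two toy requests completed in 10 s of wall-clock time.
stats = summarize([RequestResult(100, 50), RequestResult(200, 80)], 10.0)
```

Requests/s here is simply completed requests divided by total wall-clock time, which is why it is sensitive to dataset mismatches like the sharegpt discrepancy mentioned above.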