Roger Wang

Results: 132 comments by Roger Wang

> Hello @jdf-prog! Just to confirm, you were able to launch the server, but only this particular image ran into an issue, correct?

Yes, only this particular...

#8428 is not a real release blocker, but could help with main branch CI

> inter-token latency takes TTFT

@Jeffwan This is no longer the case and has been fixed by #7372. The reason why we use separate calculations for TPOT is that sometimes...

IMHO the way we are defining ITL here is not very useful and potentially confusing. I think we should report only TTFT and TPOT (in other cases ITL is...

> which is ITL is reported as smaller than TPOT,

@hyhuang00 yea that's indeed a good point. The only possibility I can think of for this is when the model...
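The TTFT/TPOT/ITL distinction discussed above can be sketched with a small, purely illustrative calculation (the function and variable names below are hypothetical, not vLLM's benchmark code): TPOT averages decode time per output token while excluding the first-token (prefill) latency, whereas ITL is the list of individual gaps between consecutive tokens. When the server streams several tokens per chunk, the observed gaps can differ from the per-token average, which is one way the two metrics diverge.

```python
# Hypothetical sketch of TTFT / TPOT / ITL from streamed token timestamps.
# Names and timings are illustrative only.

def compute_metrics(token_times: list[float], request_start: float):
    """token_times: wall-clock arrival time of each streamed output token."""
    ttft = token_times[0] - request_start
    total_latency = token_times[-1] - request_start
    n = len(token_times)
    # TPOT: average decode time per token, excluding the first (prefill) token.
    tpot = (total_latency - ttft) / (n - 1) if n > 1 else 0.0
    # ITL: the individual gaps between consecutive tokens.
    itls = [t1 - t0 for t0, t1 in zip(token_times, token_times[1:])]
    return ttft, tpot, itls

# Example: first token arrives at 0.5 s (prefill), then one token every 0.1 s.
ttft, tpot, itls = compute_metrics([0.5, 0.6, 0.7, 0.8, 0.9], 0.0)
```

In this toy run TPOT and every ITL sample agree at 0.1 s; real runs diverge when tokens arrive in bursts.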

Thank you all for trying out the Llama 3.2 vision model on vLLM! As you may already know, multimodal Llama 3.2 is quite different from the other LLaVA-style VLMs that we currently...

> I don't think any of the implementations currently have the cross attention projection caches?

But for inference, it looks like the outputs of the cross attention kv projections for...

> I am getting the following error:
>
> ```
> ERROR 09-28 19:27:59 async_llm_engine.py:61] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a,...
> ```

> **Throughput Results, this branch**
>
> sharegpt does not match, will look into this later.
>
> Dataset | Processed Prompts | Total Prompt Tokens | Total Tokens | Total Output Tokens | Requests/s...
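The throughput columns in the (truncated) table above can be derived from per-request results with a short, purely illustrative helper (the dataclass and field names are hypothetical, not vLLM's benchmark script):

```python
# Hypothetical sketch: aggregate per-request benchmark results into the
# summary columns shown in the table (names are illustrative only).
from dataclasses import dataclass

@dataclass
class RequestResult:
    prompt_tokens: int
    output_tokens: int

def summarize(results: list[RequestResult], elapsed_s: float) -> dict:
    total_prompt = sum(r.prompt_tokens for r in results)
    total_output = sum(r.output_tokens for r in results)
    return {
        "processed_prompts": len(results),
        "total_prompt_tokens": total_prompt,
        "total_output_tokens": total_output,
        "total_tokens": total_prompt + total_output,
        "requests_per_s": len(results) / elapsed_s,
    }

# Two toy requests completed in 10 s of wall-clock time.
stats = summarize([RequestResult(100, 50), RequestResult(200, 80)], 10.0)
```

Requests/s here is simply completed requests divided by total wall-clock time, which is why it is sensitive to dataset mismatches like the sharegpt discrepancy mentioned above.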