Cody Yu
> Similar code also exists in Neuron and XPU runners. @comaniac @njhill do we need to update them as well? I'm not maintaining these runners. cc @liangfu
@joonspk-research I have the same requirement and have changed the code in my fork. Would you like me to file a PR, or is this already an ongoing feature on your...
I actually posted this comment on another issue. I did encounter this problem, and my feeling is that this framework is tightly coupled with certain OpenAI models in terms of prompts and response...
FYI: I did a quick try on llama-2-7b, but it crashed, mostly because the model didn't generate a response format the framework accepts. Maybe llama-2-13b or 70b would work, but this is...
CI failure seems like a real bug:
```
[2024-08-09T04:52:05Z] File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 791, in begin_forward
[2024-08-09T04:52:05Z]     self._wrapper.begin_forward(
[2024-08-09T04:52:05Z] RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 1 vs 257...
```
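For reference, a minimal sketch of the invariant that check enforces, assuming the standard CSR-style `indptr` layout used for paged KV caches (the function and names below are illustrative, not flashinfer's actual internals):

```python
import torch

def check_paged_kv_indptr(paged_kv_indptr: torch.Tensor, batch_size: int) -> None:
    # CSR-style indptr: entries i and i+1 bound the page indices of request i,
    # so a batch of N requests needs an indptr of length N + 1.
    if paged_kv_indptr.size(0) != batch_size + 1:
        raise RuntimeError(
            f"CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. "
            f"{paged_kv_indptr.size(0)} vs {batch_size + 1}"
        )

# The failure above (1 vs 257) suggests a batch of 256 requests was paired
# with an indptr holding only a single element.
```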
> I am not an industry person, so I am not the best one to check whether the definition of TTFT < TTFT SLO and Average TPOT < TPOT SLO is...
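For reference, the commonly used definitions, as a minimal sketch with hypothetical names: TTFT is the latency to the first generated token, and average TPOT is the mean gap between subsequent tokens.

```python
def meets_slo(request_start: float, token_times: list[float],
              ttft_slo: float, tpot_slo: float) -> bool:
    # TTFT: latency from request submission to the first output token.
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft < ttft_slo  # single-token responses have no TPOT
    # Average TPOT: mean inter-token latency over the remaining tokens.
    avg_tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft < ttft_slo and avg_tpot < tpot_slo
```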
Thanks for the PR. Unfortunately, I don't think this is the strategy we want in vLLM core. Although we do have this issue, we are attempting to solve it...
Based on the command, I don't think multi-step scheduling is enabled, and AFAIK the async output processor is disabled when enforcing eager mode. The huge sampling time in the profile may...
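For context, a hedged sketch of the knobs involved (the model name is illustrative): multi-step scheduling has to be opted into explicitly, and eager mode is a separate switch.

```python
from vllm import LLM

# Sketch, assuming vLLM of that era: multi-step scheduling is opt-in
# (num_scheduler_steps > 1), and enforce_eager=True disables CUDA graphs,
# which in this setup also disables the async output processor.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    num_scheduler_steps=8,  # > 1 enables multi-step scheduling
    enforce_eager=True,     # eager mode; async output processing is off
)
```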
I see, so the 3 "sync" instances are actually 3 processes sending requests, each of them sequentially. A more common term for this use case is "concurrency"....
So your "sync" is not really "sync"...it's really confusing. Then what I can think of in summary is batch size 3 has lower throughput than batch size 1, because when...