Cody Yu
> Similar code also exists in Neuron and XPU runners. @comaniac @njhill do we need to update them as well? I'm not maintaining these runners. cc @liangfu
@joonspk-research I have the same requirement and have changed the code in my fork. Would you like me to file a PR, or is this already an ongoing feature on your...
I actually posted this comment on another issue. I did encounter this problem, and my feeling is that this framework is tightly coupled with certain OpenAI models in terms of prompts and response...
FYI: I did a quick try on llama-2-7b, but it crashed, mostly because the model didn't generate a response format the framework accepts. Maybe llama-2-13b or 70b would work, but this is...
CI failure seems like a real bug:
```
[2024-08-09T04:52:05Z] File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 791, in begin_forward
[2024-08-09T04:52:05Z]     self._wrapper.begin_forward(
[2024-08-09T04:52:05Z] RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 1 vs 257...
```
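For reference, a minimal sketch of the invariant that check enforces, assuming the standard CSR-style `indptr` layout used for paged KV caches (the function and names below are illustrative, not flashinfer's actual internals):

```python
import torch

def check_paged_kv_indptr(paged_kv_indptr: torch.Tensor, batch_size: int) -> None:
    # CSR-style indptr: entries i and i+1 bound the page indices of request i,
    # so a batch of N requests needs an indptr of length N + 1.
    if paged_kv_indptr.size(0) != batch_size + 1:
        raise RuntimeError(
            f"CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. "
            f"{paged_kv_indptr.size(0)} vs {batch_size + 1}"
        )

# The failure above (1 vs 257) suggests a batch of 256 requests was paired
# with an indptr holding only a single element.
```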
> I am not an industry person, so I am not the best one to check whether the definition of TTFT < TTFT SLO and Average TPOT < TPOT SLO is...
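For reference, the commonly used definitions, as a minimal sketch with hypothetical names: TTFT is the latency to the first generated token, and average TPOT is the mean gap between subsequent tokens.

```python
def meets_slo(request_start: float, token_times: list[float],
              ttft_slo: float, tpot_slo: float) -> bool:
    # TTFT: latency from request submission to the first output token.
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft < ttft_slo  # single-token responses have no TPOT
    # Average TPOT: mean inter-token latency over the remaining tokens.
    avg_tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft < ttft_slo and avg_tpot < tpot_slo
```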
Thanks for the PR. Unfortunately, I don't think this is the strategy we want in vLLM core. Although we do have this issue, we are attempting to solve it...
Based on the command, I don't think multi-step scheduling is enabled, and AFAIK the async output processor is disabled when enforcing eager mode. The huge sampling time in the profile may...
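For context, a hedged sketch of the knobs involved (the model name is illustrative): multi-step scheduling has to be opted into explicitly, and eager mode is a separate switch.

```python
from vllm import LLM

# Sketch, assuming vLLM of that era: multi-step scheduling is opt-in
# (num_scheduler_steps > 1), and enforce_eager=True disables CUDA graphs,
# which in this setup also disables the async output processor.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    num_scheduler_steps=8,  # > 1 enables multi-step scheduling
    enforce_eager=True,     # eager mode; async output processing is off
)
```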
I see, so the 3 "sync" instances are actually 3 processes sending requests, each of them sequentially. A more common term for this use case is "concurrency"....
So your "sync" is not really "sync"...it's really confusing. Then what I can think of in summary is batch size 3 has lower throughput than batch size 1, because when...