Cody Yu
> I'm a bit confused about how chunked prefill works as a single step - in the PR description it's mentioned that a token_chunk_size chunk of the prompt is scheduled as one step....
> QQ: what's the definition of `num_computed_tokens`? For example, given a prompt `[1,2,3,4,5]`, after the prefill phase (after `process_output`) one new token is generated and we get `[1,2,3,4,5,6]` > > Before...
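To make the bookkeeping being discussed concrete, here is a minimal sketch of chunked prefill, assuming only a per-step `token_chunk_size` budget and a `num_computed_tokens` counter for how many prompt tokens have been processed so far (an illustration only, not vLLM's actual scheduler code):

```python
def chunked_prefill_steps(prompt_len: int, token_chunk_size: int):
    """Yield (num_computed_tokens_before_step, tokens_processed_this_step)."""
    num_computed_tokens = 0
    while num_computed_tokens < prompt_len:
        chunk = min(token_chunk_size, prompt_len - num_computed_tokens)
        yield num_computed_tokens, chunk
        num_computed_tokens += chunk
    # Once the whole prompt has been computed, the first output token can be
    # sampled; later decode steps then process one new token per step.

# Example: prompt [1, 2, 3, 4, 5] with token_chunk_size=2 runs prefill over
# three steps, processing 2, 2, and 1 tokens; only after the last step is the
# first generated token (6) appended.
for before, n in chunked_prefill_steps(prompt_len=5, token_chunk_size=2):
    print(f"num_computed_tokens={before}, tokens this step={n}")
```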
Can you fix the tests to match the desired behavior instead of removing them?
Also cc @ruisearch42
Since the attention computation is still in FP16, could you benchmark with the original BF16 data type and see if there's still a gap? This could help locate the problem...
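A rough way to run the dtype comparison suggested here, as a sketch only: the model name and workload below are placeholders, and vLLM's `dtype` argument switches the overall model dtype rather than targeting the attention kernel specifically.

```python
import time

from vllm import LLM, SamplingParams

prompts = ["Hello, my name is"] * 32                      # placeholder workload
params = SamplingParams(temperature=0.0, max_tokens=128)

def bench(dtype: str) -> float:
    # Placeholder model; in practice run each dtype in a fresh process so the
    # GPU memory held by the previous engine is fully released.
    llm = LLM(model="meta-llama/Llama-2-7b-hf", dtype=dtype)
    start = time.perf_counter()
    llm.generate(prompts, params)
    return time.perf_counter() - start

for dtype in ("bfloat16", "float16"):
    print(dtype, f"{bench(dtype):.2f}s")
```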
Closed via #9497
> I'm surely late here, but why is a speculative decoding-aware scheduler needed? Wouldn't it be possible to just assume multi-token generation per step as the default? Because the scheduler has to know...
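To make the scheduling concern concrete, a small illustration (not vLLM's actual scheduler): with speculative decoding, each running sequence can consume `1 + num_speculative_tokens` tokens in a single step, so a scheduler that assumes one token per sequence per step would over-admit sequences against its per-step token budget.

```python
def max_schedulable_seqs(token_budget: int, num_speculative_tokens: int) -> int:
    """How many running sequences fit in one step's token budget."""
    tokens_per_seq_per_step = 1 + num_speculative_tokens  # bonus + draft tokens
    return token_budget // tokens_per_seq_per_step

# Example: with a 256-token step budget, 256 sequences fit without speculation,
# but only 51 fit when each sequence verifies 4 speculative tokens per step.
print(max_schedulable_seqs(256, 0), max_schedulable_seqs(256, 4))
```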
A problem with making a perfect PR is that it takes longer and the work cannot be split among others. I actually don't think the tech debt in v0 came from...
> Can we actually remove this parameter and let each hardware or attention backend choose its own? @liangfu would making such a change sound good to you?...
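As a generic illustration of the suggestion (the parameter itself is not named in the excerpt, so `block_size` below is purely a hypothetical stand-in): leaving the value unset and letting the selected attention backend supply its own default could look roughly like this.

```python
from typing import Optional

class AttentionBackendStub:
    """Hypothetical backend interface; each backend reports its preference."""

    preferred_block_size: int = 16  # e.g. one backend might prefer 16

def resolve_block_size(user_value: Optional[int],
                       backend: AttentionBackendStub) -> int:
    # If the engine config leaves the value unset, defer to the backend.
    return user_value if user_value is not None else backend.preferred_block_size

print(resolve_block_size(None, AttentionBackendStub()))  # -> 16
print(resolve_block_size(32, AttentionBackendStub()))    # -> 32
```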
@DarkLight1337 somehow this PR fails some speculative decoding tests. For example:

```
pytest spec_decode/e2e/test_mlp_correctness.py -k "test_mqa_scorer[1-32-5-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0]"
```

Before this PR was merged (7dbe738d653b563c646883c1ae6f6df927436d01 in main branch):

```
1 passed, 20 deselected,...
```