Cody Yu
> I'm a bit confused about how chunked prefill works as a single step - in the PR description it's mentioned that a token_chunk_size chunk of the prompt is scheduled as one step....
> QQ: what's the definition of `num_computed_tokens`? For example, given a prompt `[1,2,3,4,5]`, after the prefill phase (after `process_output`) one new token is generated and we get `[1,2,3,4,5,6]` > > Before...
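To make the bookkeeping being discussed concrete, here is a minimal sketch of chunked prefill, assuming only a per-step `token_chunk_size` budget and a `num_computed_tokens` counter for how many prompt tokens have been processed so far (an illustration only, not vLLM's actual scheduler code):

```python
def chunked_prefill_steps(prompt_len: int, token_chunk_size: int):
    """Yield (num_computed_tokens_before_step, tokens_processed_this_step)."""
    num_computed_tokens = 0
    while num_computed_tokens < prompt_len:
        chunk = min(token_chunk_size, prompt_len - num_computed_tokens)
        yield num_computed_tokens, chunk
        num_computed_tokens += chunk
    # Once the whole prompt has been computed, the first output token can be
    # sampled; later decode steps then process one new token per step.

# Example: prompt [1, 2, 3, 4, 5] with token_chunk_size=2 runs prefill over
# three steps, processing 2, 2, and 1 tokens; only after the last step is the
# first generated token (6) appended.
for before, n in chunked_prefill_steps(prompt_len=5, token_chunk_size=2):
    print(f"num_computed_tokens={before}, tokens this step={n}")
```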
Can you fix the tests to match the desired behavior instead of removing them?
Also cc @ruisearch42
Since the attention computation is still in FP16, could you benchmark with the original BF16 data type and see if there's still a gap? This could help locate the problem...
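A rough way to run the dtype comparison suggested here, as a sketch only: the model name and workload below are placeholders, and vLLM's `dtype` argument switches the overall model dtype rather than targeting the attention kernel specifically.

```python
import time

from vllm import LLM, SamplingParams

prompts = ["Hello, my name is"] * 32                      # placeholder workload
params = SamplingParams(temperature=0.0, max_tokens=128)

def bench(dtype: str) -> float:
    # Placeholder model; in practice run each dtype in a fresh process so the
    # GPU memory held by the previous engine is fully released.
    llm = LLM(model="meta-llama/Llama-2-7b-hf", dtype=dtype)
    start = time.perf_counter()
    llm.generate(prompts, params)
    return time.perf_counter() - start

for dtype in ("bfloat16", "float16"):
    print(dtype, f"{bench(dtype):.2f}s")
```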
Closed via #9497
> I'm surely late here, but why is a speculative decoding-aware scheduler needed? Wouldn't it be possible to just assume multi-token generation per step as the default? Because the scheduler has to know...
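To make the scheduling concern concrete, a small illustration (not vLLM's actual scheduler): with speculative decoding, each running sequence can consume `1 + num_speculative_tokens` tokens in a single step, so a scheduler that assumes one token per sequence per step would over-admit sequences against its per-step token budget.

```python
def max_schedulable_seqs(token_budget: int, num_speculative_tokens: int) -> int:
    """How many running sequences fit in one step's token budget."""
    tokens_per_seq_per_step = 1 + num_speculative_tokens  # bonus + draft tokens
    return token_budget // tokens_per_seq_per_step

# Example: with a 256-token step budget, 256 sequences fit without speculation,
# but only 51 fit when each sequence verifies 4 speculative tokens per step.
print(max_schedulable_seqs(256, 0), max_schedulable_seqs(256, 4))
```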
A problem with making a perfect PR is that it takes longer and the work cannot be split among others. I actually don't think the tech debt in v0 came from...
> Can we actually remove this parameter and let each hardware or attention backend choose its own? @liangfu would making such a change sound good to you?...
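As a generic illustration of the suggestion (the parameter itself is not named in the excerpt, so `block_size` below is purely a hypothetical stand-in): leaving the value unset and letting the selected attention backend supply its own default could look roughly like this.

```python
from typing import Optional

class AttentionBackendStub:
    """Hypothetical backend interface; each backend reports its preference."""

    preferred_block_size: int = 16  # e.g. one backend might prefer 16

def resolve_block_size(user_value: Optional[int],
                       backend: AttentionBackendStub) -> int:
    # If the engine config leaves the value unset, defer to the backend.
    return user_value if user_value is not None else backend.preferred_block_size

print(resolve_block_size(None, AttentionBackendStub()))  # -> 16
print(resolve_block_size(32, AttentionBackendStub()))    # -> 32
```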
@DarkLight1337 somehow this PR fails some speculative decoding tests. For example:

```
pytest spec_decode/e2e/test_mlp_correctness.py -k "test_mqa_scorer[1-32-5-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0]"
```

Before this PR was merged (7dbe738d653b563c646883c1ae6f6df927436d01 in main branch):

```
1 passed, 20 deselected,...
```