Cody Yu

Results: 161 comments by Cody Yu

> I'm a bit confused about how chunked prefill works as a single step - in the PR description it mentions that a `token_chunk_size` chunk of the prompt is scheduled as one step....
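A minimal sketch of the chunking idea under discussion, assuming the scheduler simply prefills at most `token_chunk_size` prompt tokens per step; the names (`Request`, `schedule_step`) are illustrative, not vLLM's actual scheduler code:

```python
# Illustrative sketch only: split a long prompt into chunks of at most
# `token_chunk_size` tokens, scheduling one chunk per engine step.
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: list[int]
    num_computed_tokens: int = 0  # prompt tokens already prefilled

    @property
    def remaining_prompt(self) -> int:
        return len(self.prompt_tokens) - self.num_computed_tokens


def schedule_step(request: Request, token_chunk_size: int) -> int:
    """Return how many prompt tokens to prefill in this step."""
    chunk = min(token_chunk_size, request.remaining_prompt)
    request.num_computed_tokens += chunk
    return chunk


req = Request(prompt_tokens=list(range(10)))
while req.remaining_prompt > 0:
    print("prefill chunk of", schedule_step(req, token_chunk_size=4))
# prints chunks of 4, 4, 2; decode steps would follow once the prompt is done
```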

> QQ: what's the definition of `num_computed_tokens`? For example, given a prompt `[1,2,3,4,5]`, after the prefill phase (after `process_output`) one new token is generated and we get `[1,2,3,4,5,6]` > > Before...
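To make the example concrete, here is a hypothetical accounting of that counter for the `[1,2,3,4,5]` prompt, assuming `num_computed_tokens` counts tokens whose KV-cache entries have been filled; this is a sketch of the semantics being asked about, not vLLM's actual implementation:

```python
# Hypothetical token accounting for the example prompt, assuming
# num_computed_tokens counts tokens whose KV-cache entries exist.
prompt = [1, 2, 3, 4, 5]
tokens = list(prompt)
num_computed_tokens = 0

# Prefill: all 5 prompt tokens are computed, and one new token is sampled.
num_computed_tokens += len(prompt)  # -> 5
tokens.append(6)                    # sequence is now [1, 2, 3, 4, 5, 6]

# First decode step: the previously sampled token (6) is computed,
# and the next token is sampled from its logits.
num_computed_tokens += 1            # -> 6
print(num_computed_tokens, tokens)
```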

Can you fix the tests to match the desired behavior instead of removing them?

Since the attention computation is still in FP16, could you benchmark with the original BF16 data type and see if there's still a gap? This could help locate the problem...
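A throwaway comparison along those lines might look like the sketch below; it assumes the offline `vllm.LLM` API, a placeholder model and prompt set, and wall-clock timing only, so treat it as a rough signal rather than a rigorous benchmark:

```python
# Rough A/B timing of FP16 vs. BF16 end-to-end generation with vLLM's
# offline API. Model and prompts are placeholders; running both dtypes in
# one process may require enough free GPU memory (or use two separate runs).
import time

from vllm import LLM, SamplingParams

prompts = ["Hello, my name is"] * 32
params = SamplingParams(max_tokens=128, temperature=0.0)

for dtype in ("float16", "bfloat16"):
    llm = LLM(model="facebook/opt-125m", dtype=dtype)
    start = time.perf_counter()
    llm.generate(prompts, params)
    print(f"{dtype}: {time.perf_counter() - start:.2f}s for {len(prompts)} prompts")
```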

> Surely I'm late here, but why is a speculative decoding-aware scheduler needed? Wouldn't it be possible to just assume multi-token generation per step as the default? Because the scheduler has to know...
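One way to see why the scheduler has to be aware of speculative decoding: block allocation must budget for up to k+1 tokens being appended to each sequence in a single step, not just one. A minimal sketch of that arithmetic, with illustrative names and a hypothetical block size:

```python
# Illustrative only: the scheduler must assume each sequence may append up to
# (num_speculative_tokens + 1) tokens per step, otherwise the KV cache can
# run out of slots mid-step. Names and block size are hypothetical.
BLOCK_SIZE = 16


def blocks_needed(seq_len: int, lookahead: int) -> int:
    """Blocks required to hold seq_len tokens plus `lookahead` new ones."""
    return -(-(seq_len + lookahead) // BLOCK_SIZE)  # ceiling division


seq_len = 31
print(blocks_needed(seq_len, lookahead=1))  # normal decode: 2 blocks
print(blocks_needed(seq_len, lookahead=6))  # k=5 spec decode: 3 blocks
```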

A problem with making a perfect PR is that it takes longer and the work cannot be split among others. I actually don't think the tech debt in v0 came from...

> Can we actually remove this parameter and let each hardware or attention backend choose its own? @liangfu Does this sound good to you if we make such a change?...

@DarkLight1337 somehow this PR failed some speculative decoding tests. For example:

```
pytest spec_decode/e2e/test_mlp_correctness.py -k "test_mqa_scorer[1-32-5-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0]"
```

Before this PR merged (7dbe738d653b563c646883c1ae6f6df927436d01 in main branch):

```
1 passed, 20 deselected,...
```