
Results 6 comments of MMuzzammil1

Hi @0xDEADFED5. I created this issue for the "Phi-2" model (https://huggingface.co/microsoft/phi-2). I'm not sure about the behaviour of Llama-3.

I'll run the benchmarks to check that. But @0xDEADFED5, isn't the decode speed at least independent of the prompt input?
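To make the question concrete: a minimal, model-agnostic sketch of how one could check whether decode TPS varies with prompt length. `fake_generate` is a hypothetical stand-in for a real model's generate call (not Phi-2 or vLLM specific); it simulates a constant per-token decode cost, so measured TPS should come out roughly the same for short and long prompts.

```python
import time

def measure_decode_tps(generate_fn, prompt, max_new_tokens=128):
    """Time token generation and return tokens/second (decode speed).

    generate_fn is a placeholder for any model's generate call; it is
    assumed to return the number of tokens actually produced.
    """
    start = time.perf_counter()
    n_tokens = generate_fn(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

def fake_generate(prompt, max_new_tokens):
    # Hypothetical model: per-token decode step costs the same
    # regardless of how long the prompt is.
    for _ in range(max_new_tokens):
        time.sleep(0.0005)  # simulated decode step
    return max_new_tokens

short_tps = measure_decode_tps(fake_generate, "short prompt")
long_tps = measure_decode_tps(fake_generate, "x" * 4000)
print(short_tps > 0 and long_tps > 0)
```

With a real model the comparison would be the same shape: swap `fake_generate` for the actual generate call and compare TPS across prompt lengths (in practice, long prompts can still slow decode somewhat via a larger KV cache, which is exactly what a benchmark like this would surface).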

I think this issue has now been fixed in the v0.6.2 release of vllm. Please see this: https://github.com/vllm-project/vllm/pull/8790.

> [@hongyanz](https://github.com/hongyanz) By the way, this is the accept length for Qwen3-8B-Eagle3 in code generation, and its TPS (tokens per second) can reach nearly 500. @jiahe7ay May I ask which...
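For intuition on how accept length relates to TPS: in speculative decoding, each target-model verification step can emit up to the accepted draft tokens plus one bonus token, so throughput scales roughly with the mean accept length, minus draft-model overhead. A back-of-the-envelope sketch (the baseline TPS and overhead fraction here are hypothetical illustration values, not measured Qwen3-8B-Eagle3 numbers):

```python
def estimated_tps(base_tps, accept_length, draft_overhead=0.2):
    """Rough speculative-decoding throughput estimate.

    accept_length: mean tokens emitted per target verification step.
    draft_overhead: fraction of a target step spent running the draft
    model (hypothetical value; real overhead depends on models/hardware).
    """
    return base_tps * accept_length / (1.0 + draft_overhead)

# A hypothetical 150 TPS autoregressive baseline with accept length ~4:
print(round(estimated_tps(150, 4.0), 1))  # → 500.0
```

This is only a first-order estimate; real TPS also depends on batch size, acceptance variance, and how much of the draft/verify work overlaps.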

> > > [@hongyanz](https://github.com/hongyanz) By the way, this is the accept length for Qwen3-8B-Eagle3 in code generation, and its TPS (tokens per second) can reach nearly 500. > > >...

@jiahe7ay do you have any results at temperature=1 for this draft model? Or have you mostly tested it at t=0?
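The reason temperature matters: the expected per-token acceptance probability in speculative sampling is Σ_x min(p(x), q(x)) over target distribution p and draft distribution q. At t→0 both distributions collapse toward their argmax, so accept length tends to look better than at t=1, where the distributions are flatter and draft/target tail mismatch shows up. A toy illustration with made-up logits (not from any real model):

```python
import math

def softmax(logits, t):
    """Temperature-scaled softmax over a list of logits."""
    m = max(l / t for l in logits)
    exps = [math.exp(l / t - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def accept_rate(target_logits, draft_logits, t):
    """Expected speculative-sampling acceptance prob: sum_x min(p, q)."""
    p = softmax(target_logits, t)
    q = softmax(draft_logits, t)
    return sum(min(pi, qi) for pi, qi in zip(p, q))

# Hypothetical logits: draft and target agree on the top token
# but disagree in the tail.
target = [3.0, 1.0, 0.5, 0.2]
draft = [3.0, 0.5, 1.0, 0.1]

# Near-greedy sampling accepts more than t=1 for these distributions.
print(accept_rate(target, draft, t=0.1) > accept_rate(target, draft, t=1.0))  # → True
```

So accept-length numbers reported at t=0 usually will not carry over unchanged to t=1, which is why results at both settings are worth comparing.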