GUO-QING JIANG
Any updates on this topic? Is this behavior normal, with Qwen2 running much slower than Vicuna and Llama?
> My guess is that EAGLE 3 uses the same training method as [HASS](https://github.com/HArmonizedSS/HASS), except:
>
> 1. As stated in the v3 paper, EAGLE does not have hidden states distillation...
> Have you seen the acceptance rate increase from replacing the last-layer hidden-state extraction with the [2nd, mid, len-2] layers' hidden states alone? I observe worsened training time...
@carlbunny > Can you elaborate more on using input_ids for train-time test? For predicting the second token, you are not using âₜ₊₁ but the corresponding next token in the...
> > We only see an acceptance-rate benefit from the hidden fusion (about 5 -> 5.7); it is hard to get any benefit from train-time test (cannot reproduce >6.5). > >...
Comments: In my experiments, scaling the training data yields a log scaling law for the accept rate on both pretraining data (Fig. 1a) and SFT data (Table 9; Scylla+8SFT means 8x SFT data)....
> [@Ageliss](https://github.com/Ageliss) Awesome paper on scaling laws for spec decoding!! But I still have some questions about the paper, which only used the EAGLE2 configuration and excluded EAGLE3 train-time test +...