
[Feature] TTT implementation is not the original paper’s concept, but merely data augmentation

Open haiduo opened this issue 3 months ago • 6 comments

Checklist

  • [x] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose Otherwise, it will be closed.
  • [x] 2. Please use English, otherwise it will be closed.

Motivation

Thanks to the SpecForge team for generously open-sourcing this project. However, after carefully reading through the core implementation, I found the following:

  1. According to the original EAGLE-style design, it seems that this should be `left=True`.

  2. The TTT implementation is not the original EAGLE-3 paper's concept, but merely data augmentation. Please refer to the details here.

Related resources

No response

haiduo avatar Sep 10 '25 08:09 haiduo

My understanding of TTT (training-time test) is that it's used to align inference and training (see this paper: https://arxiv.org/abs/2408.15766); it's not a form of data augmentation. Without TTT, the model never actually uses the features generated by the draft model itself for next-token prediction during training, even though it must do so at inference. Without TTT, you revert to the EAGLE-1 approach, which uses an additional loss to align features between the draft and target models.
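A toy sketch of the idea (hypothetical module and shapes, not the actual SpecForge code): each unrolled TTT step feeds the draft model's own hidden states back in, so training sees the same feature distribution the model faces at inference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for illustration only -- not the SpecForge classes.
class ToyDraft(nn.Module):
    def __init__(self, d=16, vocab=100):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.proj = nn.Linear(2 * d, d)   # fuse token embedding with previous hidden state
        self.head = nn.Linear(d, vocab)

    def forward(self, ids, hidden):
        h = self.proj(torch.cat([self.emb(ids), hidden], dim=-1))
        return h, self.head(h)

draft = ToyDraft()
ids = torch.randint(0, 100, (2, 8))       # (batch, seq) dummy tokens
hidden = torch.randn(2, 8, 16)            # step 0: features from the target model
labels = torch.randint(0, 100, (2, 8))

loss = torch.tensor(0.0)
for step in range(3):                     # unroll 3 TTT steps
    hidden, logits = draft(ids, hidden)   # steps > 0 reuse the draft's OWN hidden states
    loss = loss + F.cross_entropy(logits.transpose(1, 2), labels)
loss.backward()                           # gradients flow through the whole unroll
```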

jiapingW avatar Sep 11 '25 07:09 jiapingW

Hi @jiapingW ,

Thanks for your response.

Yes, I admit that TTT is used to align inference with training, mainly to reduce exposure bias; I've already discussed that here. However, the EAGLE-3 paper does seem to require the draft model to generate during training, whereas the implementation here simply shifts the same input tokens to the left by a few steps (i.e., the TTT step size) and pads zeros on the right (see the toy sketch below). I don't think that matches the schematic in the EAGLE-3 paper, and if so, the naming doesn't really fit. This is just my personal opinion.
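To make the shift-and-pad concrete, here is a toy illustration (dummy tensors, not the repo's code):

```python
import torch

ids = torch.arange(1, 9).unsqueeze(0)   # (1, 8) dummy input token ids
step = 2                                # hypothetical TTT step size

# Shift the tokens left by `step` and zero-pad on the right:
shifted = torch.cat([ids[:, step:], torch.zeros_like(ids[:, :step])], dim=-1)
print(shifted)                          # tensor([[3, 4, 5, 6, 7, 8, 0, 0]])
```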

Also, I’m not sure whether this approach (shifting the same input tokens to the left by a few steps and padding zeros on the right) is actually what HASS implements—I haven’t had time yet to read that excellent paper.

In addition, aligning inference and training is essentially a knowledge-distillation idea for LLMs. EAGLE-1/2 fundamentally trained with feature distillation + logits distillation, while EAGLE-3 simply dropped the feature distillation. But that doesn't mean logits distillation alone can't align inference with training; both provide alignment. In short, I think this method essentially just augments the logits in the distillation domain, and it can be regarded as a form of ensemble learning. If you're familiar with distillation (CV/NLP), there's quite a lot of related work, for example: "Why logit distillation works: A novel knowledge distillation technique by deriving target augmentation and logits distortion."

By the way, I also think the scaling success of EAGLE-3 doesn't come from removing feature distillation, but rather from small structural tweaks (like adding a LayerNorm) and from online logits distillation.

haiduo avatar Sep 11 '25 08:09 haiduo

Your understanding is deep, and I agree that EAGLE-3 training is a process of knowledge distillation. Whether the training goal is to align logits or features, both are correct from a high-level perspective. As for your suggestion that small structural adjustments produce such a significant effect, perhaps some ablation experiments could be used to analyze this.

jiapingW avatar Sep 11 '25 08:09 jiapingW

Hi @haiduo, do you think the token shifting leads to any real effect in this case? The shift is at most 7 steps while the context length can be very long, so I am unsure of its actual impact.

yubofredwang avatar Sep 18 '25 05:09 yubofredwang

Hi @yubofredwang ,

Although this simple shifting operation is not a faithful implementation of "TTT," it should not hurt the practical effectiveness. On the contrary, it may even facilitate learning: it increases the training computation, which is roughly equivalent to enlarging the batch size, and the noise it introduces (i.e., the padding operation) can be regarded as a form of regularization similar to dropout, which is beneficial for training. In addition, the number of "TTT" steps should be treated as a hyperparameter to tune, analogous to the batch size. Its actual impact, however, needs to be verified through ablation experiments.

haiduo avatar Sep 18 '25 06:09 haiduo

Hello, according to the source code, I think the padding at the right end will not affect the loss or the training, since the loss mask at those positions is 0.
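A minimal sketch (made-up tensors, not the repo's actual code) of why positions with a zero loss mask cannot influence the gradient:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 8, 100)          # (batch, seq, vocab), dummy values
labels = torch.randint(0, 100, (1, 8))
loss_mask = torch.tensor([[1., 1., 1., 1., 1., 1., 0., 0.]])  # 0 at the right padding

per_token = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
loss = (per_token * loss_mask).sum() / loss_mask.sum()
# Changing the labels at the masked positions leaves `loss` (and its gradient) unchanged.
```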

magicllm avatar Nov 21 '25 07:11 magicllm