[Feature] Create a standard, balanced, robust multi-domain training dataset
Checklist
- [x] 1. If the issue you raised is not a feature request but a question, please open a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose instead; otherwise it will be closed.
- [x] 2. Please use English, otherwise it will be closed.
Motivation
Previously, we trained our models on UltraChat and ShareGPT; however, these datasets can fall short on tasks such as math, coding, and reasoning. The SpecForge team is preparing to release a new multi-domain dataset containing around 1M samples for speculative decoding training.
Related resources
No response
Comments: In my experiments, scaling the training data yields a logarithmic scaling law for the accept rate, both on pretraining data (Fig. 1a) and on SFT data (Table 9; Scylla+8SFT means 8x SFT data).
How many tokens will 1M data samples contain?
ref: https://arxiv.org/pdf/2505.07858
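To make the claimed trend concrete, here is a minimal sketch (my own illustration, not code from the paper) of fitting the logarithmic law accept_rate ≈ a + b * ln(N), where N is the number of training samples; all names are illustrative and you would plug in your own measured points.

```python
# Illustrative sketch: least-squares fit of accept_rate = a + b * ln(num_samples).
# Nothing here is taken from the paper; supply your own measured pairs.
import numpy as np

def fit_log_scaling(num_samples, accept_rates):
    """Return (a, b) for accept_rate ≈ a + b * ln(num_samples)."""
    x = np.log(np.asarray(num_samples, dtype=float))
    y = np.asarray(accept_rates, dtype=float)
    b, a = np.polyfit(x, y, deg=1)  # slope, intercept
    return a, b

def predict_accept_rate(a, b, num_samples):
    return a + b * np.log(np.asarray(num_samples, dtype=float))
```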
@Ageliss Good paper to read. I have not planned the number of tokens yet. Maybe there are some reference datasets I can try out first, for example https://huggingface.co/datasets/mlabonne/open-perfectblend
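As a rough way to answer the token-count question, one could stream a slice of open-perfectblend and extrapolate from the average sample length. A hedged sketch, assuming the dataset follows the ShareGPT-style "conversations" schema and using an assumed tokenizer in place of the actual target model's:

```python
# Hedged sketch: estimate total tokens in ~1M samples by averaging over a small
# streamed sample. The tokenizer name and the "conversations" field layout are
# assumptions; adjust them to the real target model and dataset schema.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # assumed target model
ds = load_dataset("mlabonne/open-perfectblend", split="train", streaming=True)

sample_size, total_tokens, n = 1_000, 0, 0
for row in ds:
    # Concatenate the turns of a multi-turn conversation before tokenizing.
    text = "\n".join(turn["value"] for turn in row["conversations"])
    total_tokens += len(tokenizer(text).input_ids)
    n += 1
    if n >= sample_size:
        break

avg = total_tokens / n
print(f"~{avg:.0f} tokens/sample -> ~{avg * 1e6 / 1e9:.1f}B tokens for 1M samples")
```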
@Ageliss Awesome paper on scaling laws for spec decoding!! But I still have some questions about it: the paper only uses the EAGLE-2 configuration and excludes EAGLE-3's train-time test + feature fusion. The EAGLE-3 paper says that with EAGLE-2 (w/o train-time test) the scaling property was not observed, yet in Scylla the scaling is observed and the results are strong. Maybe there is something different between your implementation and the official EAGLE-2? 🤔
Yes, I share your view. However, I feel that EAGLE-2 may also exhibit a scaling-law phenomenon; it just didn’t use the online training adopted by EAGLE-3. In my opinion, whether a model follows a scaling law depends on the training procedure. I’m skeptical of EAGLE-3’s claim that the “feature loss” is what prevents scaling, because the original paper doesn’t clearly explain how the comparison was made—it only says the dataset was increased, without stating whether the training setup was the same.
"exclude EAGLE3 ttt+feature fusion": Our work was done before EAGLE3 paper released thus we didn't apply Scylla on EAGLE3. But these works are independent and can be used in the same time!
"different between your implementation and EAGLE2": Actually, we only added some layernorm layers and we scaled up using the pretrain data.
Thanks for the reply!! It seems the norm layers play a critical role in the scaling! This is also mentioned in this issue in the EAGLE repo.
Yes, I also found a similar trend when combining EAGLE-2 with multimodal settings (LLaVA). It seems that different studies eventually converge on the same path.
Closed, as we chose mlabonne/open-perfectblend as a well-balanced dataset for training.