
Naive brainstorm: accept length simulator

fzyzcjy opened this issue 5 months ago · 6 comments

WARN: I have not studied speculative decoding in detail, so this is just a naive brainstorm and I could be totally wrong!

Currently, it seems we report accuracy and loss on eval data. However, what we really care about is the accept length.

Therefore, it would be great to have an API that simulates accept length across various EAGLE configurations at once, is called automatically on train/val data, and is also exposed as a normal function so users can call it directly.

~~From my naive view, we could first compute the draft model's outputs, then use very quick calculations (e.g., a sliding window over the output token IDs) for each configuration to obtain the accept length.~~ EDIT: after briefly reading EAGLE-3, I realize the configurations have different hidden states, so we may need to rerun the draft model for each config; even so, that should be lightweight compared to running full experiments with an inference engine.
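For the greedy (top_k=1) case, that per-window calculation really is cheap: given the draft tokens and the teacher-forced target tokens for a window, the accept length is just the longest matching prefix plus the one correction token the target model contributes. A minimal sketch (`greedy_accept_length` is a hypothetical helper, not existing SpecForge API):

```python
import torch

def greedy_accept_length(draft_ids: torch.Tensor, target_ids: torch.Tensor) -> int:
    """Accept length for one verification window under greedy (top_k=1) decoding.

    A draft token is accepted only if all tokens before it were accepted, so
    the count of accepted drafts is the longest common prefix of the two
    sequences; the target model then always contributes one more token.
    Assumes both tensors are 1-D and the same length.
    """
    mismatches = (draft_ids != target_ids).nonzero()
    accepted = int(mismatches[0]) if len(mismatches) > 0 else len(draft_ids)
    return accepted + 1  # +1 for the correction token from the target model
```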

This has three use cases from my naive view: (1) we get the e2e metric we actually care about during training, at almost no extra cost, which may help guide training; (2) we can find the likely best config without having to test each and every EAGLE configuration, which is time-consuming; (3) it may be useful as a lightweight simulator for other scenarios I am interested in.

Potential drawback: I don't know whether the discrepancy versus a real inference engine will be large enough to make this number inaccurate.

fzyzcjy · Jul 24 '25

That's true.

FlamingoPg · Jul 28 '25

I also urgently need this feature. Is anyone currently developing it? If not, I'd like to try implementing it myself.

Lihui-Gu · Nov 02 '25

No, I don't have time to implement this. Feel free to open a PR; looking forward to it!

fzyzcjy · Nov 02 '25

I've implemented and tested the feature. Here's my plan.

Functional Requirements

  • The system should prioritize evaluating key metrics such as accept length, enabling direct validation on datasets without relying on an sglang server.
  • Facilitate performance analysis and benchmarking of the draft model's efficiency. This enables testing inference optimizations (including but not limited to quantization and sparse attention) and measuring their benefits on the draft-model side. Evaluation uses a pre-prepared test set in JSONL format containing the system prompt, user input, image input (if applicable), and assistant responses pre-sampled from the target model.
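For concreteness, one record of such a JSONL test set might look like the line below; the field names are illustrative, not a fixed SpecForge schema:

```json
{"system_prompt": "You are a helpful assistant.", "user_input": "Describe this image.", "image": "images/0001.png", "assistant_response": "The image shows a red bicycle leaning against a wall."}
```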

Implementation Details

Target Model Prefill:

  • A single prefill pass over the entire sequence (system prompt + question + answer)
  • Hidden states are concatenated with input_ids shifted left by one position

Sliding Window (a condensed sketch of this loop follows the list):

  • Extract the target_ids window
  • Prepare the draft model input
  • Generate draft_ids using draft_model_generate() with top_k=1
  • Calculate accept length by comparing draft_ids with target_ids
  • Slide the window forward by the accept length, preserving the relevant KV cache
  • Maintain the cache for incremental input between windows
  • Clear the cache from autoregressive generation after each draft model call
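Putting the steps together, the core loop could look roughly like the sketch below. This is my condensation of the plan above, with the draft rollout abstracted behind a callable and the KV-cache handling reduced to comments; the actual implementation will differ:

```python
from typing import Callable, List
import torch

def evaluate_accept_length(
    target_ids: torch.Tensor,                        # pre-sampled (teacher-forced) answer tokens
    draft_step: Callable[[int, int], torch.Tensor],  # (start_pos, n_tokens) -> greedy draft ids
    num_steps: int = 7,
) -> float:
    """Simulate greedy speculative decoding over a pre-sampled target sequence."""
    pos = 0
    accept_lengths: List[int] = []
    while pos < len(target_ids) - 1:
        window = target_ids[pos : pos + num_steps]   # target tokens to verify against
        draft_ids = draft_step(pos, len(window))     # draft rollout with top_k=1
        # Same longest-matching-prefix rule as sketched earlier, plus the
        # one correction token the target model always contributes.
        mismatches = (draft_ids != window).nonzero()
        n = (int(mismatches[0]) if len(mismatches) > 0 else len(window)) + 1
        accept_lengths.append(n)
        pos += n                                     # slide window by accept length
        # Real implementation: keep the KV cache up to `pos` and clear the
        # entries created by the rejected autoregressive draft steps.
    return sum(accept_lengths) / max(len(accept_lengths), 1)  # mean accept length
```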

The Eagle3Model class adds two main functions:

  • draft_model_generate (decoding with the draft model)
  • evaluation_accept_length (the sliding-window evaluation)
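For reference, the two entry points might have signatures along these lines (guessed from the description above, not taken from the PR):

```python
import torch

class Eagle3Model:  # hypothetical sketch of the two new entry points only
    def draft_model_generate(
        self, hidden_states: torch.Tensor, input_ids: torch.Tensor,
        steps: int = 7, top_k: int = 1,
    ) -> torch.Tensor:
        """Autoregressively decode `steps` draft tokens (greedy when top_k=1)."""
        ...

    def evaluation_accept_length(self, dataset, num_steps: int = 7) -> float:
        """Run the sliding-window simulation over `dataset`; return mean accept length."""
        ...
```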

Validation Method

Results will be validated against an sglang server launched with equivalent parameters:

```
--speculative-num-steps 7
--speculative-eagle-topk 1
--speculative-num-draft-tokens 7
```

The implementation will ensure accept length metrics match sglang's results for identical test cases.
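For validation, the server would be launched with something like the command below (model paths are placeholders, and the flag names should be double-checked against the installed sglang version):

```bash
python3 -m sglang.launch_server \
  --model-path <target-model> \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path <draft-model> \
  --speculative-num-steps 7 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 7
```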


That's my implementation idea. I'll organize it and push the code later. If there are any areas for improvement, please let me know!

Lihui-Gu · Nov 05 '25

@Lihui-Gu Hi, great work! May I take a look at your PR?

ggg-s · Nov 06 '25

See https://github.com/sgl-project/SpecForge/pull/279. I prioritized supporting and testing QwenVL models.

Lihui-Gu · Nov 07 '25