[Small LLM] Max tokens fixed at 128?
The reference implementation appears to fix the maximum number of output tokens to 128.
Despite this, the reference scores:
{
'rouge1': 38.7792,
'rouge2': 15.9075,
'rougeL': 24.4957,
'rougeLsum': 35.793,
'gen_len': 8167644,
'gen_num': 13368,
}
imply that the average number of output tokens is ~611 (= gen_len / gen_num).
What's going on?
I believe the gen_len here counts characters, not tokens.
Llama2-70b also computes gen_tok_len, which is then used to compute the tokens per sample.
Adding to this, here's a snapshot of mlperf_log_accuracy.json, which we obtained by running the reference implementation:
{ "seq_id" : 0, "qsl_idx" : 2844, "data" : "...", "token_count" : 128 },
{ "seq_id" : 1, "qsl_idx" : 7863, "data" : "...", "token_count" : 128 },
{ "seq_id" : 2, "qsl_idx" : 10880, "data" : "...", "token_count" : 128 },
{ "seq_id" : 3, "qsl_idx" : 10847, "data" : "...", "token_count" : 128 },
{ "seq_id" : 4, "qsl_idx" : 11808, "data" : "...", "token_count" : 128 },
...
...
token_count is 128 for each of the samples.
Accuracy:
{'rouge1': '38.8329', 'rouge2': '15.9667', 'rougeL': '24.5374', 'rougeLsum': '35.8886', 'gen_len': np.int64(8182758), 'gen_num': 13368}
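The per-sample check on the snapshot can be scripted rather than eyeballed. This is a minimal sketch, assuming the accuracy log is a JSON array of objects that each carry a `token_count` field as in the snapshot; the helper name `token_count_histogram` is hypothetical and not part of the reference implementation.

```python
import json
from collections import Counter

def token_count_histogram(log_text):
    """Count how many logged samples have each token_count value."""
    entries = json.loads(log_text)
    return Counter(entry["token_count"] for entry in entries)

# Tiny inline stand-in for the log; the real file would be read from disk.
sample_log = """[
  {"seq_id": 0, "qsl_idx": 2844, "token_count": 128},
  {"seq_id": 1, "qsl_idx": 7863, "token_count": 128}
]"""
histogram = token_count_histogram(sample_log)
print(histogram)
```

If every sample hits the limit, the histogram has a single key equal to the configured maximum.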
I believe the gen_len here counts characters, not tokens. Llama2-70b also computes gen_tok_len, which is then used to compute the tokens per sample.
Thanks @attafosu. For Llama2-70B, we calculate both gen_len and gen_tok_len e.g.:
{'rouge1': 44.7466, 'rouge2': 22.3524, 'rougeL': 29.1548, 'rougeLsum': 42.2693, 'gen_len': 26328995, 'gen_num': 24576, 'gen_tok_len': 6677253, 'tokens_per_sample': 271.7}
For Llama3.1-405B, too, e.g.:
{'exact_match': 90.12851091992057, 'rougeL': 21.93672502698354, 'gen_len': 23338435, 'gen_num': 8313, 'gen_tok_len': 5456327, 'tokens_per_sample': 656.4}
For the Small LLM, the character-to-token ratio is ~4.8 (assuming 128 tokens per sample); for Llama2-70B, it's ~3.9 (= gen_len / gen_tok_len = 26328995 / 6677253); for Llama3.1-405B, it's ~4.3. That's fine given the difference in vocabularies used.
Perhaps we should introduce gen_tok_len for the Small LLM too for consistency and to avoid confusion in the future.
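The characters-per-token figures can be recomputed directly from the metrics quoted in this thread. A small sketch follows; the Small LLM does not report gen_tok_len, so its token total is an assumption (exactly 128 tokens for each of the 13368 samples, as the accuracy log suggests), and the function name `chars_per_token` is only illustrative.

```python
def chars_per_token(gen_len, gen_tok_len):
    """Average number of characters per generated token."""
    return gen_len / gen_tok_len

# Assumption: every Small LLM sample emits exactly 128 tokens.
small_llm = chars_per_token(8167644, 128 * 13368)
# gen_len and gen_tok_len as reported above for the two larger models.
llama2_70b = chars_per_token(26328995, 6677253)
llama31_405b = chars_per_token(23338435, 5456327)

print(f"Small LLM:      {small_llm:.2f} chars/token")
print(f"Llama2-70B:     {llama2_70b:.2f} chars/token")
print(f"Llama3.1-405B:  {llama31_405b:.2f} chars/token")
```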
The main question remains: Why is the maximum number of output tokens fixed at 128? From what we see, the model "wants" to "say" more in practically every case, but it's prevented from doing so. This is not very realistic for the summarization task.
Not to mention that this removes one of the typical optimization challenges: both the input length and the output length being randomly distributed.
@psyhtest Valid point on the OSL (output sequence length) distribution. IIRC, one of the reasons was that without finetuning, the 8B model was quite verbose, which is evident from most of the generated outputs being 128 tokens (in reality this should vary, with some being lower and, of course, others higher). But I think the actual decision of max 128 tokens was mistakenly borrowed from gpt-j (which is limited in max sequence length). We overlooked the ground truth output length distribution (attached below). Given that submission is very close, we may have to bring this discussion to the WG to see if we need some revision. From the summary, there's about 4% of ground truth lengths > 128
Ground truth output sequence length summary:
count 13368
mean 72.040171
std 32.064774
min 14
50% 67
90% 107
95% 123
96% 127
97% 133
99.9% 208.266000
max 1893.000000
Clarification: gen_len is CHARACTER count, not TOKEN count
I've analyzed the code and can clarify the confusion. There is no discrepancy - the model correctly respects the 128 token limit.
Key Finding
In language/llama3.1-8b/evaluation.py (lines 127-129):
prediction_lens = [len(pred) for pred in preds] # len() on STRING = characters
result["gen_len"] = np.sum(prediction_lens) # Sum of CHARACTER counts
result["gen_num"] = len(preds) # Number of samples
gen_len counts characters, not tokens. The len() function on a decoded string returns character count.
Math Verification
Your metrics:
- gen_len: 8,167,644
- gen_num: 13,368
- Average: 611 characters per sample
Character-to-token ratio observed for this model's outputs: ~4.8 chars/token
- 128 tokens × 4.8 = ~614 characters
Perfect match! The model generates ≤128 tokens, which decode to ~611 characters on average.
Code Confirmation
SUT_VLLM.py line 75:
"max_tokens": 128, # Hard limit enforced
Recommendation
To prevent future confusion, consider adding clarifying metrics:
result["gen_len_chars"] = np.sum(prediction_lens) # Explicit: characters
result["avg_chars_per_sample"] = result["gen_len_chars"] / result["gen_num"]
# Optional: estimate tokens (model-specific ratio)
result["est_avg_tokens_per_sample"] = result["avg_chars_per_sample"] / 4.8
The 128 token limit is working correctly. The confusion stems from metric naming - gen_len suggests tokens but actually counts characters.
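Rather than estimating tokens from a fixed ratio, the token total could be reported alongside the character total, as the Llama2-70B and Llama3.1-405B results do with gen_tok_len and tokens_per_sample. The sketch below is not the reference evaluation code: `length_metrics` and its `count_tokens` parameter are hypothetical, and `count_tokens` stands for any callable that maps a decoded string to a token count (in practice it would wrap the model's own tokenizer). Re-tokenizing decoded text may not reproduce the generated token count exactly, so counting the generated token IDs directly would be more faithful where they are available.

```python
def length_metrics(preds, count_tokens):
    """Report character and token totals for a list of decoded predictions."""
    char_lens = [len(pred) for pred in preds]          # characters per sample
    tok_lens = [count_tokens(pred) for pred in preds]  # tokens per sample
    result = {
        "gen_len": sum(char_lens),      # total characters (existing meaning)
        "gen_num": len(preds),          # number of samples
        "gen_tok_len": sum(tok_lens),   # total tokens
    }
    # Guard against an empty prediction list.
    result["tokens_per_sample"] = round(
        result["gen_tok_len"] / max(result["gen_num"], 1), 1
    )
    return result

# Whitespace splitting stands in for a real tokenizer in this demo.
demo = length_metrics(["a short summary", "another one"], lambda s: len(s.split()))
print(demo)
```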
From the summary, there's about 4% of ground truth lengths > 128
It looks like close to 100% of the generated outputs hit the 128-token limit, that is, the model is much more chatty than the ground truth! Maybe the prompt should have asked for as concise a summary as possible. I think we should consider modifying it for the next round.
WG Meeting: Fix in 6.0.
Just to add on: initially, during the taskforce, we'd observed that without the 128-token limit we were getting much poorer accuracy, despite the system prompt asking the model to be succinct.
Perhaps this might be helpful?
If you haven't tried "threatening" LLMs in system prompts, then you should!
That LinkedIn post about "threatening" LLMs is hilarious but actually makes sense! 😄
Given that Llama 3.1 is being so chatty (practically 100% of outputs hit the 128-token limit vs ~4% of ground truth lengths exceeding 128), maybe we do need to get a bit more... assertive with our prompts.
Instead of politely asking "please be concise", something like:
- "Summary MUST be under 50 words. Exceeding this limit will result in immediate rejection."
- "You have a strict 3-sentence limit. Every word counts."
It's funny how models respond better to consequences than kindness - just like they picked up on internet drama during training!
@taran2210's observation that the hard 128-token cutoff actually improves accuracy supports this too. Maybe for v6.0, combine both approaches - stern prompt + token limit?
Hi @psyhtest. One request that came up from the WG is whether you can generate some statistics of the generated output sequence lengths when the max output tokens is increased beyond 128. The goal is to see whether the output lengths will vary or whether they will skew towards the max tokens (the increased value), as seen in the case of 128.
@sahelib25 Can we please try the reference with max tokens set to, e.g., 256?
Hi, when running the dataset with 128, 256, 1K, and 2K max tokens, the model consistently generated outputs of exactly the maximum length with no variation. With 4K max tokens, we start seeing variations in the output lengths.
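For reporting back to the WG, the per-setting statistics could be summarized along these lines. This is a sketch only: `saturation_summary` is a hypothetical helper, it assumes the per-sample output token counts have already been collected for each max tokens setting, and the numbers in the demo are made up purely to exercise the function (they are not measured results).

```python
from statistics import median

def saturation_summary(lengths_by_cap):
    """For each max-tokens cap, summarize output lengths and cap saturation."""
    summary = {}
    for cap, lengths in lengths_by_cap.items():
        hit_cap = sum(1 for n in lengths if n >= cap)
        summary[cap] = {
            "min": min(lengths),
            "median": median(lengths),
            "max": max(lengths),
            # Fraction of samples whose output stopped at the cap.
            "at_cap_fraction": hit_cap / len(lengths),
        }
    return summary

# Made-up token counts keyed by max-tokens setting, for illustration only.
toy_lengths = {128: [128, 128, 128], 4096: [900, 4096, 2500]}
summary = saturation_summary(toy_lengths)
for cap, stats in summary.items():
    print(cap, stats)
```

An at_cap_fraction close to 1.0 means the outputs are truncated by the limit rather than ending naturally.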
@psyhtest @attafosu — let us know if you'd prefer to have the discussion at the start of the WG meeting. I've seen both joined over the past two weeks, but Anton had to drop off due to the late hour in Europe. Hopefully we can resolve this offline, but if it's easier to talk during the first 15 minutes of the meeting, please let Miro and me know.
WG Meeting: @psyhtest to try out a few prompts and report