[Small LLM] Max tokens fixed at 128?

Open psyhtest opened this issue 5 months ago • 16 comments

The reference implementation appears to fix the maximum number of output tokens to 128.

Despite this, the reference scores:

{
        'rouge1': 38.7792,
        'rouge2': 15.9075,
        'rougeL': 24.4957,
        'rougeLsum': 35.793,
        'gen_len': 8167644,
        'gen_num': 13368,
}

imply that the average number of output tokens is ~611 (= gen_len / gen_num).

What's going on?

psyhtest avatar Jul 15 '25 15:07 psyhtest

The gen_len here is not the tokens but the characters I believe. Llama2-70b computes the gen_tok_len which is then used to compute the tokens per sample

attafosu avatar Jul 15 '25 16:07 attafosu

Adding to this, here's a snapshot of mlperf_log_accuracy.json which we obtained by running the reference implementation -

{ "seq_id" : 0, "qsl_idx" : 2844, "datatoken_count" : 128 },
{ "seq_id" : 1, "qsl_idx" : 7863, "datatoken_count" : 128 },
{ "seq_id" : 2, "qsl_idx" : 10880, "datatoken_count" : 128 },
{ "seq_id" : 3, "qsl_idx" : 10847, "datatoken_count" : 128 },
{ "seq_id" : 4, "qsl_idx" : 11808, "datatoken_count" : 128 },
...
...

token_count is 128 for each of the samples.

Accuracy:

{'rouge1': '38.8329', 'rouge2': '15.9667', 'rougeL': '24.5374', 'rougeLsum': '35.8886', 'gen_len': np.int64(8182758), 'gen_num': 13368}

sahelib25 avatar Jul 15 '25 16:07 sahelib25

The gen_len here is not the tokens but the characters I believe. Llama2-70b computes the gen_tok_len which is then used to compute the tokens per sample

Thanks @attafosu. For Llama2-70B, we calculate both gen_len and gen_tok_len e.g.:

{'rouge1': 44.7466, 'rouge2': 22.3524, 'rougeL': 29.1548, 'rougeLsum': 42.2693, 'gen_len': 26328995, 'gen_num': 24576, 'gen_tok_len': 6677253, 'tokens_per_sample': 271.7}

For Llama3.1-405B, too, e.g.:

{'exact_match': 90.12851091992057, 'rougeL': 21.93672502698354, 'gen_len': 23338435, 'gen_num': 8313, 'gen_tok_len': 5456327, 'tokens_per_sample': 656.4}

For the Small LLM, the character-to-token ratio is ~4.8 (assuming 128 tokens per sample); for Llama2-70B, it's ~3.9 (gen_len / gen_tok_len); for Llama3.1-405B, it's ~4.3. That's fine given the difference in vocabularies used.
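As a rough check, the characters-per-token ratios can be recomputed from the accuracy outputs quoted in this thread. The sketch below assumes that every Small LLM sample hits the 128-token cap, since that log has no gen_tok_len. The dictionary layout and the helper name chars_per_token are illustrative only and not part of the reference code. Under that assumption the ratios work out to roughly 4.8, 3.9 and 4.3.

```python
# Figures copied from the accuracy outputs quoted in this thread.
reported = {
    "Small LLM": {"gen_len": 8167644, "gen_num": 13368},
    "Llama2-70B": {"gen_len": 26328995, "gen_tok_len": 6677253},
    "Llama3.1-405B": {"gen_len": 23338435, "gen_tok_len": 5456327},
}

# ASSUMPTION: the Small LLM output has no gen_tok_len, so we assume every
# sample is cut off at exactly 128 tokens.
ASSUMED_TOKENS_PER_SAMPLE = 128


def chars_per_token(stats):
    """Return generated characters divided by generated tokens."""
    tokens = stats.get("gen_tok_len")
    if tokens is None:
        # Fall back to the assumed fixed number of tokens per sample.
        tokens = stats["gen_num"] * ASSUMED_TOKENS_PER_SAMPLE
    return stats["gen_len"] / tokens


ratios = {name: chars_per_token(stats) for name, stats in reported.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} characters per token")
```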

Perhaps we should introduce gen_tok_len for the Small LLM too for consistency and to avoid confusion in the future.
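A minimal sketch of what reporting both counts could look like. The function name length_metrics and the encode parameter are hypothetical; encode stands in for whatever maps text to token IDs (for example a tokenizer's encode method), and this is not the code used for Llama2-70B. Whether special tokens are counted depends on the tokenizer settings and would need to be decided separately.

```python
def length_metrics(preds, encode):
    """Report character and token totals for decoded predictions.

    `encode` is a hypothetical stand-in for any callable that turns a
    string into a sequence of token IDs.
    """
    gen_len = sum(len(pred) for pred in preds)              # characters
    gen_tok_len = sum(len(encode(pred)) for pred in preds)  # tokens
    gen_num = len(preds)
    return {
        "gen_len": gen_len,
        "gen_num": gen_num,
        "gen_tok_len": gen_tok_len,
        # Guard against an empty prediction list.
        "tokens_per_sample": round(gen_tok_len / gen_num, 1) if gen_num else 0.0,
    }


# Toy usage: whitespace splitting takes the place of a real tokenizer.
metrics = length_metrics(["a short summary", "another one"], str.split)
print(metrics)
```

With both fields in the output, an average of gen_len / gen_num can no longer be mistaken for a token count.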

psyhtest avatar Jul 16 '25 09:07 psyhtest

The main question remains: Why is the maximum number of output tokens fixed at 128? From what we see, the model "wants" to "say" more in practically every case, but it's prevented from doing so. This is not very realistic for the summarization task.

Not to mention that this removes one of the typical optimization challenges: both the input length and the output length being randomly distributed.

psyhtest avatar Jul 16 '25 09:07 psyhtest

@psyhtest Valid point on the osl distribution. iirc one of the reasons was that without finetuning, the 8B was quite verbose, which is evident from most of the generated outputs being 128 tokens (in reality the lengths should vary, with some lower and of course others higher). But I think the actual decision to cap at 128 tokens was mistakenly borrowed from gpt-j (which is limited in max sequence length). We overlooked the ground truth output length distribution (attached below). Given that submission is very close, we may have to bring this discussion to the WG to see if we need some revision. From the summary, about 4% of ground truth lengths are > 128.

Ground truth output sequence length summary:

count    13368
mean        72.040171
std         32.064774
min         14
50%         67
90%        107
95%        123
96%        127
97%        133
99.9%      208.266000
max       1893.000000
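The "about 4%" figure can be roughly reproduced from the percentiles alone. The sketch below interpolates linearly between the 96% and 97% rows, which assumes that lengths are spread evenly between neighbouring percentiles. The helper percentile_rank is illustrative and not part of the reference code.

```python
# Percentile -> length pairs copied from the summary above.
percentiles = [(50, 67), (90, 107), (95, 123), (96, 127), (97, 133)]


def percentile_rank(length, table):
    """Linearly interpolate the percentile rank of `length`.

    ASSUMPTION: lengths are spread evenly between neighbouring percentiles.
    """
    for (p_lo, x_lo), (p_hi, x_hi) in zip(table, table[1:]):
        if x_lo <= length <= x_hi:
            return p_lo + (p_hi - p_lo) * (length - x_lo) / (x_hi - x_lo)
    raise ValueError("length lies outside the tabulated range")


rank = percentile_rank(128, percentiles)
share_above = 100 - rank
print(f"about {share_above:.1f}% of ground-truth summaries exceed 128 tokens")
```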

attafosu avatar Jul 17 '25 16:07 attafosu

Clarification: gen_len is CHARACTER count, not TOKEN count

I've analyzed the code and can clarify the confusion. There is no discrepancy - the model correctly respects the 128 token limit.

Key Finding

In language/llama3.1-8b/evaluation.py (lines 127-129):

prediction_lens = [len(pred) for pred in preds]  # len() on STRING = characters
result["gen_len"] = np.sum(prediction_lens)      # Sum of CHARACTER counts
result["gen_num"] = len(preds)                   # Number of samples

gen_len counts characters, not tokens. The len() function on a decoded string returns character count.

Math Verification

Your metrics:

  • gen_len: 8,167,644
  • gen_num: 13,368
  • Average: 611 characters per sample

Character-to-token ratio for this model on this dataset: ~4.8 chars/token

  • 128 tokens × 4.8 = ~614 characters

Perfect match! The model generates ≤128 tokens, which decode to ~611 characters on average.

Code Confirmation

SUT_VLLM.py line 75:

"max_tokens": 128,  # Hard limit enforced

Recommendation

To prevent future confusion, consider adding clarifying metrics:

result["gen_len_chars"] = np.sum(prediction_lens)  # Explicit: characters
result["avg_chars_per_sample"] = result["gen_len_chars"] / result["gen_num"]
# Optional: estimate tokens (model-specific ratio)
result["est_avg_tokens_per_sample"] = result["avg_chars_per_sample"] / 4.8

The 128 token limit is working correctly. The confusion stems from metric naming - gen_len suggests tokens but actually counts characters.

anivar avatar Jul 20 '25 13:07 anivar

From the summary, there's about 4% of ground truth lengths > 128

It looks like close to 100% of generated lengths would be > 128 without the cap, that is, the model is much chattier than the ground truth! Maybe the prompt should have asked for as concise a summary as possible. I think we should consider modifying it for the next round.

psyhtest avatar Jul 22 '25 15:07 psyhtest

WG Meeting: Fix in 6.0.

mrmhodak avatar Jul 22 '25 16:07 mrmhodak

Just to add on, initially during the taskforce we'd observed that without the 128 token limit we were getting much poorer accuracy despite the system prompt asking the model to be succinct

taran2210 avatar Jul 22 '25 16:07 taran2210

Perhaps this might be helpful?

If you haven't tried "threatening" LLMs in system prompts, then you should!

psyhtest avatar Jul 22 '25 16:07 psyhtest

That LinkedIn post about "threatening" LLMs is hilarious but actually makes sense! 😄

Given that LLAMA3 is being so chatty (100% outputs >128 tokens vs 4% in ground truth), maybe we do need to get a bit more... assertive with our prompts.

Instead of politely asking "please be concise", something like:

  • "Summary MUST be under 50 words. Exceeding this limit will result in immediate rejection."
  • "You have a strict 3-sentence limit. Every word counts."

It's funny how models respond better to consequences than kindness - just like they picked up on internet drama during training!

@taran2210's observation that the hard 128-token cutoff actually improves accuracy supports this too. Maybe for v6.0, combine both approaches - stern prompt + token limit?

anivar avatar Jul 24 '25 15:07 anivar

Hi @psyhtest One request that came up from the WG is whether you can generate some statistics of the generated output sequence lengths when the max output tokens is increased beyond 128. The goal is to see whether there will be some variation in the output lengths or whether they will skew towards the max tokens (the increased value), as seen in the case of 128.
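One possible shape for such statistics, as a sketch only: given the per-sample token counts from a run (for instance the token_count values in mlperf_log_accuracy.json), report a few summary values plus the share of samples that reach the cap. The function summarize_output_lengths and the example numbers are made up for illustration and are not measured results.

```python
from statistics import mean, median


def summarize_output_lengths(token_counts, max_tokens):
    """Summarize how generated lengths are distributed relative to the cap."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    # Samples that reach the cap were most likely truncated.
    at_cap = sum(1 for n in token_counts if n >= max_tokens)
    return {
        "count": len(token_counts),
        "min": min(token_counts),
        "median": median(token_counts),
        "mean": mean(token_counts),
        "max": max(token_counts),
        "share_at_cap": at_cap / len(token_counts),
    }


# Made-up token counts for illustration only, not measured results.
example = summarize_output_lengths([256, 256, 180, 256, 97], max_tokens=256)
print(example)
```

A share_at_cap close to 1.0 at every tested limit would indicate that the outputs simply skew towards the maximum, while a lower share would indicate real variation.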

attafosu avatar Sep 02 '25 17:09 attafosu

@sahelib25 Can we please try the reference with max tokens set to e.g. 256?

psyhtest avatar Sep 11 '25 14:09 psyhtest

Hi, when running the dataset with 128, 256, 1K, and 2K max tokens, the model consistently generated outputs of exactly the maximum length with no variation. With 4K max tokens, we start seeing variations in the output lengths.

sahelib25 avatar Sep 16 '25 10:09 sahelib25

@psyhtest @attafosu — let us know if you'd prefer to have the discussion at the start of the WG meeting. I've seen both joined over the past two weeks, but Anton had to drop off due to the late hour in Europe. Hopefully we can resolve this offline, but if it's easier to talk during the first 15 minutes of the meeting, please let Miro and me know.

hanyunfan avatar Sep 16 '25 19:09 hanyunfan

WG Meeting: @psyhtest to try out a few prompts and report

mrmhodak avatar Sep 23 '25 16:09 mrmhodak