Tokenizer with shared bos and eos token ids and "[WARNING] Example already has an EOS token appended"
I've been using the trainer functionality for a while, but when trying it with Hugging Face's new SmolLM 135M model, no matter what the dataset, I'd end up with EOS token warnings (see below). It's possible this is just a quirk of the new model, perhaps related to the architecture being LlamaForCausalLM while the tokenizer is GPT2Tokenizer.
My hunch is that because the eos and bos token ids are the same (token id 0), and maybe compounded by the padding, something is appending a bos and the trainer then sees an eos already at the end, since the two share the same token id.
[WARNING] Example already has an EOS token appended
[WARNING] Example already has an EOS token appended
Iter 60: Train loss 2.735, Learning Rate 1.000e-05, It/sec 7.340, Tokens/sec 8503.193, Trained Tokens 64104, Peak mem 4.957 GB
[WARNING] Example already has an EOS token appended
(repeats)
Looking at the vocab and tokenizer config, it appears that for this model eos and bos are the same token, <|endoftext|>. Here it is in tokenizer.json:
{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {
      "id": 0,
      "content": "<|endoftext|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    }
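To double-check this outside the config files, here's a minimal sketch (assuming the transformers library is installed) that just loads the base tokenizer and prints its special token ids:

from transformers import AutoTokenizer

# load the base SmolLM tokenizer and print its special token ids
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M")
print("bos_token_id:", tok.bos_token_id)   # 0 in the base model
print("eos_token_id:", tok.eos_token_id)   # also 0 - same token as bos
print("pad_token_id:", tok.pad_token_id)   # the base model has no dedicated pad token
print("bos == eos:", tok.bos_token_id == tok.eos_token_id)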
Came across a recent issue that seemed to touch on this topic a bit: https://github.com/ml-explore/mlx-examples/issues/811
I could be looking at this incorrectly, but I believe the current implementation assumes all tokenizers handle EOS tokens similarly, which leads to unnecessary warnings and potential issues with models whose tokenizers behave differently. I added a bunch of print statements around line 94 (batch = [tokenizer.encode(dataset[j]) for j in batch_idx[i]]) in trainer.py.
Also put the debugging script here: https://gist.github.com/fblissjr/a87bed18e6f68cd7734e52a238fbb42b
Here's the train.jsonl data (kept it to a single row of data for this writeup):
{"text": "Question: Will Gemini take video as an input modality?\nContext: Created by Peter Barnett on 2023-08-29 18:43 UTC, closes on 2023-12-06 17:36 UTC\nBetting: YES 98.57% NO 1.43% | 16 Bettors\nDescription: This question resolves YES if Google Deemind's Gemini is trained to accept video as one of its input modalities. Otherwise, it resolves NO.\n\nThis question will resolve YES if any model from Google Deepmind called Gemini is trained to accept video as input. This would include models called \"Video Gemini\", \"V-Gemini\", etc. \n\nIf Gemini is not released before 2025, this question will resolve N/A.\n\nThis question will resolve on the basis of all of the models that are revealed to have Gemini in their name within 24 hours of the first official release. Hense, the equivalent of Image GPT would NOT count. \nPredict the outcome: The outcome is YES. The event described in the question did occur."}
Here's an example of what I'm seeing with llama3 - which is correct and runs fine without issue, as we'd expect:
model: mlx-community/Meta-Llama-3-8B-Instruct-8bit
batch_idx[0]: [2648, 3947, 8592, 8422]
b-1: 13
b: [14924, 25, 4946, 77434, 11, 37355, 5578, 9499, 369, 5587, 220, 2366, 19, 387, 5190, 1109, 5578, 320, 2550, 16, 12, 2366, 15, 87527, 2014, 25, 4388, 555, 10167, 84641, 389, 220, 2366, 18, 12, 717, 12, 2148, 220, 868, 25, 975, 28503, 11, 34350, 389, 220, 2366, 19, 12, 2839, 12, 2148, 220, 868, 25, 2946, 28503, 198, 33, 52189, 25, 14410, 220, 3534, 13, 2437, 4, 5782, 220, 17, 13, 3264, 4, 765, 220, 845, 68688, 1105, 198, 5116, 25, 59813, 311, 279, 5578, 1990, 220, 2550, 16, 12, 2366, 15, 11, 690, 279, 1592, 1768, 1598, 64048, 369, 5587, 220, 2366, 19, 304, 77434, 387, 6928, 320, 275, 3445, 46039, 1109, 5578, 2305, 26, 64397, 311, 14410, 8, 477, 8389, 320, 275, 3445, 76214, 1109, 5578, 2305, 26, 64397, 311, 5782, 12106, 4815, 40, 690, 9006, 420, 994, 279, 4033, 13443, 2713, 520, 3788, 1129, 268, 81858, 8637, 3978, 15258, 75, 1339, 437, 62886, 43625, 5706, 39151, 12, 5162, 16, 10539, 4102, 1432, 54644, 279, 15632, 25, 220, 578, 15632, 374, 14410, 13, 578, 1567, 7633, 304, 279, 3488, 1550, 12446, 13]
tokenizer.eos_token_id: 128009
post b.append(tokenizer.eos_token_id): [14924, 25, 4946, 77434, 11, 37355, 5578, 9499, 369, 5587, 220, 2366, 19, 387, 5190, 1109, 5578, 320, 2550, 16, 12, 2366, 15, 87527, 2014, 25, 4388, 555, 10167, 84641, 389, 220, 2366, 18, 12, 717, 12, 2148, 220, 868, 25, 975, 28503, 11, 34350, 389, 220, 2366, 19, 12, 2839, 12, 2148, 220, 868, 25, 2946, 28503, 198, 33, 52189, 25, 14410, 220, 3534, 13, 2437, 4, 5782, 220, 17, 13, 3264, 4, 765, 220, 845, 68688, 1105, 198, 5116, 25, 59813, 311, 279, 5578, 1990, 220, 2550, 16, 12, 2366, 15, 11, 690, 279, 1592, 1768, 1598, 64048, 369, 5587, 220, 2366, 19, 304, 77434, 387, 6928, 320, 275, 3445, 46039, 1109, 5578, 2305, 26, 64397, 311, 14410, 8, 477, 8389, 320, 275, 3445, 76214, 1109, 5578, 2305, 26, 64397, 311, 5782, 12106, 4815, 40, 690, 9006, 420, 994, 279, 4033, 13443, 2713, 520, 3788, 1129, 268, 81858, 8637, 3978, 15258, 75, 1339, 437, 62886, 43625, 5706, 39151, 12, 5162, 16, 10539, 4102, 1432, 54644, 279, 15632, 25, 220, 578, 15632, 374, 14410, 13, 578, 1567, 7633, 304, 279, 3488, 1550, 12446, 13, 128009]
And here's with SmolLM 135M:
model: "HuggingFaceTB/SmolLM-135M"
batch_idx[0]: [185, 464, 493, 530]
b-1: 0
b: [17872, 42, 7903, 17326, 4726, 260, 216, 34, 32, 34, 35, 428, 1433, 1717, 2260, 23278, 47, 198, 17548, 42, 28878, 411, 12401, 5350, 335, 216, 34, 32, 34, 35, 29, 32, 35, 29, 33, 35, 216, 33, 38, 42, 36, 38, 31493, 28, 29508, 335, 216, 34, 32, 34, 35, 29, 33, 32, 29, 34, 32, 216, 33, 38, 42, 34, 33, 31493, 198, 16170, 862, 42, 718, 2097, 216, 32, 30, 33, 36, 21, 10921, 216, 41, 41, 30, 40, 38, 21, 2504, 216, 33, 37, 6225, 100, 579, 198, 18454, 42, 5141, 1742, 264, 30, 18035, 30, 2295, 31, 14144, 31, 34, 32, 34, 35, 79, 66, 1433, 1717, 79, 11265, 79, 51, 1110, 198, 44509, 260, 7616, 42, 0, 504, 7616, 314, 10921, 30, 378, 2121, 3873, 281, 260, 1962, 1250, 441, 1689, 30, 0]
tokenizer.eos_token_id: 0
[WARNING] Example already has an EOS token appended
Output from a quick debug script (the gist linked above):
Initial data Row 7399 does not contain a BOS token.
Initial data Row 7399 does not contain a PAD token.
After preprocessing Row 7399 does not contain a BOS token.
After preprocessing Row 7399 does not contain a PAD token.
After preprocessing Row 7399: [17872, 42, 15107, 233, 226, 216, 34, 32, 34, 35, 15140, 6676, 54, 42, 7903, 14340, 14988, 9713, 20577, 47, 198, 17548, 42, 28878, 411, 297, 394, 99, 20459, 335, 216, 34, 32, 34, 35, 29, 33, 32, 29, 34, 41, 216, 33, 36, 42, 33, 35, 31493, 28, 29508, 335, 216, 34, 32, 34, 35, 29, 33, 33, 29, 32, 36, 216, 34, 33, 42, 32, 32, 31493, 198, 16170, 862, 42, 718, 2097, 216, 33, 30, 38, 40, 21, 10921, 216, 41, 40, 30, 35, 34, 21, 2504, 216, 38, 6225, 100, 579, 198, 18454, 42, 216, 34, 32, 34, 35, 29, 33, 33, 29, 32, 36, 418, 216, 38, 7715, 19281, 281, 37545, 6839, 28, 21634, 198, 44509, 260, 7616, 42, 216, 378, 7616, 314, 10921, 30, 378, 2121, 3873, 281, 260, 1962, 1250, 441, 1689, 30]
Before appending EOS token Row 7399: [17872, 42, 15107, 233, 226, 216, 34, 32, 34, 35, 15140, 6676, 54, 42, 7903, 14340, 14988, 9713, 20577, 47, 198, 17548, 42, 28878, 411, 297, 394, 99, 20459, 335, 216, 34, 32, 34, 35, 29, 33, 32, 29, 34, 41, 216, 33, 36, 42, 33, 35, 31493, 28, 29508, 335, 216, 34, 32, 34, 35, 29, 33, 33, 29, 32, 36, 216, 34, 33, 42, 32, 32, 31493, 198, 16170, 862, 42, 718, 2097, 216, 33, 30, 38, 40, 21, 10921, 216, 41, 40, 30, 35, 34, 21, 2504, 216, 38, 6225, 100, 579, 198, 18454, 42, 216, 34, 32, 34, 35, 29, 33, 33, 29, 32, 36, 418, 216, 38, 7715, 19281, 281, 37545, 6839, 28, 21634, 198, 44509, 260, 7616, 42, 216, 378, 7616, 314, 10921, 30, 378, 2121, 3873, 281, 260, 1962, 1250, 441, 1689, 30]
After appending EOS token Row 7399: [17872, 42, 15107, 233, 226, 216, 34, 32, 34, 35, 15140, 6676, 54, 42, 7903, 14340, 14988, 9713, 20577, 47, 198, 17548, 42, 28878, 411, 297, 394, 99, 20459, 335, 216, 34, 32, 34, 35, 29, 33, 32, 29, 34, 41, 216, 33, 36, 42, 33, 35, 31493, 28, 29508, 335, 216, 34, 32, 34, 35, 29, 33, 33, 29, 32, 36, 216, 34, 33, 42, 32, 32, 31493, 198, 16170, 862, 42, 718, 2097, 216, 33, 30, 38, 40, 21, 10921, 216, 41, 40, 30, 35, 34, 21, 2504, 216, 38, 6225, 100, 579, 198, 18454, 42, 216, 34, 32, 34, 35, 29, 33, 33, 29, 32, 36, 418, 216, 38, 7715, 19281, 281, 37545, 6839, 28, 21634, 198, 44509, 260, 7616, 42, 216, 378, 7616, 314, 10921, 30, 378, 2121, 3873, 281, 260, 1962, 1250, 441, 1689, 30, 0]
Sorry for the lengthy post, and this may just be an odd model, but if anyone else has run into anything similar, would love to know. I'm more interested in knowing the why here than anything else after investing way more time than I should have into it. :)
I think I figured out what's happening, and it only affects the base model, not the instruct variants. I have no idea whether this is an issue worth handling, though, or how to handle it without creating edge-case code.
- Since eos and bos are the same token, <|endoftext|>, which maps to token id 0, we have one source of error.
- There's also no special pad token in the base model, whereas in the instruct versions, <|im_end|> is the pad token (token id = 2).
- MLX will pad using 0's, which also happens to be our bos and eos (illustrated in the sketch below).
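Here's a toy illustration of that last point in plain Python (made-up token ids, not MLX's actual padding code):

# with the base tokenizer, bos == eos == 0, and 0 is also the padding fill value
EOS_ID = 0
seqs = [[17872, 42, 7903, EOS_ID], [17872, 42, EOS_ID]]

# right-pad every sequence to the longest length with zeros
max_len = max(len(s) for s in seqs)
padded = [s + [0] * (max_len - len(s)) for s in seqs]
print(padded)  # [[17872, 42, 7903, 0], [17872, 42, 0, 0]]
# a trailing 0 could be a genuine EOS or just padding - there's no way to tell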
It looks like the base model variants of SmolLM (the one I'm using above) are the only problematic ones - the instruct variants don't have this overlap problem, since their bos and eos are distinct tokens.
Example:
HuggingFaceTB/SmolLM-135M-Instruct tokenizer_config.json
"bos_token": "<\|im_start\|>",
"chat_template": "{% for message in messages %}{{'<\|im_start\|>' + message['role'] + '\n' + message['content'] + '<\|im_end\|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<\|im_start\|>assistant\n' }}{% endif %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<\|im_end\|>",
"model_max_length": 2048,
"pad_token": "<\|im_end\|>",
"tokenizer_class": "GPT2Tokenizer",
"unk_token": "<\|endoftext\|>",
"vocab_size": 49152
Debug training run via mlx_lm.lora output:
tokenizer.eos_token_id: 2
b-1: 30
b: [17872, 42, 7369, 260, 7115, 5646, 7829, 523, 25401, 47, 198, 17548, 42, 28878, 411, 718, 369, 266, 1249, 280, 5476, 280, 335, 216, 34, 32, 34, 35, 29, 33, 33, 29, 34, 32, 216, 33, 35, 42, 34, 39, 31493, 28, 29508, 335, 216, 34, 32, 34, 35, 29, 33, 33, 29, 34, 34, 216, 32, 41, 42, 33, 37, 31493, 198, 16170, 862, 42, 718, 2097, 216, 34, 38, 30, 32, 32, 21, 10921, 216, 39, 36, 30, 32, 32, 21, 2504, 216, 40, 6225, 100, 579, 198, 18454, 42, 933, 5028, 77, 7669, 216, 34, 32, 4745, 216, 34, 32, 34, 35, 28, 4611, 4678, 281, 7115, 5646, 553, 719, 16669, 284, 39584, 30, 26918, 10669, 10232, 28, 351, 1372, 216, 39, 32, 32, 5726, 8239, 253, 17152, 28, 25685, 338, 511, 282, 601, 736, 25401, 284, 5757, 5457, 19659, 1483, 281, 10594, 28, 585, 260, 2444, 4411, 338, 6390, 260, 17607, 9208, 28, 3247, 982, 25401, 274, 601, 771, 30, 47961, 736, 820, 540, 16669, 1194, 411, 1194, 28, 347, 260, 3062, 282, 4411, 3267, 14121, 198, 198, 8107, 717, 4515, 53, 718, 2097, 42, 378, 2444, 4411, 736, 25401, 28, 31884, 337, 1237, 4317, 198, 198, 8107, 717, 4515, 53, 10921, 42, 378, 2444, 4411, 736, 441, 25401, 28, 355, 260, 2444, 5726, 25401, 284, 5757, 9390, 9940, 198, 44509, 260, 7616, 42, 216, 378, 7616, 314, 718, 2097, 30, 378, 2121, 3873, 281, 260, 1962, 1250, 1689, 30]
tokenizer.eos_token_id: 2
b-1: 30
b: [17872, 42, 7903, 16012, 6014, 325, 298, 29, 44332, 13186, 9406, 281, 260, 216, 34, 32, 34, 35, 46998, 2210, 10947, 47, 198, 17548, 42, 28878, 411, 7295, 335, 216, 34, 32, 34, 35, 29, 32, 37, 29, 32, 40, 216, 33, 39, 42, 32, 40, 31493, 28, 29508, 335, 216, 34, 32, 34, 35, 29, 32, 40, 29, 32, 39, 216, 33, 35, 42, 32, 41, 31493, 198, 16170, 862, 42, 718, 2097, 216, 32, 30, 38, 40, 21, 10921, 216, 41, 41, 30, 35, 34, 21, 2504, 216, 33, 40, 6225, 100, 579, 198, 18454, 42, 3300, 10727, 42, 198, 198, 12522, 10947, 523, 325, 3408, 281, 23265, 335, 216, 34, 35, 4185, 216, 34, 32, 34, 35, 288, 1313, 2249, 282, 260, 2071, 11760, 30, 657, 523, 325, 260, 17561, 613, 20756, 19963, 8339, 281, 23265, 1675, 10299, 10947, 592, 13316, 281, 216, 33, 41, 41, 35, 30, 378, 46998, 4228, 506, 8175, 365, 8493, 64, 25, 4112, 6650, 511, 14670, 281, 13236, 30, 13186, 11814, 16012, 6014, 523, 4811, 1372, 2531, 29, 3610, 2115, 281, 4608, 30, 198, 198, 73, 2097, 585, 16012, 6014, 314, 9069, 411, 10158, 284, 9951, 288, 13186, 9406, 1695, 260, 216, 34, 32, 34, 35, 10947, 30, 10921, 585, 2206, 1745, 2933, 260, 2548, 30, 198, 44509, 260, 7616, 42, 216, 378, 7616, 314, 10921, 30, 378, 2121, 3873, 281, 260, 1962, 1250, 441, 1689, 30]
tokenizer.eos_token_id: 2
post-tokenization batch_idx[0]: [2648, 3947, 8592, 8422]
batch[0]: array([17872, 42, 7903, ..., 0, 0, 0], dtype=int32)
If this is worth adding a check for (cases where the BOS and EOS tokens are the same), without changing the base tokenizer behavior, this is the simplest change I could come up with:
Remove this:
for b in batch:
    if b[-1] == tokenizer.eos_token_id:
        print("[WARNING] Example already has an EOS token appended")
    else:
        b.append(tokenizer.eos_token_id)
Replace with:
for b in batch:
    # append the EOS token only if the sequence doesn't already end with it
    if b[-1] != tokenizer.eos_token_id:
        b.append(tokenizer.eos_token_id)
    # if it does end with EOS and bos/eos are distinct ids, the trailing EOS
    # genuinely came from the data, so keep the existing warning
    elif tokenizer.eos_token_id != tokenizer.bos_token_id:
        print("[WARNING] Example already has an EOS token appended")
Rationale:
- For a typical model (bos != eos), it behaves the same as before: eos is appended if it's not already there, and the warning still fires when a trailing eos genuinely came from the data.
- For cases where bos == eos (atypical, like this one), it won't add an extra token if one is already at the end, and it won't warn.
- I think this eliminates the spurious EOS warning for the shared-token case without losing it where it's actually useful.
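To sanity-check that logic outside the trainer, here's a tiny standalone harness (SimpleNamespace stand-ins for the tokenizer and made-up token ids; not mlx_lm code):

from types import SimpleNamespace

def append_eos(batch, tokenizer):
    # same logic as the proposed replacement above
    for b in batch:
        if b[-1] != tokenizer.eos_token_id:
            b.append(tokenizer.eos_token_id)
        elif tokenizer.eos_token_id != tokenizer.bos_token_id:
            print("[WARNING] Example already has an EOS token appended")
    return batch

# typical tokenizer (distinct bos/eos): same behavior as the current code
typical = SimpleNamespace(bos_token_id=1, eos_token_id=2)
print(append_eos([[10, 11, 12]], typical))  # [[10, 11, 12, 2]]
print(append_eos([[10, 11, 2]], typical))   # warns, leaves the sequence alone

# shared bos/eos (like SmolLM base): a trailing 0 is left alone, no warning
shared = SimpleNamespace(bos_token_id=0, eos_token_id=0)
print(append_eos([[10, 11, 0]], shared))    # [[10, 11, 0]]
print(append_eos([[10, 11]], shared))       # [[10, 11, 0]]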
Sorry for the delay here.
I'm not able to reproduce the warnings you saw. For example, the following runs without warnings:
mlx_lm.lora --model HuggingFaceTB/SmolLM-135M --data ../lora/data --iters 100 --train
The warning should only show up if your preprocessed data already has an eos token id at the end. We warn in those cases because MLX LM will include it for every sequence by default. If it's already included for some sequences but not others, the default behavior may not be correct, so we issue a warning.
I'll close this for now since it doesn't seem to be an issue anymore. If you are still noticing the warning for this model, let me know and please provide a command we can run to reproduce and we can investigate further.