Reserved special tokens
Apologies in case this is documented somewhere and I missed it:
I notice that there are 250 "reserved special tokens" defined in the tokenizer. Is there any information available on what these are meant for, and what users are supposed to (not) do with them? For instance, could one use some of these tokens in finetunes (instead of adding additional tokens and resizing the vocabulary), or would that be problematic?
Thanks so much!
could one use some of these tokens in finetunes (instead of adding additional tokens and resizing the vocabulary)
Yes you can, this is why they were added -- to support more use-cases without requiring vocab resize.
As long as you don't conflict with the ones currently being used, you can pick any of them for your use-case: https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L61-L74
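To make this concrete, here is a minimal sketch (not part of the original reply) of what that looks like with the Hugging Face tokenizer: you pick an unused reserved token and treat it as your own marker, with no call to resize_token_embeddings needed. The repo id and the choice of <|reserved_special_token_10|> are assumptions for illustration only.

import torch
from transformers import AutoTokenizer

# Assumed checkpoint; any Llama 3 tokenizer ships the same reserved tokens.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

sep = "<|reserved_special_token_10|>"          # any reserved token not already used by the chat format
sep_id = tokenizer.convert_tokens_to_ids(sep)  # already in the vocab, so no resize is needed
print(sep, "->", sep_id)

# Use it as a marker in fine-tuning data; it encodes to the single id above.
ids = tokenizer.encode(f"question{sep}answer", add_special_tokens=False)
print(ids)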
@ruanslv Can you please say more about which reserved special tokens are already used? Based on the tokenizer code you linked, it seems that <|reserved_special_token_0|> to <|reserved_special_token_4|> are separated from the rest of the special tokens. However, I can't find any mention of their current usage or significance in the docs.
I used some reserved special tokens with indices higher than 10 as language tags in my fine-tuning corpus. The training was done with QLoRA and the embedding layer was also fine-tuned. However, the model never converged and the validation loss stayed constant. Interestingly, after I switched to adding new special tokens, the loss immediately started to decrease. Does this have anything to do with the initial values of the reserved token embeddings?
https://twitter.com/danielhanchen/status/1781395882925343058
It seems like some of Llama 3's weights are untrained (set to 0 or very close to 0): <|reserved_special_token_{0->250}|>, <|eot_id|>, <|start_header_id|>, <|end_header_id|>.
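You can check this claim yourself with a small sketch (not part of the tweet) that inspects the embedding rows of these tokens; the repo id below is an assumption and can be swapped for whichever base checkpoint you use.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

embeddings = model.get_input_embeddings().weight  # shape: (vocab_size, hidden_size)
print("mean row norm over the whole vocab:",
      embeddings.float().norm(dim=-1).mean().item())

for tok in ["<|eot_id|>", "<|start_header_id|>", "<|end_header_id|>",
            "<|reserved_special_token_10|>"]:
    tok_id = tokenizer.convert_tokens_to_ids(tok)
    print(tok, "id:", tok_id, "row norm:", embeddings[tok_id].float().norm().item())
# Rows with a norm of ~0 were never updated during pretraining.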
Unsloth added a fix_untrained_tokens helper to set the untrained tokens to the mean of the trained tokens:
import torch

def fix_untrained_tokens(model, eps=1e-16):
    """
    Llama-3, for example, has untrained vectors in the base model.
    These include <|eot_id|>, <|start_header_id|>, <|end_header_id|>.
    We reset them to the mean of the rest of the tokens.
    """
    embedding_matrix = model.get_input_embeddings().weight.data
    lm_head_matrix = model.get_output_embeddings().weight.data

    # Get untrained tokens: rows whose largest entry is numerically zero
    indicator_untrained = torch.amax(embedding_matrix, axis=1) <= eps
    where_untrained = torch.where(indicator_untrained)[0]
    n_untrained = where_untrained.shape[0]
    n_trained = embedding_matrix.shape[0] - n_untrained
    if n_untrained != 0:
        print(
            f"Unsloth: Not an error, but your model has {n_untrained} untrained tokens.\n"
            "We shall set them to the mean of the other trained tokens."
        )

    # First set untrained rows to exactly 0 - sometimes they aren't! (e.g. 1e-23 for bfloat16)
    embedding_matrix[where_untrained] = 0
    lm_head_matrix[where_untrained] = 0

    # Sum over the vocabulary dimension
    sum_embedding = torch.sum(embedding_matrix, dtype=torch.float32, axis=0)
    sum_lm_head = torch.sum(lm_head_matrix, dtype=torch.float32, axis=0)

    # Find the correct average by dividing by the number of trained tokens
    mean_embedding = (sum_embedding / n_trained).to(embedding_matrix.dtype)
    mean_lm_head = (sum_lm_head / n_trained).to(lm_head_matrix.dtype)

    # Set the untrained rows to the mean
    embedding_matrix[where_untrained] = mean_embedding
    lm_head_matrix[where_untrained] = mean_lm_head
    return mean_embedding, mean_lm_head
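A hedged usage sketch (not from the thread): with a Hugging Face causal LM already loaded, you would call the helper once before training so the reserved tokens start from a sensible embedding instead of zeros. The repo id is an assumption.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # assumed checkpoint
mean_embedding, mean_lm_head = fix_untrained_tokens(model)
# ...then proceed with the usual fine-tuning / Trainer setup.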
If this is how Llama 3 was pretrained, then during the SFT process should we include these special tokens (<|eot_id|>, <|start_header_id|>, etc.), i.e. leave them unmasked in the attention_mask?
Could you please explain which special token works as a sep token, or which special character can be used as a separator?
could one use some of these tokens in finetunes (instead of adding additional tokens and resizing the vocabulary)
Yes you can, this is why they were added -- to support more use-cases without requiring vocab resize.
As long as you don't conflict with the ones currently being used, you can pick any of them for your use-case: https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L61-L74
Is this the preferred solution over just adding new tokens and extending the vocabulary? I would also like to have some kind of separator token. Is there any reason to use an existing special token over a new one?
What if the model samples these extra special tokens during generation? Is there a preferred workaround?