
Reserved special tokens

Open mgerstgrasser opened this issue 10 months ago • 8 comments

Apologies in case this is documented somewhere and I missed it:

I notice that there are 250 "reserved special tokens" defined in the tokenizer. Is there any information available on what these are meant for, and what users are supposed to (not) do with them? For instance, could one use some of these tokens in finetunes (instead of adding additional tokens and resizing the vocabulary), or would that be problematic?

Thanks so much!

mgerstgrasser avatar Apr 19 '24 16:04 mgerstgrasser

could one use some of these tokens in finetunes (instead of adding additional tokens and resizing the vocabulary)

Yes you can, this is why they were added -- to support more use-cases without requiring vocab resize.

As long as you don't conflict with the ones currently being used, you can pick any of them for your use-case: https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L61-L74
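
For illustration, a minimal sketch of repurposing one of the reserved tokens without resizing the vocabulary. This assumes the Hugging Face transformers tokenizer for a Llama 3 checkpoint; the model name, token choice, and tag usage are illustrative, not an official recommendation:

from transformers import AutoTokenizer

# Model name is illustrative; any Llama 3 checkpoint with the same tokenizer works.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# <|reserved_special_token_10|> is already in the vocabulary, so no resize is needed.
lang_tag = "<|reserved_special_token_10|>"
tag_id = tokenizer.convert_tokens_to_ids(lang_tag)

# The reserved token encodes to a single id, so it can be dropped straight
# into fine-tuning data as a custom marker.
ids = tokenizer(lang_tag + " Bonjour tout le monde", add_special_tokens=False)["input_ids"]
assert tag_id in ids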

ruanslv avatar Apr 19 '24 23:04 ruanslv

@ruanslv Can you please say more about which reserved special tokens are already used? Based on the tokenizer code you linked, it seems that <|reserved_special_token_0|> through <|reserved_special_token_4|> are separated from the rest of the special tokens. However, I can't find any mention of their current usage or significance in the documentation.

AlienKevin avatar Apr 23 '24 02:04 AlienKevin

I used some reserved special tokens with indices higher than 10 in my fine-tuning corpus as language tags. The training was done with QLoRA, and the embedding layer was also fine-tuned. However, the model never converged and the validation loss stayed constant. Interestingly, after I switched to adding new special tokens, the loss immediately started to decrease. Does this have anything to do with the initial value of the reserved token embeddings?
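
For reference, a quick sketch to inspect the initial value in question, i.e. whether a reserved token's embedding row is effectively zero. It assumes a Hugging Face Llama 3 checkpoint; the model name is illustrative:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model name is illustrative.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

tag_id = tokenizer.convert_tokens_to_ids("<|reserved_special_token_10|>")
row = model.get_input_embeddings().weight[tag_id]
print(row.abs().max())  # prints a value near 0 if the row was never trained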

AlienKevin avatar Apr 23 '24 03:04 AlienKevin

https://twitter.com/danielhanchen/status/1781395882925343058

Seems like some of Llama 3's token embeddings are untrained (set to 0 or very close to 0): <|reserved_special_token_{0->250}|>, <|eot_id|>, <|start_header_id|>, <|end_header_id|>

Unsloth added a fix_untrained_tokens helper that resets the untrained token embeddings to the mean of the trained ones:

import torch

def fix_untrained_tokens(model, eps = 1e-16):
    """
    Llama 3, for example, has untrained vectors in the base model.
    These include <|eot_id|>, <|start_header_id|>, <|end_header_id|>.
    We reset them to the mean of the rest of the tokens.
    """
    embedding_matrix = model.get_input_embeddings ().weight.data
    lm_head_matrix   = model.get_output_embeddings().weight.data

    # Get untrained tokens: rows whose largest entry is effectively zero
    indicator_untrained = torch.amax(embedding_matrix, axis = 1) <= eps
    where_untrained = torch.where(indicator_untrained)[0]
    n_untrained = where_untrained.shape[0]
    n_trained = embedding_matrix.shape[0] - n_untrained
    if n_untrained != 0:
        print(
            f"Unsloth: Not an error, but your model has {n_untrained} untrained tokens.\n"
            "We shall set them to the mean of the other trained tokens."
        )

    # First zero out the untrained rows - they are sometimes not exactly 0
    # (e.g. ~1e-23 in bfloat16), which would skew the sums below
    embedding_matrix[where_untrained] = 0
    lm_head_matrix  [where_untrained] = 0

    # Sum over all rows (the untrained rows are now exactly 0)
    sum_embedding  = torch.sum(embedding_matrix, dtype = torch.float32, axis = 0)
    sum_lm_head    = torch.sum(lm_head_matrix,   dtype = torch.float32, axis = 0)

    # Divide by the number of trained tokens to get their mean
    mean_embedding = (sum_embedding / n_trained).to(embedding_matrix.dtype)
    mean_lm_head   = (sum_lm_head   / n_trained).to(lm_head_matrix  .dtype)

    # Overwrite the untrained rows with the mean
    embedding_matrix[where_untrained] = mean_embedding
    lm_head_matrix  [where_untrained] = mean_lm_head

    return mean_embedding, mean_lm_head
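
For example, a hypothetical standalone call after loading the base checkpoint (the model name is illustrative; this is just a sketch of how the helper above could be applied):

from transformers import AutoModelForCausalLM

# Model name is illustrative.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
mean_embedding, mean_lm_head = fix_untrained_tokens(model)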

AlienKevin avatar Apr 25 '24 23:04 AlienKevin

If this is how Llama 3 was pretrained, then during the SFT process should we include these special tokens (<|eot_id|>, <|start_header_id|>, etc.), i.e. leave them unmasked in the attention_mask?
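
For context, a sketch of how these tokens show up in the chat format. It assumes the Hugging Face tokenizer for the instruct checkpoint; it only illustrates the tokenization, not an official answer to the masking question:

from transformers import AutoTokenizer

# Model name is illustrative.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [{"role": "user", "content": "Hello"}]
ids = tokenizer.apply_chat_template(messages)
print(tokenizer.convert_ids_to_tokens(ids))
# <|begin_of_text|>, <|start_header_id|>, <|end_header_id|> and <|eot_id|> all
# appear as ordinary input ids; the tokenizer's attention_mask is 1 for them,
# like any other non-padding token.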

disperaller avatar May 16 '24 13:05 disperaller

Could you please explain which special token works as a separator (sep) token, or which special character can be used as a separator?

NivinaNull avatar May 17 '24 10:05 NivinaNull

could one use some of these tokens in finetunes (instead of adding additional tokens and resizing the vocabulary)

Yes you can, this is why they were added -- to support more use-cases without requiring vocab resize.

As long as you don't conflict with the ones currently being used, you can pick any of them for your use-case: https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L61-L74

Is this the preferred solution over just adding new tokens and extending the vocabulary? I would also like to have some kind of separator token. Is there any reason to use an existing special token over a new one?

Ben-Pfirsich avatar May 27 '24 12:05 Ben-Pfirsich

What if the model samples these extra special tokens during generation? Is there a preferred workaround?
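
For concreteness, the kind of workaround being asked about might look like this sketch, which suppresses the reserved token ids at decode time via transformers' bad_words_ids. The model name and the approach are illustrative, not a recommendation from the maintainers:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Model name is illustrative.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Collect the ids of all reserved special tokens from the added vocabulary.
reserved_ids = [
    [token_id]
    for token, token_id in tokenizer.get_added_vocab().items()
    if token.startswith("<|reserved_special_token_")
]

inputs = tokenizer("Hello", return_tensors="pt")
outputs = model.generate(**inputs, bad_words_ids=reserved_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))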

tongyx361 avatar Jun 10 '24 07:06 tongyx361