lit-llama
Less is more for alignment (LIMA) - adding special EOT token
Hi,
Is there any help or guidance on how to add a special EOT token, as described in the LIMA paper by Meta?
More specifically, in section 3, Training LIMA, they describe the following:
To differentiate between each speaker (user and assistant), we introduce a special end-of-turn token (EOT) at the end
of each utterance; this token plays the same role as EOS of halting generation, but avoids conflation
with any other meaning that the pretrained model may have imbued into the preexisting EOS token
Thanks!
My understanding is that you introduce a token that wasn't there during pre-training and that takes on the specific meaning of ending both the instruction and the response. So the prompt will have the form instruction<EOT>response<EOT> (this is what I interpret as a "turn").
This way the model can still reason about the EOS tokens it saw during pre-training, without instruction-tuning assigning them a new, specific meaning.
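To make the instruction<EOT>response<EOT> layout concrete, here is a minimal sketch in plain Python. The literal string `<EOT>` and the helper names `format_dialogue` and `split_dialogue` are assumptions for illustration; the paper does not specify the marker text, and in a real setup the marker would also have to map to a single new token id in the tokenizer rather than be split into existing pieces.

```python
EOT = "<EOT>"  # hypothetical marker string; the paper does not give the literal text


def format_dialogue(turns):
    """Join alternating user/assistant utterances, closing each one with EOT."""
    return "".join(utterance + EOT for utterance in turns)


def split_dialogue(text):
    """Recover the utterances by splitting on EOT and dropping empty pieces."""
    return [part for part in text.split(EOT) if part]


prompt = format_dialogue(["Translate 'hello' to French.", "Bonjour."])
print(prompt)
print(split_dialogue(prompt))
```

At generation time, the same marker would serve as the stop condition: decoding halts once the model emits the EOT id.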
This is common for instruction/chat tuned models. For instance, StableLM does <|SYSTEM|>, <|ASSISTANT|>, <|USER|>; RedPajama-INCITE does <human>, <bot> or Q:, A:
Thanks @carmocca, @lantiga
Yeah, the idea is to use a special token to differentiate from existing ones so that the LLM would be able to cleanly follow instructions.
The examples above from StableLM are in-vocab sequences and not separate special tokens, correct?
In my use case, another option I was considering is to repurpose an existing in-vocab token that I know won't appear in my finetuning data, for example a certain symbol (such as $). I would randomly re-initialise the corresponding row in the embedding layer and then finetune so that the token takes on the new meaning.
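As a rough illustration of the re-initialisation idea, here is a toy sketch in plain Python. The embedding table is a list of lists standing in for the model's embedding weight matrix, and `reinit_row`, the token id `2`, and the scale `std=0.02` are all assumptions for illustration, not values taken from any particular model.

```python
import random


def reinit_row(embeddings, token_id, std=0.02, seed=0):
    """Overwrite one token's embedding row in place with fresh Gaussian noise.

    `embeddings` is a list of rows (lists of floats) standing in for the
    embedding weight matrix. `std` is an assumed initialisation scale.
    """
    rng = random.Random(seed)
    dim = len(embeddings[token_id])
    embeddings[token_id] = [rng.gauss(0.0, std) for _ in range(dim)]
    return embeddings


# Toy table: 4 tokens, 3 dimensions; pretend id 2 is the "$" token.
table = [[float(i)] * 3 for i in range(4)]
reinit_row(table, token_id=2)
print(table[2])  # only this row has changed
```

In a real model the same operation would be applied to one row of the embedding tensor, and if the output projection has its own (untied) weights, its matching row would probably need the same treatment.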
Thanks!
This is common for instruction/chat tuned models. For instance, StableLM does <|SYSTEM|>, <|ASSISTANT|>, <|USER|>; RedPajama-INCITE does <human>, <bot> or Q:, A:
The examples above from StableLM are in-vocab sequences and not separate special tokens, correct?
They are separate special tokens, you can check by Ctrl+F them in the pre-trained tokenizer config (https://huggingface.co/stabilityai/stablelm-base-alpha-3b/raw/main/tokenizer.json) and fine-tuned tokenizer config (https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b/raw/main/tokenizer.json). You'll see they aren't present in the pre-trained version.
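Instead of searching by hand, the comparison can be scripted. This is a sketch that assumes both files follow the usual Hugging Face `tokenizer.json` layout with a top-level `"added_tokens"` list whose entries carry a `"content"` field. The two JSON strings below are hypothetical, heavily trimmed stand-ins (including the ids) for the downloaded files, and `added_token_contents` is a helper name made up for this example.

```python
import json


def added_token_contents(tokenizer_json):
    """Return the set of token strings listed under "added_tokens"."""
    config = json.loads(tokenizer_json)
    return {entry["content"] for entry in config.get("added_tokens", [])}


# Hypothetical stand-ins for the pre-trained and fine-tuned tokenizer.json files.
base = '{"added_tokens": [{"id": 0, "content": "<|endoftext|>"}]}'
tuned = (
    '{"added_tokens": ['
    '{"id": 0, "content": "<|endoftext|>"}, '
    '{"id": 1, "content": "<|USER|>"}]}'
)

# Tokens that exist only in the fine-tuned tokenizer.
print(added_token_contents(tuned) - added_token_contents(base))
```

Running the same function on the real files (after downloading them from the two URLs above) should list the tokens that were added for fine-tuning.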
@kperi Do you happen to have the 1000 LIMA training examples?
Should the tokenizer be retrained with the new token added to the vocab or is it enough to just add my custom token to the fine-tuning data (e.g. prepend text with <human>, <bot>)?
@ryusaeba Unfortunately not. At the moment I am experimenting with LLaMA 7B in Greek. The results are not great so far, though; I think I'd need to finetune it on unstructured text before turning it into something useful.
@ryusaeba LIMA dataset made available here: https://huggingface.co/datasets/GAIR/lima
It's on my todo list for next week. Some other prios first, but yeah, I hope I can get it to work :)