Less is more for alignment (LIMA) - adding special EOT token

Open kperi opened this issue 2 years ago • 9 comments

Hi,

Any help or guidance on how to add a special EOT token, as described in the LIMA paper by Meta?

More specifically, in section 3, Training LIMA, they describe the following:

To differentiate between each speaker (user and assistant), we introduce a special end-of-turn token (EOT) at the end of each utterance; this token plays the same role as EOS in halting generation, but avoids conflation with any other meaning that the pretrained model may have imbued into the preexisting EOS token.

Thanks!

kperi avatar May 26 '23 22:05 kperi

My understanding is that you introduce a token that wasn't there during pre-training, and that it takes on the specific meaning of ending both instruction and response. So the prompt will be in the form instruction<EOT>response<EOT> (this is what I interpret as a "turn").

This way the model can still reason about the EOS tokens it saw during pre-training, without giving them a specific meaning during instruction-tuning.
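
In code, this would look roughly like the following. This is a minimal sketch using the Hugging Face `transformers` API (lit-llama wraps a raw SentencePiece tokenizer, so the exact calls there differ), and the checkpoint name is just illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

# Register <EOT> as a brand-new special token: it gets a fresh id the
# pretrained model has never seen, so it carries no prior meaning.
tokenizer.add_special_tokens({"additional_special_tokens": ["<EOT>"]})

# Grow the embedding matrix (and tied LM head) to cover the new id.
model.resize_token_embeddings(len(tokenizer))

# Each fine-tuning example then looks like instruction<EOT>response<EOT>.
def format_example(instruction: str, response: str) -> str:
    return f"{instruction}<EOT>{response}<EOT>"
```

At generation time you'd then stop on `<EOT>` rather than the original EOS, e.g. by passing `eos_token_id=tokenizer.convert_tokens_to_ids("<EOT>")` to `generate`.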

lantiga avatar May 29 '23 17:05 lantiga

This is common for instruction/chat-tuned models. For instance, StableLM uses <|SYSTEM|>, <|ASSISTANT|>, <|USER|>; RedPajama-INCITE uses <human>, <bot> or Q:, A:

carmocca avatar May 29 '23 17:05 carmocca

Thanks @carmocca, @lantiga

Yeah, the idea is to use a special token, distinct from existing ones, so that the LLM can cleanly follow instructions.

The examples above from StableLM are in-vocab sequences and not separate special tokens, correct?

In my use case I was also considering, as an option, re-purposing an existing in-vocab token that I know won't appear in my fine-tuning data, for example a certain symbol (such as $): randomly re-initialise the corresponding row in the embedding layer and then fine-tune to assign it the new meaning.
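
A rough sketch of that re-initialisation idea in PyTorch, assuming a `transformers`-style `model` and `tokenizer` are already loaded (with LLaMA's SentencePiece vocab you'd first want to verify the chosen symbol actually maps to a single token id):

```python
import torch

token_id = tokenizer.convert_tokens_to_ids("$")  # must be a single in-vocab id

embed = model.get_input_embeddings()  # an nn.Embedding
with torch.no_grad():
    # Wipe the pretrained meaning of this one row; fine-tuning then
    # teaches it the new end-of-turn role. std=0.02 is a common init scale.
    embed.weight[token_id].normal_(mean=0.0, std=0.02)

# If the input embeddings and LM head aren't tied, re-initialise the
# matching row of model.get_output_embeddings() as well.
```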

Thanks!

kperi avatar May 29 '23 20:05 kperi

The examples above from StableLM are in-vocab sequences and not separate special tokens, correct?

They are separate special tokens; you can check by searching for them (Ctrl+F) in the pre-trained tokenizer config (https://huggingface.co/stabilityai/stablelm-base-alpha-3b/raw/main/tokenizer.json) and the fine-tuned tokenizer config (https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b/raw/main/tokenizer.json). You'll see they aren't present in the pre-trained version.
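
One quick way to see the difference (hypothetical snippet, `transformers` API): a true special token encodes to a single id, while a plain in-vocab sequence splits into several pieces:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("stabilityai/stablelm-tuned-alpha-3b")
print(tok.encode("<|USER|>", add_special_tokens=False))  # a single id
print(tok.encode("Q:", add_special_tokens=False))        # several ids
```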

carmocca avatar May 30 '23 00:05 carmocca

@kperi Do you happen to have the 1,000 LIMA training examples?

ryusaeba avatar Jun 05 '23 01:06 ryusaeba

Should the tokenizer be retrained with the new token added to the vocab, or is it enough to just add my custom token to the fine-tuning data (e.g. prepend the text with <human>, <bot>)?

simhallq avatar Jun 05 '23 18:06 simhallq

@kperi Do you happen to have the 1,000 LIMA training examples?

@ryusaeba unfortunately not. At the moment I am experimenting with LLaMA 7B in Greek; the results aren't great so far. I think I'd need to fine-tune it on unstructured text first before turning it into something useful.

kperi avatar Jun 07 '23 20:06 kperi

@ryusaeba LIMA dataset made available here: https://huggingface.co/datasets/GAIR/lima

kperi avatar Jun 11 '23 20:06 kperi

It's on my todo list for next week. Some other prios first, but yeah, I hope I can get it to work :)

rasbt avatar Jun 11 '23 21:06 rasbt