Less is more for alignment (LIMA) - adding special EOT token

Open kperi opened this issue 2 years ago • 9 comments

Hi,

Any help or guidance on how to add a special EOT token, as described in the LIMA paper by Meta?

More specifically, in section 3, Training LIMA, they describe the following:

To differentiate between each speaker (user and assistant), we introduce a special end-of-turn token (EOT) at the end of each utterance; this token plays the same role as EOS in halting generation, but avoids conflation with any other meaning that the pretrained model may have imbued into the preexisting EOS token.

Thanks!

kperi avatar May 26 '23 22:05 kperi

My understanding is that you introduce a token that wasn't there during pre-training, and that it takes on the specific meaning of ending both instruction and response. So the prompt will be in the form instruction<EOT>response<EOT> (this is what I interpret as a "turn").

This way the model can still reason about the EOS tokens it saw during pre-training, without giving them a specific meaning during instruction-tuning.
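
In code, this would look roughly like the following. This is a minimal sketch using the Hugging Face `transformers` API (lit-llama wraps a raw SentencePiece tokenizer, so the exact calls there differ), and the checkpoint name is just illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

# Register <EOT> as a brand-new special token: it gets a fresh id the
# pretrained model has never seen, so it carries no prior meaning.
tokenizer.add_special_tokens({"additional_special_tokens": ["<EOT>"]})

# Grow the embedding matrix (and tied LM head) to cover the new id.
model.resize_token_embeddings(len(tokenizer))

# Each fine-tuning example then looks like instruction<EOT>response<EOT>.
def format_example(instruction: str, response: str) -> str:
    return f"{instruction}<EOT>{response}<EOT>"
```

At generation time you'd then stop on `<EOT>` rather than the original EOS, e.g. by passing `eos_token_id=tokenizer.convert_tokens_to_ids("<EOT>")` to `generate`.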

lantiga avatar May 29 '23 17:05 lantiga

This is common for instruction/chat-tuned models. For instance, StableLM uses <|SYSTEM|>, <|ASSISTANT|>, <|USER|>; RedPajama-INCITE uses <human>, <bot> or Q:, A:

carmocca avatar May 29 '23 17:05 carmocca

Thanks @carmocca, @lantiga

Yeah, the idea is to use a special token, distinct from existing ones, so that the LLM can cleanly follow instructions.

The examples above from StableLM are in-vocab sequences and not separate special tokens, correct?

In my use case I was also considering, as an option, re-purposing an existing in-vocab token that I know won't appear in my fine-tuning data, for example a certain symbol (such as $): randomly re-initialise the corresponding row in the embedding layer and then fine-tune to assign it the new meaning.
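
A rough sketch of that re-initialisation idea in PyTorch, assuming a `transformers`-style `model` and `tokenizer` are already loaded (with LLaMA's SentencePiece vocab you'd first want to verify the chosen symbol actually maps to a single token id):

```python
import torch

token_id = tokenizer.convert_tokens_to_ids("$")  # must be a single in-vocab id

embed = model.get_input_embeddings()  # an nn.Embedding
with torch.no_grad():
    # Wipe the pretrained meaning of this one row; fine-tuning then
    # teaches it the new end-of-turn role. std=0.02 is a common init scale.
    embed.weight[token_id].normal_(mean=0.0, std=0.02)

# If the input embeddings and LM head aren't tied, re-initialise the
# matching row of model.get_output_embeddings() as well.
```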

Thanks!

kperi avatar May 29 '23 20:05 kperi

The examples above from StableLM are in-vocab sequences and not separate special tokens, correct?

They are separate special tokens; you can check by searching for them (Ctrl+F) in the pre-trained tokenizer config (https://huggingface.co/stabilityai/stablelm-base-alpha-3b/raw/main/tokenizer.json) and the fine-tuned tokenizer config (https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b/raw/main/tokenizer.json). You'll see they aren't present in the pre-trained version.
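
One quick way to see the difference (hypothetical snippet, `transformers` API): a true special token encodes to a single id, while a plain in-vocab sequence splits into several pieces:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("stabilityai/stablelm-tuned-alpha-3b")
print(tok.encode("<|USER|>", add_special_tokens=False))  # a single id
print(tok.encode("Q:", add_special_tokens=False))        # several ids
```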

carmocca avatar May 30 '23 00:05 carmocca

@kperi Do you happen to have the 1,000 LIMA training examples?

ryusaeba avatar Jun 05 '23 01:06 ryusaeba

Should the tokenizer be retrained with the new token added to the vocab, or is it enough to just add my custom token to the fine-tuning data (e.g. prepend the text with <human>, <bot>)?

simhallq avatar Jun 05 '23 18:06 simhallq

@kperi Do you happen to have the 1,000 LIMA training examples?

@ryusaeba unfortunately not. At the moment I am experimenting with LLaMA 7B in Greek; the results aren't great so far. I think I'd need to fine-tune it on unstructured text first before turning it into something useful.

kperi avatar Jun 07 '23 20:06 kperi

@ryusaeba LIMA dataset made available here: https://huggingface.co/datasets/GAIR/lima

kperi avatar Jun 11 '23 20:06 kperi

It's on my todo list for next week. Some other prios first, but yeah, I hope I can get it to work :)

rasbt avatar Jun 11 '23 21:06 rasbt