
Weird bias towards numbers after a generic prompt

Open · vigna opened this issue on Mar 19, 2023 · 2 comments

In all the models up to 30B, using the standard parameters from example.py (and many variations on them), the continuations of the prompt "The first image that comes to my mind is " all start with a number. It can be a date, an actual number, some numbered passage from the Gospel, etc., but the token after "is" is always a number. I also tried invoking half() on the model, playing with the temperature, etc., but I couldn't change this behavior.

That looks like a really weird bias to me. Am I wrong?

I also cross-checked with the C++ implementation. In that case, the behavior stops once the number of tokens to predict goes beyond 200. So I guess there's something different in the initialization of the model (that I couldn't pin down).

vigna · Mar 19 '23 22:03

I agree. I think there may be something off with the multi-GPU checkpoints or code... see https://github.com/facebookresearch/llama/issues/212 for outputs of example.py for 7B, 13B, and 30B.

tbenst · Mar 21 '23 00:03

In my case, however, there's no difference between 7B and the other models. The bias is still there.

vigna · Mar 21 '23 02:03

Hi, I encountered and fixed the exact same issue back when Llama-1 was released, and I figured it would be nice to share what I learned back then, just in case someone encounters this issue in the future.

Short answer: Remove the trailing whitespace at the end of your prompt.

Long answer: Essentially, this bias comes from the trailing whitespace in the prompt. That whitespace creates a distributional shift because of how whitespace is tokenized during training. I put together an example to illustrate. Suppose the model is trained on the following sequence:

USER: salut! ASSISTANT: Bonjour.

At training time, Llama's tokenizer splits this sequence in the following way:

[Screenshot: tokenization of "USER: salut! ASSISTANT: Bonjour." at training time; the whitespace after "ASSISTANT:" is absorbed into the following piece "▁Bon"]

Notice how the whitespace after ASSISTANT: is merged with the following subword Bon. This means that the model is trained to look at the sequence of tokens [1, 3148, 10012, 29901, 4497, 329, 29991, 319, 1799, 9047, 13566, 29901] and to predict the token 8396 based on this prefix.
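
If you want to poke at this yourself, here is a minimal sketch using the sentencepiece package to inspect the pieces and IDs. The path "tokenizer.model" is a placeholder for wherever your checkpoint's tokenizer file lives, and the exact IDs may differ from the ones above (the leading 1 there is the BOS token that example.py adds via bos=True; this sketch does not add it).

```python
# Minimal sketch (not the repo's own code) for inspecting how Llama's
# SentencePiece tokenizer splits a sequence.
import sentencepiece as spm

# Placeholder path: point this at the tokenizer.model shipped with your checkpoint.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

text = "USER: salut! ASSISTANT: Bonjour."
print(sp.encode(text, out_type=str))  # subword pieces; the space before "Bon"
                                      # shows up inside the piece "▁Bon"
print(sp.encode(text))                # the corresponding token IDs (no BOS here)
```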

Let's suppose now that at inference time, you prompt the model with a trailing whitespace (as OP is doing). The sequence "USER: salut! ASSISTANT: " is tokenized in the following way:

[Screenshot: tokenization of "USER: salut! ASSISTANT: "; the trailing whitespace becomes a token of its own]

Notice how the trailing whitespace is now tokenized on its own. During training, the model has rarely seen this pattern, where a whitespace is tokenized on its own without a subword immediately following it. This creates a distributional shift that confuses the model, and you don't get the completion you expect, simply because you are evaluating the model in a setting that is poorly supported by the training data.
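
You can see the dangling-whitespace token directly by comparing the two prompts with the same sketch as above. The exact pieces and IDs depend on your tokenizer.model, but in the Llama-1/Llama-2 tokenizer the lone space typically shows up as the piece "▁":

```python
# Continuing the sketch above: the same prompt with and without a
# trailing whitespace.
with_space = sp.encode("USER: salut! ASSISTANT: ", out_type=str)
without_space = sp.encode("USER: salut! ASSISTANT:", out_type=str)

print(with_space)     # expected to end with [..., 'ANT', ':', '▁']  <- dangling whitespace piece
print(without_space)  # expected to end with [..., 'ANT', ':']       <- matches the training prefix
```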

Contrast this with the proper way of prompting the model, without trailing whitespace:

[Screenshot: tokenization of "USER: salut! ASSISTANT:"; no dangling whitespace token, the token sequence matches the training prefix]

Notice how this sequence of tokens matches the prefix that was used to train the model. The model is thus expected to complete with token 8396, which already contains the whitespace. You don't need to add this whitespace to the prompt; the model should predict it as part of the first subword. The same applies to OpenAI's models, which is why they recommend not ending a prompt with whitespace.
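
Applied to the prompt from the original report, the fix is simply to drop the trailing space before encoding. A hedged sketch (the commented-out generate call mirrors example.py; the exact signature may differ per version):

```python
# Strip the trailing whitespace and let the model predict it as part of
# the first subword of the continuation.
prompt = "The first image that comes to my mind is "  # note the trailing space
prompt = prompt.rstrip()                               # "...comes to my mind is"

# Then generate as in example.py, e.g. (signature may differ per version):
# results = generator.generate([prompt], max_gen_len=256, temperature=0.8, top_p=0.95)
```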

bilelomrani1 · Oct 22 '23 17:10

Fantastic. Artificial intelligence baffled by a trailing space 😂.

vigna · Oct 22 '23 18:10