
[Severe Bug] Performance Degradation Starting from v4.42.*

Open terryyz opened this issue 1 year ago • 22 comments

System Info

Hi @ArthurZucker and the team,

We have noticed a severe performance degradation when trying to reproduce the BigCodeBench results: https://github.com/bigcode-project/bigcodebench/issues/21. I initially thought it was due to an update of vllm, but actually it was not. The greatly affected models include the 01-ai/Yi-1.5, google/codegemma, and meta-llama/Meta-Llama-3 families. Specifically, I observed some weird extra spaces in the 01-ai/Yi-1.5-9B-Chat outputs. These spaces only appear starting from v4.42.*, not before. I also later tried generating without vllm, which is not documented in the issue thread; the problem indeed appears to be related to transformers, not vllm.

cc @takkyu2 who reported the original issue.

Who can help?

@ArthurZucker

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

The steps to reproduce the problem and the corresponding results are described in https://github.com/bigcode-project/bigcodebench/issues/21.

Expected behavior

The model output quality should be similar to that of the pregenerated LLM outputs.

terryyz avatar Jul 10 '24 11:07 terryyz

Hey! I'll try to skim through the resources, but I can't really think of something that could trigger the addition of spaces. I don't know where they are added, but is this related to #26678?

ArthurZucker avatar Jul 10 '24 12:07 ArthurZucker

Can you share a small snippet or just a string where you saw extra spaces?

ArthurZucker avatar Jul 10 '24 12:07 ArthurZucker

fyi @itazap

ArthurZucker avatar Jul 10 '24 12:07 ArthurZucker

Thanks for getting back to me @ArthurZucker! The extra spaces usually appear in the prompt part of the model outputs.

Here is an example of 01-ai/Yi-1.5-9B-Chat:

{"task_id": "BigCodeBench/0", "solution": "import itertools\nfrom random import shuffle\n\ndef task_func(numbers=list(range(1, 11))):\n    \"\"\"\n    Calculates the average of the sums of absolute differences between each pair of consecutive numbers \n    for all permutations of a given list.  Each permutation is shuffled before calculating the differences.\n\n    Args:\n    - numbers (list): A list of numbers.  Default is numbers from 1  to 10.\n    \n    Returns:\n    float: The average of the sums of absolute differences for each shuffled permutation of the list.\n\n    Requirements:\n    - itertools\n    - random. shuffle\n\n    Example:\n    >>> result = task_func([1,  2,  3 ])\n    >>> isinstance(result,  float)\n    True\n    \"\"\"\n    sum_diffs = 0\n    permutations = list(itertools.permutations(numbers))\n    for perm in permutations:\n        shuffle(perm)\n        diffs = [abs(perm[i] - perm[i + 1]) for i in range(len(perm) - 1)]\n        sum_diffs += sum(diffs)\n    return sum_diffs / len(permutations)"}

The ideal one should be:

{"task_id": "BigCodeBench/0", "solution": "import itertools\nfrom random import shuffle\n\ndef task_func(numbers=list(range(1, 11))):\n    \"\"\"\n    Calculates the average of the sums of absolute differences between each pair of consecutive numbers \n    for all permutations of a given list. Each permutation is shuffled before calculating the differences.\n\n    Args:\n    - numbers (list): A list of numbers. Default is numbers from 1 to 10.\n    \n    Returns:\n    float: The average of the sums of absolute differences for each shuffled permutation of the list.\n\n    Requirements:\n    - itertools\n    - random.shuffle\n\n    Example:\n    >>> result = task_func([1, 2, 3])\n    >>> isinstance(result, float)\n    True\n    \"\"\"\n    sum_diffs = 0\n    permutations = list(itertools.permutations(numbers))\n    for perm in permutations:\n        shuffle(perm)\n        diffs = [abs(perm[i] - perm[i + 1]) for i in range(len(perm) - 1)]\n        sum_diffs += sum(diffs)\n    return sum_diffs / len(permutations)"}

For example, random. shuffle should be random.shuffle.

I haven't checked other models, so I'm not sure if it's a common pattern or not.

terryyz avatar Jul 10 '24 12:07 terryyz

Could you share how you call the tokenizer (i.e. how you initialize it)? The Yi model you shared uses the legacy behaviour, which is known to add some spaces after "added tokens". In this case, I don't think . is an added token; however, if you tokenize sentence by sentence, you could run into this issue.

#30964 and #31305 are the only "big" changes that happened. Could you try to set add_prefix_space to False?
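
For example, something like this (a minimal untested sketch, using the Yi checkpoint from your example) should pass it through:

from transformers import AutoTokenizer

# add_prefix_space=False disables the automatic space prepended to the input
tokenizer = AutoTokenizer.from_pretrained(
    "01-ai/Yi-1.5-9B-Chat",
    add_prefix_space=False,
)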

ArthurZucker avatar Jul 10 '24 12:07 ArthurZucker

It seems like:

  • after each . there is sometimes a space
  • after some digits like 1 and 3 (which are sometimes added as special tokens), there is a space
  • after , there is an extra space

I don't know if this happens during encoding or during decoding, which is important for us to know in order to fix it!
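
A quick way to tell them apart (a rough sketch using one of the strings from your example) could be:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
ids = tok("random.shuffle").input_ids
print(ids)              # inspect the ids: is "." already split off oddly? -> encoding side
print(tok.decode(ids))  # does the extra space only show up here? -> decoding side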

ArthurZucker avatar Jul 10 '24 12:07 ArthurZucker

It's on https://github.com/bigcode-project/bigcodebench/blob/bbe93d673fd236e99b81cd2d7f110b63c9c2da35/bigcodebench/model.py#L137 and https://github.com/bigcode-project/bigcodebench/blob/bbe93d673fd236e99b81cd2d7f110b63c9c2da35/bigcodebench/model.py#L197.

self.tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_name, **kwargs)

Let me try add_prefix_space=False first.

terryyz avatar Jul 10 '24 12:07 terryyz

Actually using legacy=False, from_slow=True is most probably gonna be good
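
Something along these lines (an untested sketch, same AutoTokenizer call as in your snippet):

from transformers import AutoTokenizer

# from_slow=True rebuilds the fast tokenizer from the slow sentencepiece one,
# legacy=False turns off the legacy space handling
tokenizer = AutoTokenizer.from_pretrained(
    "01-ai/Yi-1.5-9B-Chat",
    legacy=False,
    from_slow=True,
)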

ArthurZucker avatar Jul 10 '24 12:07 ArthurZucker

The issue could just as well be the chat template call, given that this is something that was touched in that timeframe, while the tokenizer.json and tokenizer_config were not (on the hub).

ArthurZucker avatar Jul 10 '24 12:07 ArthurZucker

The issue could just as well be the chat template call, given that this is something that was touched in that timeframe, while the tokenizer.json and tokenizer_config were not (on the hub)

Yeah, similar to my thoughts. We've only tested Chat models. However, Qwen/CodeQwen1.5-7B-Chat may not be affected. Do you have any ideas? https://github.com/bigcode-project/bigcodebench/issues/21#issuecomment-2218081046

terryyz avatar Jul 10 '24 12:07 terryyz

Yes, it uses the PreTrainedTokenizerFast class vs the LlamaTokenizer class -> the tokenizer.json is different and does not use the legacy prepend normalizer. It uses: (see attached screenshot)

and "add_prefix_space": false, in the config

ArthurZucker avatar Jul 10 '24 13:07 ArthurZucker

~add_prefix_space=False: extra space exists. legacy=False, from_slow=True: extra space exists. AutoTokenizer.from_pretrained(name, add_prefix_space=False, legacy=False, from_slow=True, **kwargs) also won't work. version: v4.42.3 😢~

My bad, the extra space no longer exists.

I'll check the final results to see if the scores are similar to the reported ones.

terryyz avatar Jul 10 '24 13:07 terryyz

@ArthurZucker BTW, if add_prefix_space=False fixes the performance degradation, is it still considered a bug in the tokenizer? I mean, I didn't expect to have to add add_prefix_space=False, for example 🤔

terryyz avatar Jul 10 '24 13:07 terryyz

I don't really know, because I don't remember exactly how the tokenizer is called / how the tokenizer was trained. Basically, for a lot of models the add_prefix_space setting comes from the sentencepiece model. We just follow what is there, but this was not the case before.

What fixed your issue? add_prefix_space=False? Or legacy=False? (I would maybe leave add_prefix_space as it is, but set legacy to False.)

ArthurZucker avatar Jul 10 '24 13:07 ArthurZucker

Both add_prefix_space=False and legacy=False work with pure transformers, but not with vllm.

I'll run the full generation with legacy=False, which takes a bit of time with pure transformers.

terryyz avatar Jul 10 '24 13:07 terryyz

@ArthurZucker When generating with legacy=False, I noticed that some samples couldn't be generated correctly: (see attached screenshot)

There should be some other issues, I guess?

terryyz avatar Jul 10 '24 14:07 terryyz

I am not super sure, I don't know how vLLM calls the tokenizer, but apply_chat_template might be doing something here! I'm sorry, it's a bit hard for me; the best would be if you could get me the chat that you send through the model (if you call apply_chat_template)!
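
Something like this would already help (a rough sketch; tokenizer is your loaded tokenizer and messages stands in for whatever chat you build per task):

# dump exactly what goes into the model, both as text and as token ids
messages = [{"role": "user", "content": "..."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
print(text)
print(ids)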

ArthurZucker avatar Jul 10 '24 14:07 ArthurZucker

The first few prompts sent to the vLLM calls are in yi_vllm.txt. They look correct, but the outputs are bad.

Do you also need the inputs for the transformers calls, to see why some outputs are missing?

terryyz avatar Jul 10 '24 14:07 terryyz

It would be nice to have the input token ids and the input text / chat, yep.

ArthurZucker avatar Jul 10 '24 16:07 ArthurZucker

Yes, for transformers only!

ArthurZucker avatar Jul 10 '24 16:07 ArthurZucker

cc @Rocketknight1 if you have an idea as well

ArthurZucker avatar Jul 10 '24 16:07 ArthurZucker

Here are the top 10 pairs of prompts and token ids! yi_hf.txt

terryyz avatar Jul 10 '24 16:07 terryyz

Hi @terryyz, can you please also share the model outputs for those prompts, to show the issue? Thanks!

itazap avatar Jul 11 '24 16:07 itazap

Hi @itazap, are there any specific files you want to see? Or just the ones where the model degraded? If that's the case, there were plenty of them in the original thread: https://github.com/bigcode-project/bigcodebench/issues/21

terryyz avatar Jul 11 '24 16:07 terryyz

@terryyz Sorry, I understood that add_prefix_space=False and legacy=False worked for transformers and fixed the degradation? Can you please clarify what you meant / what the current issue is that I can look into? Thank you!

itazap avatar Jul 11 '24 16:07 itazap

@itazap Unfortunately legacy=False hasn't fixed the model degradation. While I noticed that the extra spaces were somehow removed, the new issues shown in https://github.com/huggingface/transformers/issues/31890#issuecomment-2220650597 were unforeseen: a lot of the outputs just disappeared after the import.

terryyz avatar Jul 11 '24 16:07 terryyz

@itazap @ArthurZucker Update:

EOS = [
    "<|endoftext|>",
    "<|endofmask|>",
    "</s>",
    "\nif __name__",
    "\ndef main(",
    "\nprint(",
]

stop_sequencer = StopSequencer(
    self.model,
    model_type="causal",  # or seq2seq
    tokenizer=self.tokenizer,
)

model = stop_sequencer.register_stop_texts(
    stop_texts=self.eos,
    input_length=input_tokens.size(-1),
)
outputs = model.generate(
    input_tokens,
    max_new_tokens=self.max_new_tokens,
    do_sample=do_sample,
    num_return_sequences=min(self.batch_size, num_samples),
    pad_token_id=self.tokenizer.eos_token_id,
    **kwargs,
)
# self.eos: ['<|endoftext|>', '<|endofmask|>', '</s>', '\nif __name__', '\ndef main(', '\nprint(', '\n```\n']

It seems the above part caused the sudden stop of the generation. However, none of the EOS strings appear in the outputs. Do you know if this is expected? I did a similar setup for vLLM, for example:

vllm_outputs = self.llm.generate(
    [prompt] * batch_size,
    SamplingParams(
        temperature=self.temperature,
        max_tokens=self.max_new_tokens,
        top_p=0.95 if do_sample else 1.0,
        stop=self.eos,
    ),
    use_tqdm=False,
)

This is the codebase of stop_sequencer: https://github.com/hyunwoongko/stop-sequencer. It may no longer be compatible with transformers, IMO?
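
If that's the case, I might try replacing stop_sequencer with a plain StoppingCriteria, something like this (an untested sketch that reuses self.eos and the prompt length from the snippet above; for simplicity it only checks the first sequence of the batch):

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnTexts(StoppingCriteria):
    """Stop generation once any stop text appears in the newly generated tokens."""
    def __init__(self, stop_texts, tokenizer, input_length):
        self.stop_texts = stop_texts
        self.tokenizer = tokenizer
        self.input_length = input_length

    def __call__(self, input_ids, scores, **kwargs):
        # only decode the part generated after the prompt
        new_text = self.tokenizer.decode(input_ids[0, self.input_length:])
        return any(stop in new_text for stop in self.stop_texts)

outputs = self.model.generate(
    input_tokens,
    max_new_tokens=self.max_new_tokens,
    do_sample=do_sample,
    pad_token_id=self.tokenizer.eos_token_id,
    stopping_criteria=StoppingCriteriaList(
        [StopOnTexts(self.eos, self.tokenizer, input_tokens.size(-1))]
    ),
)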

vLLM may have a different setup for the tokenizer; should I cross-post this issue?

terryyz avatar Jul 12 '24 12:07 terryyz

Sorry, I find it a bit hard to follow; which outputs are you looking for the EOS in?

itazap avatar Jul 12 '24 14:07 itazap

Oh, it's a further investigation of https://github.com/huggingface/transformers/issues/31890#issuecomment-2220650597. I removed the stop_sequencer part and the outputs then looked ideal.

terryyz avatar Jul 12 '24 14:07 terryyz

@itazap @ArthurZucker Is there going to be a PR to make the correct setup the default? 👀

And since vLLM possibly has a different implementation for the tokenizer, should we inform them?

terryyz avatar Jul 15 '24 18:07 terryyz