text-generation-inference
Llama-3 support
Feature request
I tried to run Llama-3 on TGI (1.3). The model kind of works, but it doesn't stop at the EOS tokens. I suspect TGI doesn't "understand" Llama-3's new tokenization scheme and prompt template. Please add support for that.
Motivation
Llama-3 seems to be the new state of the art in its weight class.
Your contribution
Possibly.
Could there be a mismatch between the tokenizer version used to train the weights and the version used at load time? I'm not sure whether this is a problem with my weights.
I get this
Warning: Token '<|reserved_special_token_250|>' was expected to have ID '128255' but was given ID 'None'
2024-04-21T06:06:48.861440Z INFO text_generation_router: router/src/main.rs:471: Serving revision 561487d18c41c76bcb5fc6cfb73a324982f04f47 of model meta-llama/Meta-Llama-3-8B
I tried
prompt = """<|begin_of_text|> <|start_header_id|>system<|end_header_id|> You are a helpful assistant, providing informative and friendly answers to the user. <|eot_id|> <|start_header_id|>user<|end_header_id|> Hello! Can you tell me how tall the Eiffel Tower is? <|eot_id|> <|start_header_id|>assistant<|end_header_id|> The Eiffel Tower is 324 meters tall and is an iconic landmark of Paris. It was built in 1889 and was once the tallest man-made structure in the world. Now, it is one of the most popular tourist attractions in France. The tower is named after its designer, Gustave Eiffel. It was originally constructed for the 1889 Paris World's Fair, showcasing the architectural capabilities of the late 19th century. <|eot_id|> <|start_header_id|>user<|end_header_id|> How many visitors does the Eiffel Tower typically receive in a day? Do I need to book tickets in advance? <|eot_id|> <|start_header_id|>assistant<|end_header_id|>"""
the response is
{'generated_text': "\nThe Eiffel Tower receives around 7 million visitors annually. While you don't need to book tickets in advance, I recommend booking them online to avoid long lines and to guarantee your spot. You can find more information about visiting the Eiffel Tower and booking tickets here: https://www.eiffeltower.paris/en/. If you have any other questions, feel free to ask!\nspNetesModuleGeneratedNetTitle: spnet\nuser pip install spnet\nूडुங투"}
Which is a bit weird.
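As an aside, the hand-written prompt above can also be produced from the tokenizer's chat template, which avoids spacing mistakes around the special tokens (the reference Llama-3 template does not put spaces between them). A minimal sketch, assuming transformers with access to the gated Instruct repo, which ships the chat template:
# Sketch: build a Llama-3 prompt from the tokenizer's chat template instead of
# concatenating special tokens by hand (model ID assumes the gated Instruct repo).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant, providing informative and friendly answers to the user."},
    {"role": "user", "content": "Hello! Can you tell me how tall the Eiffel Tower is?"},
]

# add_generation_prompt=True appends the trailing assistant header so the model starts answering
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)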
I am facing the same issue as @shuaills
Same here. It just keeps generating until it gets to its max-gen-limit.
It does not work with TGI v1.4 or v2.0.1 either.
+1
Add the stop parameter; it works for me:
data = {
'inputs': prompt,
'parameters' : {
'max_new_tokens': 1024,
'stop': ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "<|reserved_special_token"]
}
}
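For completeness, a sketch of a full request against TGI's /generate route using these parameters; the base URL and prompt here are placeholders for your own deployment:
# Sketch: POST the payload above to a running TGI instance (URL/port are placeholders).
import requests

prompt = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHow tall is the Eiffel Tower?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

data = {
    'inputs': prompt,
    'parameters': {
        'max_new_tokens': 1024,
        'stop': ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "<|reserved_special_token"]
    }
}

response = requests.post("http://localhost:8080/generate", json=data, timeout=120)
print(response.json()["generated_text"])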
Yes, Llama-3 has two EOS tokens: <|eot_id|> as the end-of-turn token, and a "real" eos_token (not sure when that one is used). Currently the config defines the eos_token as the EOS token, which is what you're seeing here. This is what was intended by the Meta team when we received it; we're looking to update the config for the instruct models.
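To see the two tokens in question, a quick inspection sketch (assumes the gated Instruct repo is accessible; the IDs printed are whatever the downloaded tokenizer config contains):
# Sketch: inspect Llama-3's two candidate stop tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# The token the config currently points at as eos_token
print("eos_token:", tok.eos_token, tok.eos_token_id)
# The end-of-turn token the Instruct model actually emits at the end of each reply
print("<|eot_id|>:", tok.convert_tokens_to_ids("<|eot_id|>"))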
Does huggingface still use this image to serve their production models? Is it used by the llama3-70b chat that is currently deployed on https://huggingface.co/chat/ ?
Yes it is. And hf-chat sends that stop token currently.
@Narsil which version of TGI do you recommend for running Llama-3 models? We noticed 2.0.1 seems to be a bit slow; maybe you recommend an earlier version? We did not investigate much, though.
Yes it is. And hf-chat sends that stop token currently.
Why does my local deployment of llama3-70b-instruct perform worse than Hugging Chat when answering the same questions? Hugging Chat can provide correct answers, but my locally deployed version using TGI doesn't work as well.
Could you please tell me the deployment command for hf-chat?
I got this working by building the image from this blog post: https://lavaraja-padala.medium.com/deploy-google-gemma-2b-and-gemma-7b-models-on-aws-sagemaker-f441914ccc6f
TGI 2.1.1 seems to be what's being used internally. Weirdly, HF endpoints seem to be using 2.0.1.
Slow? What do you mean? What hardware, what TP? What exactly is slow in this case?
Okay, by slow I meant that it was not recognizing the stop tokens and was depleting the max_tokens with every request.
Upon further investigation, it appears that the system becomes erratic when parameters other than temperature and top_p are included, as it then disregards the stop tokens.
If you have deployed using TGI version 2.0.1, it should function correctly, but it is crucial to omit (set to None) presence_penalty and frequency_penalty from your parameters; otherwise, it leads to confusion in the generation process. Note that these parameters are often defaulted to 0, as indicated in the OpenAI API documentation.
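In concrete terms, and assuming TGI's OpenAI-compatible route at /v1/chat/completions, the working request looks roughly like this (base URL and sampling values are placeholders); note that the penalty fields are simply left out rather than sent as 0:
# Sketch: a chat-completions request that omits presence_penalty / frequency_penalty entirely.
import requests

payload = {
    "model": "tgi",  # TGI serves whichever model it was launched with; the name here is a placeholder
    "messages": [{"role": "user", "content": "How tall is the Eiffel Tower?"}],
    "max_tokens": 512,
    "temperature": 0.6,
    "top_p": 0.9,
    # presence_penalty / frequency_penalty deliberately omitted (i.e. None),
    # since sending 0 reportedly triggers the broken behaviour on 2.0.1
}

r = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=120)
print(r.json()["choices"][0]["message"]["content"])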
The frequency penalty is being fixed soon: https://github.com/huggingface/text-generation-inference/pull/1765.
For the stop token, yes, it's an unfortunate setup; we're working on changing the default in many places (basically there are 2 stop tokens...).
Thank you so much, @hooman-bayer! I'm using the v2.0.1 docker image and I was struggling with the model (70b-instruct) as it kept generating nonsense when presence_penalty and frequency_penalty were set to 0 (and it also looked like the stop tokens were not recognized either). As soon as I set these parameters to null in the request body, it started working as expected! The model now delivers outputs that are exactly in line with what I see on Hugging Face's chat. I do wonder, though, why that helped. Is it because it forces the inference pipeline to skip the logit-penalty modifications completely?
Anyway, thanks again for the great insight!
Yes it is. And hf-chat sends that stop token currently.
Why does my local deployment of llama3-70b-instruct perform worse than Hugging Chat when answering the same questions? Hugging Chat can provide correct answers, but my locally deployed version using TGI doesn't work as well.
Could you please tell me the deployment command for hf-chat?
Sorry, I used the wrong interface. Previously, I used 'generate', but after switching to 'v1/chat/completions', it started working normally.
Would you be able to post your settings and example call? I am unable to get llama3 to stop no matter what I try.
add stop parameter, it works for me
data = { 'inputs': prompt, 'parameters' : { 'max_new_tokens': 1024, 'stop': ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "<|reserved_special_token"] } }
Great find, thanks for sharing this. This works for me when I include it in the extra_body dictionary when using the OpenAI chat completions API with a text-generation-inference endpoint.
I am hoping that Hugging Face could update their documentation, though; some documents seem out of date or out of sync with the OpenAPI spec. This parameter is documented in the OpenAPI spec here: https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/generate but it was tough to find before I came across this solution. The documentation that appears much more frequently when searching for a solution to this problem is https://huggingface.co/docs/api-inference/detailed_parameters#text-generation-task, which does not contain all of the parameters listed in the OpenAPI spec.
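For reference, this is roughly how that looks with the OpenAI Python client pointed at a TGI endpoint; the base URL and model name are placeholders, and the stop list mirrors the one suggested above:
# Sketch: pass TGI's "stop" parameter through the OpenAI client's extra_body.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")  # placeholder key for a local deployment without auth

resp = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "How tall is the Eiffel Tower?"}],
    max_tokens=512,
    extra_body={"stop": ["<|eot_id|>", "<|start_header_id|>", "<|end_header_id|>"]},
)
print(resp.choices[0].message.content)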
Just tested llama3-8b on 2.0.2; it looks like this issue has been fixed: https://github.com/huggingface/text-generation-inference/pull/1808
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I just made an issue a few days ago that relates to this, #1984. I'm still encountering issues with the tokenizer, despite trying combinations of llama3-8b and TGI versions >= 2.0.2. Any guidance on these Warning: Token '<|some_llama3-specific_token|>' was expected to have ID '<SOME-SIX-DIGIT-ID>' but was given ID 'None' warnings would be much appreciated.
Facing the same issue.
I have also been testing TGI versions >= 2.0.2 with meta-llama/Meta-Llama-3-8B-Instruct from Hugging Face's model hub and am running into similar issues. In particular, stop=["\n\n"] seems to work fine, but stop=["---"] doesn't. In the latter case, the response from TGI contains the string --- in multiple places, which wouldn't happen if it were treated as a stop sequence.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.