text-generation-inference
Llama-3 support
Feature request
I tried to run Llama-3 on TGI (1.3). The model kind of works, but it doesn't stop at the EOS tokens. I suspect TGI doesn't "understand" Llama-3's new tokenization scheme and prompt template. Please add support for that.
Motivation
Llama-3 seems to be the new state of the art in its weight class.
Your contribution
Possibly.
Could there be a mismatch between the tokenizer version used to train the weights and the version used at load time? I'm not sure whether this is a problem with my weights.
I get this
Warning: Token '<|reserved_special_token_250|>' was expected to have ID '128255' but was given ID 'None'
2024-04-21T06:06:48.861440Z INFO text_generation_router: router/src/main.rs:471: Serving revision 561487d18c41c76bcb5fc6cfb73a324982f04f47 of model meta-llama/Meta-Llama-3-8B
I tried
prompt = """<|begin_of_text|> <|start_header_id|>system<|end_header_id|> You are a helpful assistant, providing informative and friendly answers to the user. <|eot_id|> <|start_header_id|>user<|end_header_id|> Hello! Can you tell me how tall the Eiffel Tower is? <|eot_id|> <|start_header_id|>assistant<|end_header_id|> The Eiffel Tower is 324 meters tall and is an iconic landmark of Paris. It was built in 1889 and was once the tallest man-made structure in the world. Now, it is one of the most popular tourist attractions in France. The tower is named after its designer, Gustave Eiffel. It was originally constructed for the 1889 Paris World's Fair, showcasing the architectural capabilities of the late 19th century. <|eot_id|> <|start_header_id|>user<|end_header_id|> How many visitors does the Eiffel Tower typically receive in a day? Do I need to book tickets in advance? <|eot_id|> <|start_header_id|>assistant<|end_header_id|>"""
the response is
{'generated_text': "\nThe Eiffel Tower receives around 7 million visitors annually. While you don't need to book tickets in advance, I recommend booking them online to avoid long lines and to guarantee your spot. You can find more information about visiting the Eiffel Tower and booking tickets here: https://www.eiffeltower.paris/en/. If you have any other questions, feel free to ask!\nspNetesModuleGeneratedNetTitle: spnet\nuser pip install spnet\nूडुங투"}
Which is a bit weird.
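As an aside, the hand-written prompt above can also be produced from the tokenizer's chat template, which avoids spacing mistakes around the special tokens (the reference Llama-3 template does not put spaces between them). A minimal sketch, assuming transformers with access to the gated Instruct repo, which ships the chat template:
# Sketch: build a Llama-3 prompt from the tokenizer's chat template instead of
# concatenating special tokens by hand (model ID assumes the gated Instruct repo).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant, providing informative and friendly answers to the user."},
    {"role": "user", "content": "Hello! Can you tell me how tall the Eiffel Tower is?"},
]

# add_generation_prompt=True appends the trailing assistant header so the model starts answering
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)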
I am facing the same issue as @shuaills
Same here. It just keeps generating until it gets to its max-gen-limit.
It does not work with TGI v1.4 or v2.0.1 either.
+1
Add the stop parameter; it works for me:
data = {
'inputs': prompt,
'parameters' : {
'max_new_tokens': 1024,
'stop': ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "<|reserved_special_token"]
}
}
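For completeness, a sketch of a full request against TGI's /generate route using these parameters; the base URL and prompt here are placeholders for your own deployment:
# Sketch: POST the payload above to a running TGI instance (URL/port are placeholders).
import requests

prompt = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHow tall is the Eiffel Tower?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

data = {
    'inputs': prompt,
    'parameters': {
        'max_new_tokens': 1024,
        'stop': ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "<|reserved_special_token"]
    }
}

response = requests.post("http://localhost:8080/generate", json=data, timeout=120)
print(response.json()["generated_text"])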
Yes, Llama-3 has two EOS tokens: <|eot_id|> as the end-of-turn token, and a "real" eos_token (not sure when that one is used). Currently the config defines the eos_token as the EOS token, which is what you're seeing here. This is what was intended by the Meta team when we received it; we're looking to update the config for the instruct models.
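To see the two tokens in question, a quick inspection sketch (assumes the gated Instruct repo is accessible; the IDs printed are whatever the downloaded tokenizer config contains):
# Sketch: inspect Llama-3's two candidate stop tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# The token the config currently points at as eos_token
print("eos_token:", tok.eos_token, tok.eos_token_id)
# The end-of-turn token the Instruct model actually emits at the end of each reply
print("<|eot_id|>:", tok.convert_tokens_to_ids("<|eot_id|>"))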
Does huggingface still use this image to serve their production models? Is it used by the llama3-70b chat that is currently deployed on https://huggingface.co/chat/ ?
Yes it is. And hf-chat sends that stop token currently.
@Narsil which version of TGI do you recommend for running Llama-3 models? We noticed 2.0.1 seems to be a bit slow; maybe you recommend an earlier version? We did not investigate much, though.
Yes it is. And hf-chat sends that stop token currently.
Why does my local deployment of llama3-70b-instruct perform worse than Hugging Chat when answering the same questions? Hugging Chat can provide correct answers, but my locally deployed version using TGI doesn't work as well.
Could you please tell me the deployment command for hf-chat?
I got this working by building the image from this blog post: https://lavaraja-padala.medium.com/deploy-google-gemma-2b-and-gemma-7b-models-on-aws-sagemaker-f441914ccc6f
TGI 2.1.1 seems to be what's being used internally. Weirdly, HF endpoints seem to be using 2.0.1.
Slow? What do you mean? What hardware, what TP? What exactly is slow in this case?
Okay, by slow I meant that it was not recognizing the stop tokens and was depleting the max_tokens with every request.
Upon further investigation, it appears that the system becomes erratic when parameters other than temperature and top_p are included, as it then disregards the stop tokens.
If you have deployed using TGI version 2.0.1, it should function correctly, but it is crucial to omit (set to None) presence_penalty and frequency_penalty from your parameters; otherwise, it leads to confusion in the generation process. Note that these parameters are often defaulted to 0, as indicated in the OpenAI API documentation.
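In concrete terms, and assuming TGI's OpenAI-compatible route at /v1/chat/completions, the working request looks roughly like this (base URL and sampling values are placeholders); note that the penalty fields are simply left out rather than sent as 0:
# Sketch: a chat-completions request that omits presence_penalty / frequency_penalty entirely.
import requests

payload = {
    "model": "tgi",  # TGI serves whichever model it was launched with; the name here is a placeholder
    "messages": [{"role": "user", "content": "How tall is the Eiffel Tower?"}],
    "max_tokens": 512,
    "temperature": 0.6,
    "top_p": 0.9,
    # presence_penalty / frequency_penalty deliberately omitted (i.e. None),
    # since sending 0 reportedly triggers the broken behaviour on 2.0.1
}

r = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=120)
print(r.json()["choices"][0]["message"]["content"])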
The frequency penalty is being fixed soon: https://github.com/huggingface/text-generation-inference/pull/1765.
For the stop token, yes, it's an unfortunate setup; we're working on changing the default in many places (basically there are 2 stop tokens...).
Thank you so much, @hooman-bayer! I'm using the v2.0.1 docker image and I was struggling with the model (70b-instruct) as it kept generating nonsense when presence_penalty and frequency_penalty were set to 0 (and it also looked like the stop tokens were not recognized either). As soon as I set these parameters to null in the request body, it started working as expected! The model now delivers outputs that are exactly in line with what I see on Hugging Face's chat. I do wonder, though, why that helped. Is it because it forces the inference pipeline to skip the logit-penalty modifications completely?
Anyway, thanks again for the great insight!
Yes it is. And hf-chat sends that stop token currently.
Why does my local deployment of llama3-70b-instruct perform worse than Hugging Chat when answering the same questions? Hugging Chat can provide correct answers, but my locally deployed version using TGI doesn't work as well.
Could you please tell me the deployment command for hf-chat?
Sorry, I used the wrong interface. Previously, I used 'generate', but after switching to 'v1/chat/completions', it started working normally.
Would you be able to post your settings and example call? I am unable to get llama3 to stop no matter what I try.
add stop parameter, it works for me
data = { 'inputs': prompt, 'parameters' : { 'max_new_tokens': 1024, 'stop': ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "<|reserved_special_token"] } }
Great find, thanks for sharing this. This works for me when I include it in the extra_body dictionary when using the OpenAI chat completions API with a text-generation-inference endpoint.
I am hoping that Hugging Face could update their documentation, though; some documents seem out of date or out of sync with the OpenAPI spec. This parameter is documented in the OpenAPI spec here: https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/generate but it was tough to find before I came across this solution. The documentation that appears much more frequently when searching for a solution to this problem is https://huggingface.co/docs/api-inference/detailed_parameters#text-generation-task, which does not contain all of the parameters listed in the OpenAPI spec.
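For reference, this is roughly how that looks with the OpenAI Python client pointed at a TGI endpoint; the base URL and model name are placeholders, and the stop list mirrors the one suggested above:
# Sketch: pass TGI's "stop" parameter through the OpenAI client's extra_body.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")  # placeholder key for a local deployment without auth

resp = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "How tall is the Eiffel Tower?"}],
    max_tokens=512,
    extra_body={"stop": ["<|eot_id|>", "<|start_header_id|>", "<|end_header_id|>"]},
)
print(resp.choices[0].message.content)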
Just tested llama3-8b on 2.0.2; it looks like this issue has been fixed: https://github.com/huggingface/text-generation-inference/pull/1808
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I just made an issue a few days ago that relates to this, #1984. I'm still encountering issues with the tokenizer, despite trying combinations of llama3-8b and TGI versions >= 2.0.2. Any guidance on these Warning: Token '<|some_llama3-specific_token|>' was expected to have ID '<SOME-SIX-DIGIT-ID>' but was given ID 'None' warnings would be much appreciated.
Facing the same issue.
I have also been testing TGI versions >= 2.0.2 with meta-llama/Meta-Llama-3-8B-Instruct from Hugging Face's model hub and am running into similar issues. In particular, stop=["\n\n"] seems to work fine, but stop=["---"] doesn't. In the latter case, the response from TGI contains the string --- in multiple places, which wouldn't happen if it were treated as a stop sequence.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.