Mangled generation for string sequences containing `<space>'m` with Llama 3.1
System Info
We're running TGI with Llama 3.1 8B Instruct and observed mangled output when asking the LLM to generate strings containing the character sequence <space>'m (e.g. the string "for 'manual" used in the reproduction code below).
When calling client.text_generation with a prompt that leads the LLM to generate a string containing the sequence 'm, the result gets mangled, both in the token stream and in the generated_text attribute (tested with both the sync and async versions of InferenceClient).
Interestingly, the mangling differs between the two: the token stream "eats" the m character, while generated_text eats the leading space. As a result, the string reconstructed from the token stream differs from the one reported in generated_text, and both are incorrect.
I suspect the issue may be linked to special handling of I'm, as I could not reproduce it with other sequences 'x where x is a letter other than m.
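To help narrow down whether this comes from the tokenizer itself or from TGI's detokenization, a check along these lines could show how the offending sequence splits into tokens and how each token decodes on its own (a rough sketch assuming the transformers tokenizer for this model; I haven't confirmed it isolates the bug):
from transformers import AutoTokenizer

# Assumption: this is the same tokenizer TGI loads for the model above
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

for s in [" 'manual", " 'banual"]:  # failing sequence vs. an arbitrary non-'m one
    ids = tok.encode(s, add_special_tokens=False)
    pieces = tok.convert_ids_to_tokens(ids)
    per_token = [tok.decode([i]) for i in ids]  # decode token by token, as a naive streaming client would
    print(s, ids, pieces, per_token, sep=" | ")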
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [ ] My own modifications
Reproduction
Running the following:
from huggingface_hub import InferenceClient
# TGI run with ghcr.io/huggingface/text-generation-inference:3.0.1
# and arguments "--model-id meta-llama/Meta-Llama-3.1-8B-Instruct --revision d04e592bb4f6aa9cfee91e2e20afa771667e1d4b --hostname 0.0.0.0 --port 8080 --quantize bitsandbytes-nf4"
endpoint = "http://localhost:8080"
client = InferenceClient(endpoint)
prompt = """Repeat the following once and exactly once:
new result for 'manual'
"""
tokens = []
for answer in client.text_generation(
    prompt,
    stream=True,
    details=True,
    max_new_tokens=6,  # to limit output; same behavior without this parameter
):
    print(answer)
    if not answer.token.special:
        tokens.append(answer.token.text)
print("".join(tokens))
Will print the following output:
TextGenerationStreamOutput(index=1, token=TextGenerationStreamOutputToken(id=943, logprob=-2.9980469, special=False, text='new'), details=None, generated_text=None, top_tokens=None)
TextGenerationStreamOutput(index=2, token=TextGenerationStreamOutputToken(id=1121, logprob=-0.35864258, special=False, text=' result'), details=None, generated_text=None, top_tokens=None)
TextGenerationStreamOutput(index=3, token=TextGenerationStreamOutputToken(id=369, logprob=-0.10369873, special=False, text=' for'), details=None, generated_text=None, top_tokens=None)
TextGenerationStreamOutput(index=4, token=TextGenerationStreamOutputToken(id=364, logprob=-0.15783691, special=False, text=" '"), details=None, generated_text=None, top_tokens=None)
TextGenerationStreamOutput(index=5, token=TextGenerationStreamOutputToken(id=20310, logprob=-2.21875, special=False, text='anual'), details=None, generated_text=None, top_tokens=None)
TextGenerationStreamOutput(index=6, token=TextGenerationStreamOutputToken(id=1270, logprob=-0.85546875, special=False, text="'\n"), details=TextGenerationStreamOutputStreamDetails(finish_reason='length', generated_tokens=6, input_length=15, seed=None), generated_text="new result for'manual'\n", top_tokens=None)
new result for 'anual'
Note that print("".join(tokens)) gives the string "new result for 'anual'\n" (since the tokens with index 4 and 5 are respectively " '" and 'anual'), but `generated_text` in the token with index 6 instead contains "new result for'manual'\n".
So the two results are inconsistent, and each mangles the string in a different way (missing the m in one case, missing the space in the other).
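For what it's worth, the non-streaming path could be cross-checked in the same way. I haven't verified whether it shows the same discrepancy, but something like this (same client and prompt as above) would compare the two representations returned in a single response:
out = client.text_generation(prompt, details=True, max_new_tokens=6)
joined = "".join(t.text for t in out.details.tokens if not t.special)
print(repr(out.generated_text))
print(repr(joined))  # should match generated_text, but may not given the bug above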
Expected behavior
Both the streamed tokens and the generated_text attribute should yield the same, unmangled value: "new result for 'manual'\n"
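In other words, with the reproduction above I'd expect a check like this to pass (tokens and answer being the variables left over from the streaming loop):
streamed = "".join(tokens)     # built from the streamed token texts
final = answer.generated_text  # set on the last chunk because details=True
assert streamed == final == "new result for 'manual'\n"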