Mangled generation for string sequences containing `<space>'m` with Llama 3.1

Open tomjorquera opened this issue 10 months ago • 4 comments

System Info

We're running TGI with Llama 3.1 8B Instruct and observed garbled output when asking the LLM to generate strings containing the letter combination <space>'m (e.g. the string "for 'manual" used in the reproduction code).

When running client.text_generation with a prompt that leads the LLM to generate a string containing the sequence 'm, the result gets mangled, both in the token stream and in the generated_text attribute (tested with both the sync and async versions of InferenceClient).

Interestingly, the mangling differs between the two: the token stream "eats" the m character, while generated_text eats the leading space. This means the result assembled from the token stream differs from the one provided by generated_text (and both are incorrect).
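
For reference, here is a minimal sketch of the async check, assuming the same local TGI endpoint and prompt as the sync reproduction below and the streaming pattern of huggingface_hub's AsyncInferenceClient:

import asyncio

from huggingface_hub import AsyncInferenceClient


async def main():
    # Same endpoint and prompt as the sync reproduction below.
    client = AsyncInferenceClient("http://localhost:8080")
    prompt = "Repeat the following once and exactly once:\nnew result for 'manual'\n"

    tokens = []
    # With stream=True, awaiting text_generation yields an async iterator of chunks.
    async for answer in await client.text_generation(
        prompt, stream=True, details=True, max_new_tokens=6
    ):
        if not answer.token.special:
            tokens.append(answer.token.text)

    # Shows the same "'anual" mangling as the sync client.
    print("".join(tokens))


asyncio.run(main())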

I suspect the issue may be linked to special handling of I'm, as I could not reproduce it with other sequences 'x where x is a letter other than m (see the sweep sketch below).
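
A quick way to probe that hypothesis is to sweep the letter after the apostrophe and compare the streamed tokens against the final generated_text for each case. This is only an illustrative sketch against the same local endpoint; the prompt template is an adaptation of the reproduction below and the letter sweep is mine:

from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # same TGI endpoint as below

for letter in "abcdefghijklmnopqrstuvwxyz":
    prompt = f"Repeat the following once and exactly once:\nnew result for '{letter}anual'\n"
    tokens = []
    final = None
    for answer in client.text_generation(
        prompt, stream=True, details=True, max_new_tokens=6
    ):
        if not answer.token.special:
            tokens.append(answer.token.text)
        if answer.generated_text is not None:
            final = answer.generated_text  # only set on the last chunk
    streamed = "".join(tokens)
    if streamed != final:
        # In my runs only letter == "m" triggered the discrepancy.
        print(f"mismatch for '{letter}: stream={streamed!r} generated_text={final!r}")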

Information

  • [x] Docker
  • [ ] The CLI directly

Tasks

  • [ ] An officially supported command
  • [ ] My own modifications

Reproduction

Running the following:


from huggingface_hub import InferenceClient

# TGI run with ghcr.io/huggingface/text-generation-inference:3.0.1
# and arguments "--model-id meta-llama/Meta-Llama-3.1-8B-Instruct --revision d04e592bb4f6aa9cfee91e2e20afa771667e1d4b --hostname 0.0.0.0 --port 8080 --quantize bitsandbytes-nf4"
endpoint = "http://localhost:8080"

client = InferenceClient(endpoint)

prompt = """Repeat the following once and exactly once:
new result for 'manual'
"""

tokens = []
for answer in client.text_generation(
    prompt,
    stream=True,
    details=True,
    max_new_tokens=6, # to limit output, same behavior without this parameter
):
    print(answer)
    if not answer.token.special:
        tokens.append(answer.token.text)

print("".join(tokens))

prints the following output:

TextGenerationStreamOutput(index=1, token=TextGenerationStreamOutputToken(id=943, logprob=-2.9980469, special=False, text='new'), details=None, generated_text=None, top_tokens=None)
TextGenerationStreamOutput(index=2, token=TextGenerationStreamOutputToken(id=1121, logprob=-0.35864258, special=False, text=' result'), details=None, generated_text=None, top_tokens=None)
TextGenerationStreamOutput(index=3, token=TextGenerationStreamOutputToken(id=369, logprob=-0.10369873, special=False, text=' for'), details=None, generated_text=None, top_tokens=None)
TextGenerationStreamOutput(index=4, token=TextGenerationStreamOutputToken(id=364, logprob=-0.15783691, special=False, text=" '"), details=None, generated_text=None, top_tokens=None)
TextGenerationStreamOutput(index=5, token=TextGenerationStreamOutputToken(id=20310, logprob=-2.21875, special=False, text='anual'), details=None, generated_text=None, top_tokens=None)
TextGenerationStreamOutput(index=6, token=TextGenerationStreamOutputToken(id=1270, logprob=-0.85546875, special=False, text="'\n"), details=TextGenerationStreamOutputStreamDetails(finish_reason='length', generated_tokens=6, input_length=15, seed=None), generated_text="new result for'manual'\n", top_tokens=None)
new result for 'anual'

Note that print("".join(tokens)) gives the string "new result for 'anual'\n" (since the tokens at index 4 and 5 are respectively " '" and 'anual'), while generated_text at token index 6 instead contains "new result for'manual'\n".

So the two results are inconsistent, and each mangles the string in a different way (missing the m in one case, missing the space in the other).
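
Appending a consistency check along these lines to the reproduction script makes the problem explicit (a sketch; expected is simply the literal string from the prompt, and answer is the last chunk produced by the streaming loop):

expected = "new result for 'manual'\n"

streamed = "".join(tokens)       # actually "new result for 'anual'\n" (missing the m)
final = answer.generated_text    # actually "new result for'manual'\n" (missing the space)

assert streamed == final, "token stream and generated_text disagree"
assert streamed == expected, "streamed text is mangled"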

Expected behavior

Both the joined tokens and the generated_text attribute should yield the same value: "new result for 'manual'\n"

tomjorquera, Jan 20 '25 13:01