text-generation-inference
Output truncated when max_tokens is None
System Info
docker version: sha-0b95693
Model being used: /v1/chat/completions
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
{ "messages": [ { "role": "user", "content": "Who are you?" } ], "model": "", "seed": 42, "max_tokens": null, "temperature": null, "stream": false }
Expected behavior
If I interpret it correctly, max_tokens=None should automatically calculate the maximum output length, yet sometimes the outputs are truncated with "finish_reason": "length". Is that right?
Hi @paulcx 👋 thanks for opening the issue!
I'm not 100% sure I understand your question, but if it's "can the output be truncated even if max_tokens is left as null?" then the answer is yes. It also depends a bit on how you set up your TGI server, but I'll point you to the code path where the validation of max_tokens is done. Hopefully this helps you understand what's going on:
https://github.com/huggingface/text-generation-inference/blob/38773453ae0d29fba3dc79a38d589ebdc5451093/router/src/validation.rs#L137-L194
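For illustration, here is a minimal sketch of what that validation path roughly does (hypothetical names and simplified logic, not the exact TGI code): when no value reaches the validator, it falls back to the remaining token budget for the request.

```rust
// Simplified sketch of the max_new_tokens handling in router/src/validation.rs
// (names are illustrative; see the linked lines for the real implementation).
fn resolve_max_new_tokens(
    requested: Option<u32>,  // value that reaches the validator, possibly None
    input_length: u32,       // tokenized prompt length
    max_total_tokens: u32,   // per-request budget configured on the server
) -> Result<u32, String> {
    let max_new_tokens = match requested {
        Some(n) => n,
        // No explicit value: use whatever budget the prompt leaves over.
        None => max_total_tokens.saturating_sub(input_length),
    };
    if input_length + max_new_tokens > max_total_tokens {
        return Err(format!(
            "`inputs` tokens + `max_new_tokens` must be <= {max_total_tokens}. \
             Given: {input_length} `inputs` tokens and {max_new_tokens} `max_new_tokens`"
        ));
    }
    Ok(max_new_tokens)
}
```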
Hi @ErikKaum, your understanding is correct.
I checked the code above but I'm still perplexed. My settings (max_new_tokens 8000, no truncate, max_tokens null) appear correct, so how could this lead to truncation?
Sorry, I led you astray there for a bit. This is definitely the culprit: https://github.com/huggingface/text-generation-inference/blob/6cb42f49ae47a117e8f1bdfcdb5cbe42332dc360/router/src/server.rs#L741
I'll have to double-check that changing this to None doesn't break something, but it very much seems to be the thing causing your issue.
Okay, so there's actually a good reason for it defaulting to 100. The reason is that max_new_tokens reserves memory during scheduling, so we have to pick some default, and in our case we pick a (pessimistically) low number. This avoids being wasteful with memory and ensures all requests have room once they are scheduled.
But I agree it can be a bit confusing to send in "max_tokens": null and still get truncated. The reason it happens is that the null triggers the default.
But does this make sense to you? 😅
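To make the defaulting concrete, here is a hedged sketch of what effectively happens on the chat-completions route (hypothetical names; the linked server.rs line is the authoritative source):

```rust
// Sketch of the chat-completions defaulting in router/src/server.rs:
// a null max_tokens becomes 100 *before* validation, so the
// "remaining budget" branch in the validator never runs.
fn chat_completions_max_new_tokens(max_tokens: Option<u32>) -> u32 {
    max_tokens.unwrap_or(100)
}

fn main() {
    // "max_tokens": null in the request body arrives here as None:
    assert_eq!(chat_completions_max_new_tokens(None), 100);
}
```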
If I set max_new_tokens=null, but it doesn't trigger the default value, what should I do?
Sorry I was a bit unclear. If you set it to null, it will trigger the default value 👍
However, in some cases, if I set max_new_tokens to null, the server can automatically calculate max_new_tokens = max-total-tokens - actual input tokens. Correct me if I'm wrong.
Ah no, sorry, that shouldn't be the case. That was me leading you astray; I also had the wrong understanding of this earlier.
Is there a way to prevent the output from being truncated (i.e. set max_tokens to infinity)? At the very least, it'd be helpful to update the documentation to mention that max_tokens defaults to 100 if it's not set. That is a surprising default to me.
For me, max_new_tokens = max-total-tokens - actual input tokens is a very easy calculation to understand, right?
I agree @elliottlawrence that this should be documented better.
Also to clarify, setting it to inf won't work in practice, since the GPU doesn't have infinite memory. The reason for requiring a defined max_tokens comes from the fact that the memory is allocated in advance.
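To make the memory argument concrete, here is a back-of-envelope sketch (the figures are illustrative assumptions; the actual cost depends on model, dtype, and attention implementation) of why every reserved token costs real GPU memory:

```rust
// Back-of-envelope KV-cache cost per reserved token (illustrative only).
// Each scheduled token stores one key and one value vector per layer:
// 2 * num_layers * hidden_size elements.
fn kv_cache_bytes_per_token(num_layers: u64, hidden_size: u64, bytes_per_elem: u64) -> u64 {
    2 * num_layers * hidden_size * bytes_per_elem
}

fn main() {
    // Rough 7B-class figures: 32 layers, hidden size 4096, fp16 (2 bytes).
    let per_token = kv_cache_bytes_per_token(32, 4096, 2);
    // ~512 KiB per reserved token, per in-flight request, so the scheduler
    // cannot simply reserve an "infinite" max_tokens.
    println!("~{} KiB of KV cache per reserved token", per_token / 1024);
}
```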
I don't understand why max_new_tokens works in /generate but max_tokens does not work in v1/chat/completions. Don't they share the same logic?
Sorry @paulcx I don't follow 100%.
In what way do you mean "works"? As in the output is not truncated in /generate but is truncated in v1/chat/completions? If you happen to have an example where they behave differently, that would help 👍
For "/generate," setting "max_new_tokens = None" calculates the output length as max_new_tokens = max-total-tokens - actual input tokens, ensuring the entire output is returned without truncation. In contrast, for "v1/chat/completions," using "max_tokens = None" results in the output being truncated to the default max_tokens (approximately 100?).
@ErikKaum Here's an example showing the input and output of the two APIs with the same prompt and the same max_tokens/max_new_tokens argument. The first output is truncated (completion_tokens is 100), while the second works perfectly (more than 150 tokens).
input of v1/chat/completions
{
"model": "Qwen2.5-7B-Instruct",
"messages": [
{
"role": "user",
"content": "output a random text over 150 tokens"
}
],
"max_tokens": null
}
output of v1/chat/completions
{
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "In the heart of a bustling city, where skyscrapers reach for the clouds and the hum of activity is a constant melody, there lies an almost forgotten place: an old park that seems wrapped in a different era's charm. This park, with its creaky wooden benches and trees that seemed to have aged along with the city itself, is a sanctuary for those few who still know its corners.\n\nEarly in the morning, the park is bathed in the"
},
"logprobs": null,
"finish_reason": "length"
}
],
"usage": {
"prompt_tokens": 19,
"completion_tokens": 100,
"total_tokens": 119
}
}
input of /generate
{
"inputs": "<|im_start|>user\noutput a random text over 150 tokens<|im_end|>\n<|im_start|>assistant\n",
"parameters": {
"max_new_tokens": null
}
}
output of /generate
{
"generated_text": "In the heart of a bustling city, where the hum of traffic and the chatter of pedestrians create a constant symphony, there exists a small, unassuming bookstore that seems to defy the passage of time. The store, with its wooden shelves lined with well-worn books, exudes an air of timeless wisdom. The aroma of aged paper and the faint scent of coffee from the small café tucked into one corner combine to create an inviting atmosphere.\n\nThe owner, an enigmatic figure known simply as Mr. E, is a man of few words but a wealth of stories. He has a knack for knowing exactly which book a customer needs, even if they don't know it themselves. It's said that Mr. E has a sixth sense for matching people with the right book at the right moment, a talent that has earned him a loyal following.\n\nOne rainy afternoon, a young woman named Clara wandered into the bookstore, seeking shelter from the downpour. She was a writer, struggling with writer's block, and the city's noise had become a cacophony that stifled her creativity. As she entered the store, the bell above the door chimed softly, a sound that seemed to hush the world outside.\n\nMr. E, behind the counter, was arranging a display of first editions. He looked up and smiled, a gesture that was both warm and mysterious. \"Looking for something in particular, or perhaps just a bit of escape?\" he asked, his voice calm and soothing.\n\nClara hesitated, feeling the weight of her block pressing down on her. \"I'm not sure,\" she admitted. \"I just need... inspiration, I think.\"\n\nMr. E nodded and gestured towards the back of the store. \"Follow me,\" he said, leading her through the maze of shelves. He pulled out a book with a worn cover, its spine cracked from years of use. \"This one,\" he said, \"might just be the key to unlocking your creativity.\"\n\nClara took the book, feeling a strange sense of destiny in its weight. The title, \"Echoes of the Unseen,\" intrigued her. As she settled into a cozy armchair by the window, the rain tapping rhythmically against the glass, she began to read.\n\nHours passed like minutes, and when Clara finally looked up, the rain had stopped, and the sky was beginning to darken. The words in the book had woven a tapestry of ideas in her mind, and she felt a surge of inspiration she hadn't experienced in months.\n\nShe returned the book to Mr. E, who simply nodded as if he had expected this outcome. \"Thank you,\" she said, her voice filled with gratitude. \"I think I've found what I needed.\"\n\nMr. E simply smiled and watched her leave, his eyes twinkling with the knowledge of the stories yet to be written. The bookstore, with its quiet magic, had once again played its part in the grand narrative of lives interwoven with words.\n\nAs the city lights began to flicker on, the bookstore remained a sanctuary, a place where stories are not just read but lived, and where the pages of a book can change the course of a life."
}
In addition, I tried vLLM's v1/chat/completions API with the same input. There, the output is not truncated either.
{
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "In the realm of ancient forests and mystic meadows, there lay a land untouched by the hands of time. This was a land of tall, whispering trees that seemed to guard secrets older than the stars. The air was thick with the scent of pine and wildflowers, and a gentle breeze carried the melodies of unseen birds, weaving through the dense foliage. \n\nAt the heart of this serene land stood an old, ancient stone bridge, its arches worn smooth by countless years of rain and the footsteps of various creatures. It was said that this bridge connected the world of the living with the realm of spirits, a place where the living and the dead could meet under the light of the full moon.\n\nOne such evening, as the moon painted the sky a silvery hue, a young girl named Elara stepped onto the bridge, clutching an old, tattered book that had been passed down through generations in her family. The book, full of forgotten spells and ancient wisdom, was said to hold the key to understanding the world beyond. Her heart raced with both fear and excitement, for she was about to embark on a journey that no one in her family had dared to attempt.\n\nElara's village, nestled at the edge of the forest, had long warned of the dangers that lurked in the woods and the mysteries that lay beyond the stone bridge. They spoke of enchanted creatures, of lakes that could grant wishes to those pure of heart, and of hills that whispered the secrets of the universe. But Elara was different; she was curious, brave, and had always been drawn to the unknown.\n\nAs she crossed the bridge, her footsteps echoed softly, and the air seemed to grow colder, the moonlight casting eerie shadows that danced around her. Suddenly, she felt a presence, not threatening, but almost... welcoming. She turned her head, and there, on the edge of her vision, was a figure cloaked in white, its face obscured by the hood of its robe.\n\n\"Elara, daughter of the wind and the earth,\" the figure spoke, its voice as gentle as the wind through the trees. \"You seek the wisdom of the ancients. What is it that you seek to learn?\"\n\nElara took a deep breath and replied, \"I seek to understand the balance between the living and the dead, to bridge the gap between the known and the unknown. I wish to use the knowledge in this book to help my village and to protect the land we cherish.\"\n\nThe figure nodded, and with a wave of its hand, the surrounding air shimmered, revealing a path that had not been visible before. \"Walk this path, Elara, and your journey will lead you to the ancient grove where the wisdom of the ages is kept. Remember, the power you seek is not given lightly.\"\n\nWith a newfound determination, Elara set off on the path, each step taking her deeper into the mysteries of the land. The journey was not without its challenges; she encountered creatures both beautiful and fearsome, riddles that tested her wit, and trials that challenged her courage. But with the wisdom of the old book and the strength of her resolve, she overcame each obstacle.\n\nFinally, she reached the ancient grove, a place where time seemed to stand still. The trees were taller than any she had seen, their leaves shimmering with a faint, otherworldly light. In the center of the grove stood a stone circle, and within it, an ancient oak with a hollow at its base, filled with glowing runes.\n\nElara approached the oak, her heart pounding. 
With a deep breath, she recited the incantation from the book, and as she spoke, the runes began to glow brighter, and the air thrummed with ancient magic. The wisdom she sought began to flow into her, filling her with understanding and power beyond her wildest dreams.\n\nAs the dawn's light broke over the horizon, Elara returned to her village, not just as the same young girl who had left that evening, but as a keeper of ancient wisdom, ready to use her newfound knowledge to protect and nurture the land she loved. The villagers looked at her with awe, for they saw the change in her eyes, the light of the ancient grove that now shone within her. And so, the legacy of Elara, the bridge between worlds, was born, her name whispered in stories for generations to come.",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 18,
"total_tokens": 919,
"completion_tokens": 901
},
"prompt_logprobs": null
}
Okay, gotcha! Thanks for elaborating on this 👍
The difference between v1/chat/completions and /generate is indeed a bit off.
I'll ping @drbh I think he might know better!
Hi @ErikKaum @drbh , any idea or plan on this issue?
Hi, I think this was a start, but there seems to have been a change of direction @drbh?
Any fix in docker image 2.4.0?
@ErikKaum A problem that arises from the default value of 100 is that when the input length is close to max_total_tokens, the user encounters this validation error even though they never set max_new_tokens:
ERROR chat_completions:generate:generate_stream: text_generation_router::infer: router/src/infer/mod.rs:114: `inputs` tokens + `max_new_tokens` must be <= 4096. Given: 4003 `inputs` tokens and 100 `max_new_tokens`
Couldn't we have max_new_tokens = min(max_total_tokens.saturating_sub(input_length), 100)?
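For concreteness, a sketch of that suggestion (a hypothetical helper, not a tested patch): clamp the default so it never exceeds the tokens the prompt leaves over.

```rust
// Sketch of the proposed default (untested): cap the fallback of 100 by the
// remaining budget, so near-full prompts don't fail validation when the user
// never set max_new_tokens.
fn default_max_new_tokens(input_length: u32, max_total_tokens: u32) -> u32 {
    std::cmp::min(max_total_tokens.saturating_sub(input_length), 100)
}

fn main() {
    // 4003 input tokens against a 4096 budget: 93 instead of a validation error.
    assert_eq!(default_max_new_tokens(4003, 4096), 93);
    // Plenty of room left: keep the usual default of 100.
    assert_eq!(default_max_new_tokens(19, 4096), 100);
}
```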