
Adding some help for the options in `text-generation-benchmark`.

Open · Narsil opened this issue 2 years ago · 5 comments

What does this PR do?

Fixes https://github.com/huggingface/text-generation-inference/issues/420


Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [ ] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Narsil avatar Jun 15 '23 14:06 Narsil

@Narsil This documentation is helpful, but I think there are still some things that could be clarified.

  • Which parts of the transformer inference count as the pre-fill step versus the decode step?
  • Does sequence-length affect the length of the model's context window, or is the context window fixed at the default maximum for that model architecture?

Blair-Johnson avatar Jun 15 '23 15:06 Blair-Johnson

You send a request:

input_ids = [A, B, C, A]  # those are tokens
new_token_D, past = forward(input_ids)

That's a prefill step.

Then we continue generating new tokens in what we call a decode step:

input_ids = [new_token_D]
new_token_E, past = forward(input_ids, past)
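For context, here is a minimal sketch of how prefill and decode fit together in a full generation loop. It assumes a hypothetical forward(input_ids, past=None) that returns the next token and the updated KV cache; it is illustrative, not the actual text-generation-inference code:

# Hypothetical generation loop illustrating prefill vs. decode.
def generate(forward, prompt_ids, max_new_tokens):
    # Prefill: run the whole prompt once, building the KV cache ("past").
    next_token, past = forward(prompt_ids)
    generated = [next_token]
    # Decode: feed one new token at a time, reusing the cached keys/values.
    for _ in range(max_new_tokens - 1):
        next_token, past = forward([next_token], past)
        generated.append(next_token)
    return generated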

Does sequence-length affect the length of the model's context window, or is the context window fixed at the default maximum for that model architecture?

In this tool, it means the prompt size. text-generation-inference always uses the smallest possible window. Some models don't have a maximum window at all, and because flash models use flash attention, we never have padding.
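To illustrate the "no padding" point, here is a small sketch contrasting a classic padded batch with the padding-free layout that flash-attention-style kernels allow (purely illustrative, not the server's actual data layout):

# Three prompts of different lengths.
prompts = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Classic padded batch: every row padded up to the longest prompt.
max_len = max(len(p) for p in prompts)
padded = [p + [0] * (max_len - len(p)) for p in prompts]  # 3 x 4 slots, 3 wasted on padding

# Flash-style layout: concatenate all tokens and track cumulative lengths instead.
flat = [t for p in prompts for t in p]        # 9 tokens, no padding
cu_seqlens = [0]
for p in prompts:
    cu_seqlens.append(cu_seqlens[-1] + len(p))  # [0, 3, 5, 9]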

Narsil avatar Jun 16 '23 07:06 Narsil

Great! Thanks for addressing those questions. My only remaining question is less related to the benchmark, but I'm curious what exactly you mean by the smallest possible window?

I understand that flash or sparse attention models won't use padding, but if the user generates a very long sequence, say thousands of tokens, how many of those tokens will a model in text-generation-inference attend to?

Blair-Johnson avatar Jun 16 '23 18:06 Blair-Johnson

I understand that flash or sparse attention models won't use padding, but if the user generates a very long sequence, say thousands of tokens, how many of those tokens will a model in text-generation-inference attend to?

By default it will send everything and the model will attend to everything. This will crash if the model has a fixed-size context window. In that case you can use truncate to drop any extra tokens to the left of the query (this is brutal, but there's no good general way to make it cleverer).
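A minimal sketch of what left truncation means, assuming a hypothetical max_input_length limit (illustrative only, not the server's actual implementation):

def truncate_left(input_ids, max_input_length):
    # Keep only the last `max_input_length` tokens, dropping the oldest ones on the left.
    if len(input_ids) <= max_input_length:
        return input_ids
    return input_ids[-max_input_length:]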

but I'm curious what exactly you mean by the smallest possible window?

Let's say you have 3 queries:

  • Query 1: prompt of 100 tokens, generating at most 10 more
  • Query 2: prompt of 20 tokens, generating at most 100 more
  • Query 3: prompt of 100 tokens, generating at most 100 more

If we have room to pass all 3, we will count 110 + 120 + 200 = 430 tokens as our window. It's the smallest window that will fit if all queries hit their generation limit.
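The same arithmetic as a small sketch, using the numbers from the example above (the actual budgeting inside text-generation-inference is more involved than this):

queries = [
    {"prompt_tokens": 100, "max_new_tokens": 10},   # Query 1 -> up to 110 tokens
    {"prompt_tokens": 20,  "max_new_tokens": 100},  # Query 2 -> up to 120 tokens
    {"prompt_tokens": 100, "max_new_tokens": 100},  # Query 3 -> up to 200 tokens
]
# Worst-case window if every query hits its generation limit.
window = sum(q["prompt_tokens"] + q["max_new_tokens"] for q in queries)
print(window)  # 430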

Narsil avatar Jun 19 '23 09:06 Narsil

@Narsil Awesome, thanks for the explanation!

Blair-Johnson avatar Jun 20 '23 17:06 Blair-Johnson

@OlivierDehaene Can we merge this?

Narsil avatar Jul 04 '23 09:07 Narsil