
Request a new api endpoint to check and retrieve token length for given text/prompt

Open Perseus14 opened this issue 1 year ago • 13 comments

Feature request

Given that LLM models have a maximum token length, it would be useful to have an API that checks the token length of a prompt to determine whether the LLM would accept it. This API could be invoked to get the token length of the prompt and, if it exceeds the limit, modify the prompt before invoking the generate API.

Suggested API Format

/token_length

input: {'inputs' : }

output: {'accept' : bool, 'token_length': integer}
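
For illustration, a rough sketch of how a client might call such an endpoint if it existed (the host, port, and route are assumptions; this endpoint is not part of TGI today):

```python
import requests

TGI_URL = "http://localhost:8080"  # assumed local TGI deployment

def check_token_length(prompt: str) -> dict:
    # Hypothetical route: POST /token_length with the same 'inputs' field as /generate
    resp = requests.post(f"{TGI_URL}/token_length", json={"inputs": prompt})
    resp.raise_for_status()
    # Proposed response shape: {'accept': bool, 'token_length': int}
    return resp.json()

result = check_token_length("Hello, how are you?")
if not result["accept"]:
    print(f"Prompt is too long: {result['token_length']} tokens")
```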

Motivation

For chat-based LLM apps, having this API would help in deciding how much of the chat history to send to the LLM. Ideally the entire chat would be sent, but if it exceeds the max token length, the first few turns of the conversation can be dropped and the request sent again.

Having a token_length API would help figure this out.
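
As a sketch of the trimming logic described above, using a client-side Hugging Face tokenizer (the model name and token limit below are placeholders):

```python
from transformers import AutoTokenizer

# Placeholder model; use the tokenizer of whatever model the TGI server is running.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
MAX_INPUT_TOKENS = 4096  # placeholder limit

def fit_chat_to_budget(turns: list[str]) -> list[str]:
    """Drop the oldest turns until the concatenated chat fits the token budget."""
    while turns:
        prompt = "\n".join(turns)
        if len(tokenizer(prompt)["input_ids"]) <= MAX_INPUT_TOKENS:
            return turns
        turns = turns[1:]  # chop off the earliest turn
    return turns
```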

Your contribution

I can submit a PR for this, but my expertise is in Python rather than Rust.

Perseus14 avatar Jun 09 '23 07:06 Perseus14

I'm not sure if this is a good idea.

First, you already have the truncate parameter to cover this use case even though it is not perfect as it will cut the prompt on the left at an arbitrary place.
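
For reference, a rough sketch of passing truncate in a /generate request (the URL and values are assumptions about a local deployment):

```python
import requests

very_long_prompt = "some long chat history " * 500  # stand-in for an oversized prompt

# Assumed local TGI deployment; `truncate` keeps only the last N input tokens.
payload = {
    "inputs": very_long_prompt,
    "parameters": {"truncate": 1000, "max_new_tokens": 64},
}
resp = requests.post("http://localhost:8080/generate", json=payload)
resp.raise_for_status()
print(resp.json()["generated_text"])
```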

Second, if we add this route, how would you actually use it? The bool does not give you enough information; you would need to retry until it finally becomes true. On the other hand, token_length is useless information, as you don't have the tokenizer on your side and still don't know where to cut.

The best option will always be to have a tokenizer on your side.
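
For example, a minimal client-side token count with the tokenizers library (the model id is a placeholder for whatever the server is running):

```python
from tokenizers import Tokenizer

# Placeholder model id; load the same tokenizer as the deployed model.
tokenizer = Tokenizer.from_pretrained("bigscience/bloom-560m")

prompt = "How many tokens is this prompt?"
n_tokens = len(tokenizer.encode(prompt).ids)
print(f"{n_tokens} tokens")  # decide locally whether/where to cut before calling /generate
```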

OlivierDehaene avatar Jun 09 '23 08:06 OlivierDehaene

This API might be useful for use cases like displaying the number of tokens in the input in a user interface, similar to how the OpenAI Playground currently does, highlighting in red when the input exceeds the maximum.

(Screenshots: OpenAI Playground token counter in the user interface)

gsaivinay avatar Jun 12 '23 11:06 gsaivinay

No, sending an API request to check your token count is not something we want. This compute needs to happen client side.

OlivierDehaene avatar Jun 12 '23 12:06 OlivierDehaene

If you're interested, you can compile the tokenizer down to WASM, which would make it usable on the web. (It's unstable because the regexp engine has to be different; this shouldn't affect Llama & co, which do not use that feature.)

https://github.com/huggingface/tokenizers/tree/main/tokenizers/examples/unstable_wasm

Narsil avatar Jun 12 '23 12:06 Narsil

compile the tokenizer down to WASM

This looks interesting... I can experiment and check how it behaves with various tokenizers.

gsaivinay avatar Jun 12 '23 12:06 gsaivinay

@OlivierDehaene

First, you already have the truncate parameter to cover this use case even though it is not perfect as it will cut the prompt on the left at an arbitrary place.

I am wondering if you have thought about ways to improve this? It will be especially problematic if there are also starting special tokens such as <|prompting|> that get cut automatically.

What about allowing users to specify tokens that should never be truncated? Alternatively one could also consider adding additional settings for truncation side.

psinger avatar Jun 15 '23 08:06 psinger

Alternatively one could also consider adding additional settings for truncation side.

No. Server side truncation should be seen as a last resort. We will never offer enough flexibility on the server to cover all use-cases.
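
For completeness, a client-side sketch of left truncation that keeps a special prefix such as <|prompting|> intact (the model name, prefix, and budget are all placeholders, and re-tokenization on the server may differ by a token or two at the boundary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-org/your-model")  # placeholder
PREFIX = "<|prompting|>"   # example special token that must survive truncation
MAX_INPUT_TOKENS = 2048    # example budget

def truncate_keep_prefix(body: str) -> str:
    """Left-truncate the prompt body, then re-attach the special prefix."""
    body_ids = tokenizer(body, add_special_tokens=False)["input_ids"]
    prefix_ids = tokenizer(PREFIX, add_special_tokens=False)["input_ids"]
    budget = MAX_INPUT_TOKENS - len(prefix_ids)
    kept = body_ids[-budget:]  # keep the most recent tokens
    return PREFIX + tokenizer.decode(kept)
```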

OlivierDehaene avatar Jun 15 '23 09:06 OlivierDehaene

Makes sense. Do you think it would be reasonable to return the number of input and output tokens of the request in a separate field of the API response? This would allow downstream applications to better handle load and token counting of requests. Or would this also go too far in terms of the use-case focus of this repo?

psinger avatar Jun 27 '23 16:06 psinger

This would allow downstream applications to better handle load and token counting of requests.

How?

OlivierDehaene avatar Jun 28 '23 09:06 OlivierDehaene

The main use case is having services on top that can then implement rate limits per user and track token usage.

For example, using chat-ui on top, one could add these limits per user.
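
As a rough illustration of that kind of gateway logic (the model, quota, and function here are hypothetical; counting is done with a client-side tokenizer as suggested earlier in the thread):

```python
from collections import defaultdict

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bigscience/bloom-560m")  # placeholder model
TOKENS_PER_DAY = 100_000                                        # hypothetical quota

usage = defaultdict(int)  # user id -> tokens consumed today

def charge_and_check(user_id: str, prompt: str, generated_text: str) -> bool:
    """Add this request's tokens to the user's tally; return False once over quota."""
    spent = len(tokenizer.encode(prompt).ids) + len(tokenizer.encode(generated_text).ids)
    usage[user_id] += spent
    return usage[user_id] <= TOKENS_PER_DAY
```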

psinger avatar Jun 28 '23 09:06 psinger

I assume the tokenizer of the model, used to count tokens, is inside the deployed Docker container. Is this correct?

What would be the recommended way to count tokens for a custom model to make sure the prompt does not exceed the max limit? An API endpoint could be useful for this.

BEpresent avatar Jul 25 '23 14:07 BEpresent

Use the same tokenizer client side, either with WASM or something else.

OlivierDehaene avatar Jul 25 '23 17:07 OlivierDehaene

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jul 30 '24 01:07 github-actions[bot]