text-generation-inference
Request a new API endpoint to check and retrieve the token length for a given text/prompt
Feature request
Given that LLMs have a maximum token length, it would be useful to have an API that checks the token length of a prompt to see whether the LLM would accept it. This API could be called to get the token length of the prompt and, if it exceeds the limit, trim the prompt before invoking the generate API.
Suggested API Format
/token_length
input: {'inputs': string}
output: {'accept' : bool, 'token_length': integer}
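Purely to illustrate the suggested format, here is a hedged sketch of how a client might call such a route. The endpoint does not exist in text-generation-inference; the base URL, payload shape, and response keys are assumptions taken from the proposal above.

```python
import requests

# Hypothetical client call against the *proposed* /token_length route.
# Base URL and payload shape are placeholders from the suggested format.
TGI_URL = "http://localhost:8080"

def check_token_length(prompt: str) -> dict:
    """Ask the proposed /token_length route whether the prompt would fit."""
    resp = requests.post(f"{TGI_URL}/token_length", json={"inputs": prompt})
    resp.raise_for_status()
    return resp.json()  # e.g. {"accept": True, "token_length": 42}

result = check_token_length("Hello, how are you?")
if not result["accept"]:
    print(f"Prompt too long: {result['token_length']} tokens")
```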
Motivation
For chat-based LLM apps, this API would help identify how much of the chat history can be sent to the LLM. Ideally the entire chat is sent, but if it exceeds the max token length, the first few turns are dropped and the request is retried.
A token_length API would help figure this out.
Your contribution
I can submit a PR for this, but my expertise is in Python rather than Rust.
I'm not sure if this is a good idea.
First, you already have the truncate parameter to cover this use case even though it is not perfect as it will cut the prompt on the left at an arbitrary place.
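For reference, a minimal sketch of how the existing truncate parameter is passed to the /generate route. The URL, prompt, and values are placeholders and depend on your deployment.

```python
import requests

# Sketch of the existing server-side behaviour: the `truncate` parameter keeps
# only the last N input tokens. Placeholder URL for a running
# text-generation-inference deployment.
very_long_prompt = "user: earlier turn\nassistant: earlier reply\n" * 500

payload = {
    "inputs": very_long_prompt,
    "parameters": {
        "truncate": 1000,        # keep only the last 1000 input tokens
        "max_new_tokens": 128,
    },
}
resp = requests.post("http://localhost:8080/generate", json=payload)
resp.raise_for_status()
print(resp.json()["generated_text"])
```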
Second, if we add this route, how would you actually use it? The bool does not give you the right amount of information; you would need to retry until it is finally true. On the other hand, token_length is useless information, as you don't have the tokenizer on your side and still don't know where to cut.
The best option will always be to have a tokenizer on your side.
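For illustration, a minimal client-side sketch using the Hugging Face tokenizers library to count tokens and drop the oldest chat turns until the prompt fits. The tokenizer repo, token budget, and join format are assumptions, not part of text-generation-inference.

```python
from tokenizers import Tokenizer

# Client-side counting with the same tokenizer the deployed model uses.
# "gpt2" is a placeholder; swap in your model's tokenizer repo.
tokenizer = Tokenizer.from_pretrained("gpt2")
MAX_INPUT_TOKENS = 1024  # assumption: your deployment's input budget

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text).ids)

def trim_chat(turns: list[str]) -> list[str]:
    """Drop the oldest turns until the concatenated prompt fits the budget."""
    while turns and count_tokens("\n".join(turns)) > MAX_INPUT_TOKENS:
        turns = turns[1:]
    return turns

chat = ["system: be concise", "user: hi", "assistant: hello", "user: summarise the thread"]
prompt = "\n".join(trim_chat(chat))
```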
This API might be good for use cases like displaying the number of tokens in the input in a user interface, like the OpenAI Playground currently does, highlighting in red if the input exceeds the maximum.
No, sending an API request to check your token count is not something we want. This compute needs to happen client side.
If you're interested, you can compile the tokenizer down to WASM, which would make it usable on the web. (It's unstable because the regexp engine has to be different; this shouldn't affect Llama & co, which do not use that feature.)
https://github.com/huggingface/tokenizers/tree/main/tokenizers/examples/unstable_wasm
compile the tokenizer down to WASM
This looks interesting... I can experiment and check how it behaves with various tokenizers.
@OlivierDehaene
First, you already have the truncate parameter to cover this use case even though it is not perfect as it will cut the prompt on the left at an arbitrary place.
I am wondering if you have thought about some ways to improve this? This will be especially problematic if there are also starting special tokens such as <|prompting|> that will be automatically cut.
What about allowing the user to specify tokens that should never be truncated? Alternatively one could also consider adding additional settings for truncation side.
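As a client-side alternative to the suggestion above, here is a hedged sketch that left-truncates the prompt itself while always keeping a special prefix token. The tokenizer repo, token string, and budget are illustrative placeholders, not TGI behaviour.

```python
from tokenizers import Tokenizer

# Left-truncate the prompt client-side while preserving a special prefix
# token, instead of relying on server-side `truncate`. Illustrative values.
tokenizer = Tokenizer.from_pretrained("gpt2")
SPECIAL_PREFIX = "<|prompting|>"
BUDGET = 1024

def left_truncate_keep_prefix(body: str) -> str:
    prefix_ids = tokenizer.encode(SPECIAL_PREFIX).ids
    body_ids = tokenizer.encode(body).ids
    keep = BUDGET - len(prefix_ids)
    body_ids = body_ids[-keep:]            # cut on the left, like `truncate`
    return SPECIAL_PREFIX + tokenizer.decode(body_ids)
```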
Alternatively one could also consider adding additional settings for truncation side.
No. Server side truncation should be seen as a last resort. We will never offer enough flexibility on the server to cover all use-cases.
Makes sense. Do you think it would be reasonable to return the number of input and output tokens of the request in a separate field of the API response? This would allow downstream applications to better handle load and token counting of requests. Or would this also go too far in terms of the use-case focus of this repo?
This would allow downstream applications to better handle load and token counting of requests.
How?
The main use case is having services on top that can then implement rate limits per user and track token usage.
For example, using chat-ui on top, one could add these limits per user.
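A rough sketch of what such a gating layer might look like, assuming token counting happens client-side with a local tokenizer as suggested earlier in the thread; the tokenizer repo, quota, and bookkeeping are illustrative and not part of TGI or chat-ui.

```python
from collections import defaultdict
from tokenizers import Tokenizer

# Illustrative per-user token accounting in front of /generate.
# Tokenizer repo and limit are placeholders.
tokenizer = Tokenizer.from_pretrained("gpt2")
DAILY_TOKEN_LIMIT = 100_000
usage = defaultdict(int)  # user_id -> tokens consumed today

def admit(user_id: str, prompt: str) -> bool:
    """Count prompt tokens locally and enforce a per-user quota."""
    n = len(tokenizer.encode(prompt).ids)
    if usage[user_id] + n > DAILY_TOKEN_LIMIT:
        return False
    usage[user_id] += n
    return True
```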
I assume the tokenizer of the model used to count tokens is inside the deployed Docker container; is this correct?
What would be the recommended way to count tokens for a custom model to make sure the prompt does not exceed the max limit? An API endpoint could be useful for this.
Use the same tokenizer client-side, either with WASM or something else.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.