
`/tokenize` - Optionally Apply Chat Template before Tokenization

Open elsell opened this issue 2 years ago • 4 comments

Feature request

On the `/tokenize` endpoint of TGI, add an option to apply the chat template from the model's tokenizer, if one exists, before tokenizing.

Motivation

The /tokenize endpoint of TGI is very useful in situations where an application requires information about the tokenization of a string, but doesn't have direct access to a model/tokenizer that can be loaded with AutoTokenizer.

Specifically, I have instances where I need to know the token count of a prompt before sending it to /v1/chat/completions so that I can appropriately truncate the input to be <= max_input_tokens.

/tokenize, however, does not adequately serve this purpose when calling /v1/chat/completions, as the tokenization we get is on the prompt without the chat template applied.

Since the chat template may differ by model, there is no generic way via a TGI endpoint to get the token count of a prompt after a chat template has been applied, meaning that preventing inputs from exceeding max_input_tokens is very difficult.
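To make the motivation concrete, here is a minimal sketch of the client-side truncation described above, assuming `/tokenize` returns a JSON list of tokens carrying character offsets (`start`/`stop` fields) into the prompt; the host URL is illustrative:

```python
def truncate_to_max_tokens(prompt: str, tokens: list[dict], max_input_tokens: int) -> str:
    """Cut `prompt` so it spans at most `max_input_tokens` tokens.

    `tokens` is the JSON list returned by POST /tokenize, where each item
    carries the character span of the token in the original prompt.
    """
    if len(tokens) <= max_input_tokens:
        return prompt
    # Keep the character span covered by the first max_input_tokens tokens.
    cutoff = tokens[max_input_tokens - 1]["stop"]
    return prompt[:cutoff]

def tokenize(base_url: str, prompt: str) -> list[dict]:
    """Call TGI's /tokenize endpoint (base_url is illustrative)."""
    import requests
    resp = requests.post(f"{base_url}/tokenize", json={"inputs": prompt})
    resp.raise_for_status()
    return resp.json()
```

Note that this counts tokens of the raw prompt only, which is precisely the gap this request is about: the tokens added by the chat template are not included, so the count under-estimates what `/v1/chat/completions` will actually send to the model.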

Your contribution

[Caveat - I do not know Rust, nor am I familiar with the inner workings of TGI]

I have done my best to read through the existing /tokenize implementation, and will attempt to provide a high-level overview of what might need to be changed, and where.

Adding an optional boolean parameter apply_chat_template to the /tokenize endpoint would suffice for my purposes.

It appears that one could mirror the existing `return_full_text` boolean parameter of `GenerateParameters`.

Furthermore, I imagine the `/tokenize` [with chat template] implementation would be very similar to what's happening at the `/v1/chat/completions` endpoint:

`/v1/chat/completions` chat templating implementation
```rust
// apply chat template to flatten the request into a single input
let mut inputs = match infer.apply_chat_template(req.messages) {
    Ok(inputs) => inputs,
    Err(err) => {
        metrics::increment_counter!("tgi_request_failure", "err" => "validation");
        tracing::error!("{err}");
        return Err((
            StatusCode::UNPROCESSABLE_ENTITY,
            Json(ErrorResponse {
                error: err.to_string(),
                error_type: err.error_type().to_string(),
            }),
        ));
    }
};
```

Example API calls with proposed parameter:

Without Chat Template Applied (current behavior)

```shell
curl -X 'POST' \
  'http://my.tgi.host/tokenize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "My name is Olivier and I"
}'
```

With Chat Template Applied (proposed behavior)

```shell
curl -X 'POST' \
  'http://localhost:8083/tokenize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "My name is Olivier and I",
  "parameters": {
     "apply_chat_template": true
  }
}'
```

elsell avatar Apr 04 '24 15:04 elsell

Chat template is used through an openai compatibility layer, meaning the payloads do not look so simple.

Does OpenAI provide any means to do tokenization? If yes, we can try to mimic that; if not, we're not going to do it (there are pretty much infinite types of endpoints/payloads users could send; for now the lowest common denominator is sending raw text, and it works fine in most use cases).

It's also a way to leak the system prompts, which might not be something model authors actually want.

Narsil avatar Apr 10 '24 10:04 Narsil

@Narsil Thanks for looking into this.

For posterity, one of the major drivers behind this request is using TGI in an offline environment.

Because I am loading a model locally, the model_id key returned from TGI's /info endpoint has a local filepath instead of a HuggingFace repo name.

I'm trying to minimize how much my client application has to know about what model is being served from TGI, and currently the client has to know a HuggingFace repo to load a tokenizer.

So, for folks not constrained by an offline environment, I suppose the /info endpoint would provide sufficient information to the client that it would be able to dynamically load up a tokenizer and apply a chat template, making my request unnecessary.
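For completeness, that online workaround might look like the following sketch: read `model_id` from `/info`, load the tokenizer client-side, and count tokens after the chat template is applied. It requires `transformers` and network access to the Hub, which is exactly what the offline setup lacks; the URL and message shape are illustrative:

```python
def count_chat_tokens(tokenizer, messages: list[dict]) -> int:
    """Token count of `messages` after the tokenizer's chat template is applied."""
    ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    return len(ids)

def tokenizer_from_tgi_info(base_url: str):
    """Load the served model's tokenizer using the model_id exposed by /info.

    Only works when model_id is a Hub repo name, not a local filepath.
    """
    import requests
    from transformers import AutoTokenizer
    model_id = requests.get(f"{base_url}/info").json()["model_id"]
    return AutoTokenizer.from_pretrained(model_id)
```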

elsell avatar Apr 10 '24 13:04 elsell

+1 for the offline functionality @elsell is requesting.

ZQ-Dev8 avatar Apr 10 '24 16:04 ZQ-Dev8

What do you think about adding an endpoint like the one I'm proposing for vLLM?

The idea is to share the tokenizer via a `/get_tokenizer` endpoint if whoever runs the server enables it.

My goal is to be able to run lm-eval-harness as if it were a client. Specifically, I plan to add an option to `OpenaiCompletionsLM` that calls `get_tokenizer`, receives the JSON, and then instantiates the tokenizer locally.

@elsell This solution could probably work for your issue.
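A rough client-side sketch of that idea, hedged heavily since `/get_tokenizer` does not exist in TGI and its payload shape is an assumption here: the server would return the contents of its `tokenizer.json`, and the client would rebuild a fast tokenizer from it (`tokenizers.Tokenizer.from_str` accepts exactly that serialized form):

```python
import json

def tokenizer_json_from_payload(payload: dict) -> str:
    """Turn a hypothetical /get_tokenizer JSON body back into a tokenizer.json string.

    A tokenizer.json always has a "model" section; reject anything else early.
    """
    if "model" not in payload:
        raise ValueError("payload does not look like a tokenizer.json")
    return json.dumps(payload)

def load_remote_tokenizer(base_url: str):
    """Fetch the server's tokenizer and instantiate it locally (endpoint is hypothetical)."""
    import requests
    from tokenizers import Tokenizer  # pip install tokenizers
    payload = requests.get(f"{base_url}/get_tokenizer").json()
    return Tokenizer.from_str(tokenizer_json_from_payload(payload))
```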

AguirreNicolas avatar May 02 '24 09:05 AguirreNicolas

Hi folks, I was looking for a similar solution for this too.

I'm hosting TGI (Docker) internally to play around (Llama 3, model cloned locally). But to use the tokenizer's `apply_chat_template()` method, I had to initialize the tokenizer somewhere, so for now it's either adding a middle Python preprocessing layer in my pipeline, or wrapping TGI with a custom API that does pre- + inference + post-processing.

For my use case, it's best if the client side or components before TGI don't have to hold the tokenizer, so we don't need to add another layer just to call `apply_chat_template()`.

TL;DR: Request to add a param to apply the prompt template when TGI receives a request. Something like:

```python
# perform inference
import requests

headers = {
    "Content-Type": "application/json",
}

data = {
    "inputs": llm_prompt,
    "parameters": {
        "max_new_tokens": 2000,
        "temperature": 0.1,
        "apply_prompt_template": True,
    },
}
# host and endpoint illustrative
response = requests.post("http://my.tgi.host/generate", headers=headers, json=data)
```

cringelord000222 avatar May 24 '24 09:05 cringelord000222

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jun 24 '24 01:06 github-actions[bot]