text-generation-inference

input tokens exceeded `max_input_tokens`

Open LanSnowZ opened this issue 1 year ago • 0 comments

System Info

Docker

Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.0
Commit sha: 169178b937d0c4173b0fdcd6bf10a858cfe4f428
Docker label: sha-169178b

nvidia-smi

Args { model_id: "/share/base_model/Mistral-Nemo-Instruct-2407-GPTQ", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some( Gptq, ), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: Some( 8192, ), max_input_length: None, max_total_tokens: Some( 10240, ), waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "545eaf4c39af", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], api_key: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4, lora_adapters: None, usage_stats: On, }

Information

  • [X] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

I launched a TGI server on an A100 GPU machine serving the Mistral-Nemo-Instruct-2407-GPTQ model. As shown in the config above, I set `max_input_tokens` to 8192 and `max_total_tokens` to 10240. But when I sent a message containing more than 8192 tokens, it was not truncated. The error info is shown below:

2024-10-11T11:27:58.527278Z ERROR chat_completions:async_stream:generate_stream: text_generation_router::infer: router/src/infer/mod.rs:105: `inputs` tokens + `max_new_tokens` must be <= 10240. Given: 9266 `inputs` tokens and 1000 `max_new_tokens`
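
For completeness, here is a minimal sketch of the kind of request I am sending. The host/port, model name, and prompt are placeholders, not my exact values:

```python
# Minimal sketch of the request that triggers the error above.
# Host/port, model name, and prompt are placeholders.
import requests

long_prompt = "some text " * 3000  # tokenizes to well over 8192 tokens

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": long_prompt}],
        "max_tokens": 1000,
        "stream": True,
    },
)
print(resp.status_code, resp.text)
# -> 422 with the "`inputs` tokens + `max_new_tokens` must be <= 10240" error
```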

My questions:

  1. Will TGI automatically truncate the user input according to `max_input_tokens`?
  2. Can I use some parameter to truncate the input length to less than `max_input_tokens`? (A sketch of what I mean follows below.)
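
To illustrate question 2: if I understand the docs correctly, the plain `/generate` endpoint accepts a `truncate` parameter that clips the input to the last N tokens. Is there an equivalent for `/v1/chat/completions`, or a server-side setting that truncates automatically? The endpoint/port below are placeholders and the `truncate` field is based on my reading of the `/generate` docs:

```python
# Sketch of the kind of parameter I mean: `truncate` on /generate clips the
# input to the last 8192 tokens, so 8192 + 1000 new tokens stays <= 10240.
# Not sure whether anything equivalent exists for /v1/chat/completions.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "a very long prompt ...",
        "parameters": {"truncate": 8192, "max_new_tokens": 1000},
    },
)
print(resp.status_code, resp.json())
```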

Thanks a lot for the help.

Expected behavior

Input tokens should be truncated to `max_input_tokens` (8192) before generation.

LanSnowZ · Oct 12 '24 03:10