Prefix caching causes two different responses to the same HTTP call (with seed set) depending on which machine makes the call
System Info
Docker image tag 2.3.1 running on an NVIDIA RTX 4090 on Ubuntu 20.04
2024-10-18T19:25:04.160854Z INFO text_generation_launcher: Args {
    model_id: "Qwen/Qwen2.5-Coder-1.5B",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: Some(
        Fp8,
    ),
    speculate: Some(
        6,
    ),
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: Some(
        9000,
    ),
    max_input_length: None,
    max_total_tokens: Some(
        9999,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: Some(
        10000,
    ),
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "3f2367249b02",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: None,
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: On,
}
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
I dumped the raw HTTP request from the server that normally makes the call, replayed it from my personal machine against the same TGI server, and got two different responses. I dumped the raw HTTP calls because, after validating the payload, my only remaining thought was headers, but no headers are included aside from Content-Type and Content-Length. The call is made with fasthttp in Go; a minimal sketch of the call is included after the dumps below. This particular example isn't the best since the two responses are close; normally I get garbage from the server's call and a good completion from the local machine's call. I tried rolling back to v2.2.0 to rule out prefix caching, but the Qwen model is not supported there. Is it possible to disable prefix caching to test?

Server
POST /generate HTTP/1.1
Host: <REDACTED>:8080
Content-Type: application/json
Content-Length: 789
{"inputs":"\u003c|file_sep|\u003ebot/tts_handler.py\n\u003c|fim_prefix|\u003eimport io\nimport logging\nfrom elevenlabs import generate, Voice, VoiceSettings, set_api_key\nfrom config import Config\n\nlogger= logging.getLogger(__name__)\n\nclass TTSHandler:\n def __init(self, config: Config):\n self.config = config\n self.voice = Voice(config.voice_id)\n self.voice_settings = VoiceSettings(config.voice_settings)\n \u003c|fim_suffix|\u003e\n \n\n\n\u003c|fim_middle|\u003e","parameters":{"do_sample":false,"max_new_tokens":1000,"return_full_text":false,"stop":["\u003c|file_sep|\u003e","\u003c|repo_name|\u003e","\u003c|fim_prefix|\u003e","\n"],"seed":69420,"temperature":0.3,"top_k":50,"top_p":0.8,"watermark":false,"details":true},"stream":false}
Response
{"generated_text":"set_api_key(config.elevenlabs_api_key1\n","details":{"finish_reason":"stop_sequence","generated_tokens":12,"seed":69420,"prefill":[],"tokens":[{"id":746,"text":"set","logprob":0.0,"special":false},{"id":11697,"text":"_api","logprob":0.0,"special":false},{"id":3097,"text":"_key","logprob":0.0,"special":false},{"id":8754,"text":"(config","logprob":0.0,"special":false},{"id":1734,"text":".e","logprob":0.0,"special":false},{"id":273,"text":"le","logprob":0.0,"special":false},{"id":1037,"text":"ven","logprob":0.0,"special":false},{"id":70271,"text":"labs","logprob":0.0,"special":false},{"id":11697,"text":"_api","logprob":0.0,"special":false},{"id":3097,"text":"_key","logprob":0.0,"special":false},{"id":16,"text":"1","logprob":0.0,"special":false},{"id":198,"text":"\n","logprob":0.0,"special":false}]}}
Local
POST /generate HTTP/1.1
Host: <REDACTED>:8080
Content-Type: application/json
Content-Length: 789
{"inputs":"\u003c|file_sep|\u003ebot/tts_handler.py\n\u003c|fim_prefix|\u003eimport io\nimport logging\nfrom elevenlabs import generate, Voice, VoiceSettings, set_api_key\nfrom config import Config\n\nlogger= logging.getLogger(__name__)\n\nclass TTSHandler:\n def __init(self, config: Config):\n self.config = config\n self.voice = Voice(config.voice_id)\n self.voice_settings = VoiceSettings(config.voice_settings)\n \u003c|fim_suffix|\u003e\n \n\n\n\u003c|fim_middle|\u003e","parameters":{"do_sample":false,"max_new_tokens":1000,"return_full_text":false,"stop":["\u003c|file_sep|\u003e","\u003c|repo_name|\u003e","\u003c|fim_prefix|\u003e","\n"],"seed":69420,"temperature":0.3,"top_k":50,"top_p":0.8,"watermark":false,"details":true},"stream":false}
Response
{"generated_text":"set_api_key(config.elevenlabs_api_key(config123456789012\\\n","details":{"finish_reason":"stop_sequence","generated_tokens":25,"seed":69420,"prefill":[],"tokens":[{"id":746,"text":"set","logprob":0.0,"special":false},{"id":11697,"text":"_api","logprob":0.0,"special":false},{"id":3097,"text":"_key","logprob":0.0,"special":false},{"id":8754,"text":"(config","logprob":0.0,"special":false},{"id":1734,"text":".e","logprob":0.0,"special":false},{"id":273,"text":"le","logprob":0.0,"special":false},{"id":1037,"text":"ven","logprob":0.0,"special":false},{"id":70271,"text":"labs","logprob":0.0,"special":false},{"id":11697,"text":"_api","logprob":0.0,"special":false},{"id":3097,"text":"_key","logprob":0.0,"special":false},{"id":8754,"text":"(config","logprob":0.0,"special":false},{"id":16,"text":"1","logprob":0.0,"special":false},{"id":17,"text":"2","logprob":0.0,"special":false},{"id":18,"text":"3","logprob":0.0,"special":false},{"id":19,"text":"4","logprob":0.0,"special":false},{"id":20,"text":"5","
logprob":0.0,"special":false},{"id":21,"text":"6","logprob":0.0,"special":false},{"id":22,"text":"7","logprob":0.0,"special":false},{"id":23,"text":"8","logprob":0.0,"special":false},{"id":24,"text":"9","logprob":0.0,"special":false},{"id":15,"text":"0","logprob":0.0,"special":false},{"id":16,"text":"1","logprob":-0.3125,"special":false},{"id":17,"text":"2","logprob":0.0,"special":false},{"id":59,"text":"\\","logprob":-2.078125,"special":false},{"id":198,"text":"\n","logprob":0.0,"special":false}]}}
Expected behavior
The same response for the same call to the same TGI server, regardless of the machine.
You should be able to disable prefix caching by starting the server with PREFIX_CACHING=0. That's how I got the llama 3.2 vision models to work.
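Something like this should do it with the Docker image (just a sketch mirroring the launcher args above; adjust the volume mount and port mapping for your setup):

docker run --gpus all --shm-size 1g -p 8080:80 \
  -e PREFIX_CACHING=0 \
  -v $PWD/tgi-data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.1 \
  --model-id Qwen/Qwen2.5-Coder-1.5B \
  --quantize fp8 \
  --speculate 6 \
  --max-input-tokens 9000 \
  --max-total-tokens 9999 \
  --max-batch-prefill-tokens 10000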
Sweet, I'll give that a try.
This unfortunately did not work for me on the Docker image.
Waiting on #2676 to validate whether this is a prefix caching issue, but I have confirmed with LOG_LEVEL=debug that the exact same parameters and input produce different results with the seed set.
After disabling prefix caching I seem to be getting the same response across different machines.