GenAIExamples
tgi-gaudi fails health check in ChatQnA
Using the v0.8 version of the ChatQnA example, the tgi-gaudi service fails its health check.
Environment:
- OS: Ubuntu 22.04
- Docker CE: 27.0.3
- Gaudi SW driver: 1.16.1-c48c5b4
Steps to reproduce:
- run docker compose -f compose.yaml up -d by following the README
- run the following to test tgi-gaudi health (a polling variant is sketched after the logs below):
$ curl -v http://localhost:8008/health
* Trying 127.0.0.1:8008...
* Connected to localhost (127.0.0.1) port 8008 (#0)
> GET /health HTTP/1.1
> Host: localhost:8008
> User-Agent: curl/7.81.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 503 Service Unavailable
< content-type: application/json
< content-length: 48
< access-control-allow-origin: *
< vary: origin
< vary: access-control-request-method
< vary: access-control-request-headers
< date: Tue, 30 Jul 2024 01:34:08 GMT
<
* Connection #0 to host localhost left intact
{"error":"unhealthy","error_type":"healthcheck"}
- check the tgi-gaudi container logs, which show the following:
$ sudo docker compose -f docker_compose.tgi-gaudi logs tgi-service
WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "http_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "https_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] /scratch-4/lianhao/docker_compose.tgi-gaudi: `version` is obsolete
tgi-gaudi-server | 2024-07-30T01:30:55.958772Z INFO text_generation_launcher: Args { model_id: "Intel/neural-chat-7b-v3-3", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(1024), max_total_tokens: Some(2048), waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "c36267e1761e", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }
tgi-gaudi-server | 2024-07-30T01:30:55.958880Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
tgi-gaudi-server | 2024-07-30T01:30:56.122105Z INFO text_generation_launcher: Model supports up to 32768 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=32818 --max-total-tokens=32768 --max-input-tokens=32767`.
tgi-gaudi-server | 2024-07-30T01:30:56.122117Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 1074
tgi-gaudi-server | 2024-07-30T01:30:56.122120Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
tgi-gaudi-server | 2024-07-30T01:30:56.122198Z INFO download: text_generation_launcher: Starting download process.
tgi-gaudi-server | 2024-07-30T01:30:58.612197Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
tgi-gaudi-server |
tgi-gaudi-server | 2024-07-30T01:30:59.024916Z INFO download: text_generation_launcher: Successfully downloaded weights.
tgi-gaudi-server | 2024-07-30T01:30:59.025177Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
tgi-gaudi-server | 2024-07-30T01:31:03.229349Z INFO text_generation_launcher: CLI SHARDED = False DTYPE = bfloat16
tgi-gaudi-server |
tgi-gaudi-server | 2024-07-30T01:31:09.033606Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
tgi-gaudi-server | 2024-07-30T01:31:12.336243Z INFO shard-manager: text_generation_launcher: Shard ready in 13.310250704s rank=0
tgi-gaudi-server | 2024-07-30T01:31:14.337458Z INFO text_generation_launcher: Starting Webserver
tgi-gaudi-server | 2024-07-30T01:31:14.353426Z INFO text_generation_router: router/src/main.rs:207: Using the Hugging Face API
tgi-gaudi-server | 2024-07-30T01:31:14.353471Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
tgi-gaudi-server | 2024-07-30T01:31:14.562604Z INFO text_generation_router: router/src/main.rs:500: Serving revision bdd31cf498d13782cc7497cba5896996ce429f91 of model Intel/neural-chat-7b-v3-3
tgi-gaudi-server | 2024-07-30T01:31:14.562627Z INFO text_generation_router: router/src/main.rs:282: Using config Some(Mistral)
tgi-gaudi-server | 2024-07-30T01:31:14.562638Z INFO text_generation_router: router/src/main.rs:294: Using the Hugging Face API to retrieve tokenizer config
tgi-gaudi-server | 2024-07-30T01:31:14.566701Z INFO text_generation_router: router/src/main.rs:343: Warming up model
tgi-gaudi-server | 2024-07-30T01:31:14.566722Z WARN text_generation_router: router/src/main.rs:358: Model does not support automatic max batch total tokens
tgi-gaudi-server | 2024-07-30T01:31:14.566724Z INFO text_generation_router: router/src/main.rs:380: Setting max batch total tokens to 16000
tgi-gaudi-server | 2024-07-30T01:31:14.566726Z INFO text_generation_router: router/src/main.rs:381: Connected
tgi-gaudi-server | 2024-07-30T01:31:14.566728Z WARN text_generation_router: router/src/main.rs:395: Invalid hostname, defaulting to 0.0.0.0
tgi-gaudi-server | 2024-07-30T01:34:08.253441Z ERROR health:prefill{id=18446744073709551615 size=1}:prefill{id=18446744073709551615 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: PAD_SEQUENCE_TO_MULTIPLE_OF cannot be higher than max_input_length
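For reference, TGI also answers 503 on /health while the model is still warming up, so the health-test step above can be run as a short polling loop to confirm this is a persistent failure rather than a warmup race; in this run the endpoint keeps returning 503 long after the shard reports ready. This is only a sketch and assumes the default 8008 host port mapping from compose.yaml:

# poll the tgi-gaudi health endpoint for up to ~5 minutes
# (8008 is assumed to be the host port mapped in compose.yaml)
for i in $(seq 1 60); do
  code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8008/health)
  [ "$code" = "200" ] && echo "tgi-gaudi healthy after $i attempts" && break
  echo "attempt $i: HTTP $code, retrying in 5s"
  sleep 5
done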
The Xeon version doesn't have this issue.
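The last log line suggests the health prefill trips over the constraint that PAD_SEQUENCE_TO_MULTIPLE_OF must not exceed max_input_length (1024 in this setup). One way to probe that in isolation is to start tgi-gaudi standalone with the padding multiple pinned at or below 1024 and repeat the health probe. This is only a sketch: the image tag, the Habana runtime flags, and the explicit PAD_SEQUENCE_TO_MULTIPLE_OF value are assumptions taken from the tgi-gaudi README, not from the v0.8 compose file.

# sketch: run tgi-gaudi on its own with PAD_SEQUENCE_TO_MULTIPLE_OF pinned below --max-input-length
# (image tag and Habana runtime flags are assumptions; adjust to match the v0.8 compose file)
docker run -d --name tgi-gaudi-test \
  --runtime=habana --cap-add=sys_nice --ipc=host \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e PAD_SEQUENCE_TO_MULTIPLE_OF=128 \
  -p 8008:80 -v $PWD/data:/data \
  ghcr.io/huggingface/tgi-gaudi:2.0.1 \
  --model-id Intel/neural-chat-7b-v3-3 \
  --max-input-length 1024 --max-total-tokens 2048

# then repeat the health probe
curl -v http://localhost:8008/health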