
tgi gaudi fails with health test in ChatQnA

lianhao opened this issue on Jul 30, 2024 · 8 comments

Using the v0.8 version of the ChatQnA example, the tgi-gaudi service fails its health check.

Environment:

  • OS: ubuntu 22.04
  • Docker ce: 27.0.3
  • Gaudi sw driver: 1.16.1-c48c5b4

Steps to reproduce:

  1. Run docker compose -f compose.yaml up -d following the README.
  2. Run the following to test tgi-gaudi health (a small polling sketch is also included after the logs below):
$ curl -v http://localhost:8008/health
*   Trying 127.0.0.1:8008...
* Connected to localhost (127.0.0.1) port 8008 (#0)
> GET /health HTTP/1.1
> Host: localhost:8008
> User-Agent: curl/7.81.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 503 Service Unavailable
< content-type: application/json
< content-length: 48
< access-control-allow-origin: *
< vary: origin
< vary: access-control-request-method
< vary: access-control-request-headers
< date: Tue, 30 Jul 2024 01:34:08 GMT
<
* Connection #0 to host localhost left intact
{"error":"unhealthy","error_type":"healthcheck"}
  3. Check the tgi-gaudi container logs, which show the following:
$ sudo docker compose -f docker_compose.tgi-gaudi logs tgi-service
WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "http_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "https_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] /scratch-4/lianhao/docker_compose.tgi-gaudi: `version` is obsolete
tgi-gaudi-server  | 2024-07-30T01:30:55.958772Z  INFO text_generation_launcher: Args { model_id: "Intel/neural-chat-7b-v3-3", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(1024), max_total_tokens: Some(2048), waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "c36267e1761e", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }
tgi-gaudi-server  | 2024-07-30T01:30:55.958880Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
tgi-gaudi-server  | 2024-07-30T01:30:56.122105Z  INFO text_generation_launcher: Model supports up to 32768 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=32818 --max-total-tokens=32768 --max-input-tokens=32767`.
tgi-gaudi-server  | 2024-07-30T01:30:56.122117Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 1074
tgi-gaudi-server  | 2024-07-30T01:30:56.122120Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
tgi-gaudi-server  | 2024-07-30T01:30:56.122198Z  INFO download: text_generation_launcher: Starting download process.
tgi-gaudi-server  | 2024-07-30T01:30:58.612197Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
tgi-gaudi-server  |
tgi-gaudi-server  | 2024-07-30T01:30:59.024916Z  INFO download: text_generation_launcher: Successfully downloaded weights.
tgi-gaudi-server  | 2024-07-30T01:30:59.025177Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
tgi-gaudi-server  | 2024-07-30T01:31:03.229349Z  INFO text_generation_launcher: CLI SHARDED = False DTYPE = bfloat16
tgi-gaudi-server  |
tgi-gaudi-server  | 2024-07-30T01:31:09.033606Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
tgi-gaudi-server  | 2024-07-30T01:31:12.336243Z  INFO shard-manager: text_generation_launcher: Shard ready in 13.310250704s rank=0
tgi-gaudi-server  | 2024-07-30T01:31:14.337458Z  INFO text_generation_launcher: Starting Webserver
tgi-gaudi-server  | 2024-07-30T01:31:14.353426Z  INFO text_generation_router: router/src/main.rs:207: Using the Hugging Face API
tgi-gaudi-server  | 2024-07-30T01:31:14.353471Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
tgi-gaudi-server  | 2024-07-30T01:31:14.562604Z  INFO text_generation_router: router/src/main.rs:500: Serving revision bdd31cf498d13782cc7497cba5896996ce429f91 of model Intel/neural-chat-7b-v3-3
tgi-gaudi-server  | 2024-07-30T01:31:14.562627Z  INFO text_generation_router: router/src/main.rs:282: Using config Some(Mistral)
tgi-gaudi-server  | 2024-07-30T01:31:14.562638Z  INFO text_generation_router: router/src/main.rs:294: Using the Hugging Face API to retrieve tokenizer config
tgi-gaudi-server  | 2024-07-30T01:31:14.566701Z  INFO text_generation_router: router/src/main.rs:343: Warming up model
tgi-gaudi-server  | 2024-07-30T01:31:14.566722Z  WARN text_generation_router: router/src/main.rs:358: Model does not support automatic max batch total tokens
tgi-gaudi-server  | 2024-07-30T01:31:14.566724Z  INFO text_generation_router: router/src/main.rs:380: Setting max batch total tokens to 16000
tgi-gaudi-server  | 2024-07-30T01:31:14.566726Z  INFO text_generation_router: router/src/main.rs:381: Connected
tgi-gaudi-server  | 2024-07-30T01:31:14.566728Z  WARN text_generation_router: router/src/main.rs:395: Invalid hostname, defaulting to 0.0.0.0
tgi-gaudi-server  | 2024-07-30T01:34:08.253441Z ERROR health:prefill{id=18446744073709551615 size=1}:prefill{id=18446744073709551615 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: PAD_SEQUENCE_TO_MULTIPLE_OF cannot be higher than max_input_length
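
For reference, this is how I poll the health endpoint while waiting for the service to come up (a minimal sketch, assuming the same localhost:8008 port mapping as above; in this case it never succeeds because of the error in the logs):

# Sketch: poll the tgi-gaudi health endpoint until it reports healthy
until curl -sf http://localhost:8008/health > /dev/null; do
    echo "tgi-gaudi not healthy yet, retrying in 5s..."
    sleep 5
done
echo "tgi-gaudi is healthy"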

The error suggests the health check's prefill request fails because PAD_SEQUENCE_TO_MULTIPLE_OF is set higher than the configured max_input_length (1024). The Xeon version doesn't have this issue.
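
A possible workaround sketch (untested; the service name and values are assumptions based on the v0.8 compose.yaml, and I'm assuming tgi-gaudi reads PAD_SEQUENCE_TO_MULTIPLE_OF from the environment) would be to keep the pad multiple at or below the configured max input length, e.g. via a compose override:

# docker compose override sketch (assumptions noted above)
services:
  tgi-service:
    environment:
      PAD_SEQUENCE_TO_MULTIPLE_OF: 128   # must not exceed --max-input-length (1024 here)

Alternatively, raising --max-input-length above whatever pad multiple the Gaudi image defaults to might also avoid the check, but I haven't verified that.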

lianhao · Jul 30 '24 01:07