
tgi-gaudi fails the health check in ChatQnA

Open lianhao opened this issue 1 year ago • 8 comments

Using the v0.8 version of the ChatQnA example, the tgi service fails the health check.

Environment:

  • OS: ubuntu 22.04
  • Docker ce: 27.0.3
  • Gaudi sw driver: 1.16.1-c48c5b4

Steps to reproduce:

  1. run docker compose -f compose.yaml up -d by following the README
  2. run the following to test tgi-gaudi health
$ curl -v http://localhost:8008/health
*   Trying 127.0.0.1:8008...
* Connected to localhost (127.0.0.1) port 8008 (#0)
> GET /health HTTP/1.1
> Host: localhost:8008
> User-Agent: curl/7.81.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 503 Service Unavailable
< content-type: application/json
< content-length: 48
< access-control-allow-origin: *
< vary: origin
< vary: access-control-request-method
< vary: access-control-request-headers
< date: Tue, 30 Jul 2024 01:34:08 GMT
<
* Connection #0 to host localhost left intact
{"error":"unhealthy","error_type":"healthcheck"}
  3. check the tgi-gaudi container logs, which show the following:
$ sudo docker compose -f docker_compose.tgi-gaudi logs tgi-service
WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "http_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "https_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] /scratch-4/lianhao/docker_compose.tgi-gaudi: `version` is obsolete
tgi-gaudi-server  | 2024-07-30T01:30:55.958772Z  INFO text_generation_launcher: Args { model_id: "Intel/neural-chat-7b-v3-3", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(1024), max_total_tokens: Some(2048), waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "c36267e1761e", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }
tgi-gaudi-server  | 2024-07-30T01:30:55.958880Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
tgi-gaudi-server  | 2024-07-30T01:30:56.122105Z  INFO text_generation_launcher: Model supports up to 32768 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=32818 --max-total-tokens=32768 --max-input-tokens=32767`.
tgi-gaudi-server  | 2024-07-30T01:30:56.122117Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 1074
tgi-gaudi-server  | 2024-07-30T01:30:56.122120Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
tgi-gaudi-server  | 2024-07-30T01:30:56.122198Z  INFO download: text_generation_launcher: Starting download process.
tgi-gaudi-server  | 2024-07-30T01:30:58.612197Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
tgi-gaudi-server  |
tgi-gaudi-server  | 2024-07-30T01:30:59.024916Z  INFO download: text_generation_launcher: Successfully downloaded weights.
tgi-gaudi-server  | 2024-07-30T01:30:59.025177Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
tgi-gaudi-server  | 2024-07-30T01:31:03.229349Z  INFO text_generation_launcher: CLI SHARDED = False DTYPE = bfloat16
tgi-gaudi-server  |
tgi-gaudi-server  | 2024-07-30T01:31:09.033606Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
tgi-gaudi-server  | 2024-07-30T01:31:12.336243Z  INFO shard-manager: text_generation_launcher: Shard ready in 13.310250704s rank=0
tgi-gaudi-server  | 2024-07-30T01:31:14.337458Z  INFO text_generation_launcher: Starting Webserver
tgi-gaudi-server  | 2024-07-30T01:31:14.353426Z  INFO text_generation_router: router/src/main.rs:207: Using the Hugging Face API
tgi-gaudi-server  | 2024-07-30T01:31:14.353471Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
tgi-gaudi-server  | 2024-07-30T01:31:14.562604Z  INFO text_generation_router: router/src/main.rs:500: Serving revision bdd31cf498d13782cc7497cba5896996ce429f91 of model Intel/neural-chat-7b-v3-3
tgi-gaudi-server  | 2024-07-30T01:31:14.562627Z  INFO text_generation_router: router/src/main.rs:282: Using config Some(Mistral)
tgi-gaudi-server  | 2024-07-30T01:31:14.562638Z  INFO text_generation_router: router/src/main.rs:294: Using the Hugging Face API to retrieve tokenizer config
tgi-gaudi-server  | 2024-07-30T01:31:14.566701Z  INFO text_generation_router: router/src/main.rs:343: Warming up model
tgi-gaudi-server  | 2024-07-30T01:31:14.566722Z  WARN text_generation_router: router/src/main.rs:358: Model does not support automatic max batch total tokens
tgi-gaudi-server  | 2024-07-30T01:31:14.566724Z  INFO text_generation_router: router/src/main.rs:380: Setting max batch total tokens to 16000
tgi-gaudi-server  | 2024-07-30T01:31:14.566726Z  INFO text_generation_router: router/src/main.rs:381: Connected
tgi-gaudi-server  | 2024-07-30T01:31:14.566728Z  WARN text_generation_router: router/src/main.rs:395: Invalid hostname, defaulting to 0.0.0.0
tgi-gaudi-server  | 2024-07-30T01:34:08.253441Z ERROR health:prefill{id=18446744073709551615 size=1}:prefill{id=18446744073709551615 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: PAD_SEQUENCE_TO_MULTIPLE_OF cannot be higher than max_input_length

The Xeon version doesn't have this issue.
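
For reference, the error above says PAD_SEQUENCE_TO_MULTIPLE_OF exceeds max_input_length (1024 here), so one possible workaround is to pin the padding multiple below that limit. A minimal compose.yaml sketch, assuming the tgi-gaudi image reads PAD_SEQUENCE_TO_MULTIPLE_OF from the environment as the error implies:

  tgi-service:
    # keep the image and remaining settings from the v0.8 compose.yaml
    environment:
      PAD_SEQUENCE_TO_MULTIPLE_OF: 128   # assumption: must stay <= --max-input-length
    command: --model-id Intel/neural-chat-7b-v3-3 --max-input-length 1024 --max-total-tokens 2048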

lianhao avatar Jul 30 '24 01:07 lianhao

Thanks for bringing up this issue. We will try to reproduce it and look into it.

YuningQiu avatar Aug 05 '24 15:08 YuningQiu

Is there a way to extend the timeout for waiting for the shard to be ready beyond 10 minutes? I have a similar issue while running OPEA's ChatQnA deployed as kata-qemu-tdx (with TDX protection for the pods), and in this case 10 minutes is not enough for the TGI service to wait for the shards to be ready. Even with the pod's memory set to 128GB I can see the 10-minute timeout occur, with the logs below:

{"timestamp":"2024-08-28T13:57:27.484938Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-28T13:57:27.533891Z","level":"INFO","fields":{"message":"Terminating shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-08-28T13:57:27.533945Z","level":"INFO","fields":{"message":"Waiting for shard to gracefully shutdown"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-08-28T13:57:27.836640Z","level":"INFO","fields":{"message":"shard terminated"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}

Then the whole pod (which is running but unhealthy) is restarted again and again.

ksandowi avatar Aug 28 '24 15:08 ksandowi

You may increase the failureThreshold of the corresponding chatqna-tgi pod's startupProbe, see for example https://github.com/opea-project/GenAIExamples/blob/6f3e54a22a1800570eab0291b9325946e8f02288/ChatQnA/kubernetes/manifests/xeon/chatqna.yaml#L1148
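
For illustration, an adjusted startupProbe could look like the sketch below (values are just an example; failureThreshold × periodSeconds is how long Kubernetes waits for the container to become ready before restarting it):

            startupProbe:
              tcpSocket:
                port: http
              initialDelaySeconds: 5
              periodSeconds: 5
              failureThreshold: 240   # 240 * 5s = 20 min before the container is restarted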

lianhao avatar Aug 29 '24 01:08 lianhao

Thanks a lot. I extended it to 40 min, but unfortunately shard preparation hasn't finished within this time if I deploy the TGI service as kata-qemu-tdx (with TDX protection). Any hint on how to speed up shard preparation? More memory assigned to the service? I already used 32GB.

ksandowi avatar Aug 30 '24 09:08 ksandowi

Thanks a lot. I extended it to 40 min, but unfortunately shard preparation hasn't finished within this time if I deploy the TGI service as kata-qemu-tdx (with TDX protection). Any hint on how to speed up shard preparation? More memory assigned to the service? I already used 32GB.

Is the model data already present so that the kata-qemu-tdx container can access it? Or does TGI try to download it from the network?

eero-t avatar Oct 02 '24 14:10 eero-t

The TD VM (kata-qemu-tdx) pod is created without persistent storage, so when a new TGI pod is deployed it has to download the model from the network. Each instance of the TGI service will do this separately. BTW, to run TGI in a TD VM, OPEA v1.0 requires a patch that consists of:

       labels:
         app.kubernetes.io/name: tgi
         app.kubernetes.io/instance: chatqna
+      annotations:
+        io.katacontainers.config.runtime.create_container_timeout: "800"
     spec:
+      runtimeClassName: kata-qemu-tdx
       securityContext:

and additionally

             startupProbe:
-            failureThreshold: 120
+            failureThreshold: 240
             initialDelaySeconds: 5
             periodSeconds: 5
             tcpSocket:
               port: http
           resources:
-            {}
+            limits:
+              memory: "80Gi"

ksandowi avatar Oct 02 '24 16:10 ksandowi

The TD VM (kata-qemu-tdx) pod is created without persistent storage, so when a new TGI pod is deployed it has to download the model from the network.

I assume TDX is used for security reasons? Besides the obvious slowdown, downloading data from the internet on every container startup is not really the way to go if one wants to be secure. Isn't there any way to provide a storage volume for those?

eero-t avatar Oct 02 '24 17:10 eero-t

Yes, for security reasons. Persistent storage should be used to share the model among multiple TGI replicas. Even a single TGI pod needs to download the model (>32GB) from the network, which takes time even without TDX. This affects container creation, so a timeout needs to be specified to avoid the container being restarted while the model is downloading.
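
For reference, a rough sketch of what such a shared volume could look like in the Kubernetes manifest (the PVC name, access mode and size are assumptions; /data matches the huggingface_hub_cache path in the TGI log above):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tgi-model-cache              # hypothetical name
spec:
  accessModes: ["ReadWriteMany"]     # lets several TGI replicas share one downloaded copy; needs a storage class that supports it
  resources:
    requests:
      storage: 64Gi                  # the model alone is >32GB, leave some headroom

and then in the TGI pod spec:

      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: tgi-model-cache
      containers:
        - name: tgi
          volumeMounts:
            - name: model-cache
              mountPath: /data       # TGI's huggingface_hub_cache in the log above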

ksandowi avatar Oct 02 '24 17:10 ksandowi

@lianhao @eero-t @ksandowi

We have moved to v1.2, and there are no TGI Gaudi issues.

Is this issue still valid? Could we close it and open a new one if there is an issue in v1.2?

xiguiw avatar Mar 04 '25 00:03 xiguiw

I agree that this can be closed. We have already implemented the ChatQnA deployment (#799) with TDX support enabled successfully.

ksandowi avatar Mar 04 '25 14:03 ksandowi