tgi-gaudi fails health check in ChatQnA
Using the v0.8 version of the ChatQnA example, the tgi service fails its health check.
Environment:
- OS: ubuntu 22.04
- Docker ce: 27.0.3
- Gaudi sw driver: 1.16.1-c48c5b4
Steps to reproduce:
- run docker compose -f compose.yaml up -d by following the README
- run the following to test tgi-gaudi health:
$ curl -v http://localhost:8008/health
* Trying 127.0.0.1:8008...
* Connected to localhost (127.0.0.1) port 8008 (#0)
> GET /health HTTP/1.1
> Host: localhost:8008
> User-Agent: curl/7.81.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 503 Service Unavailable
< content-type: application/json
< content-length: 48
< access-control-allow-origin: *
< vary: origin
< vary: access-control-request-method
< vary: access-control-request-headers
< date: Tue, 30 Jul 2024 01:34:08 GMT
<
* Connection #0 to host localhost left intact
{"error":"unhealthy","error_type":"healthcheck"}
- checking the tgi-gaudi container logs shows the following:
$ sudo docker compose -f docker_compose.tgi-gaudi logs tgi-service
WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "http_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "https_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] /scratch-4/lianhao/docker_compose.tgi-gaudi: `version` is obsolete
tgi-gaudi-server | 2024-07-30T01:30:55.958772Z INFO text_generation_launcher: Args { model_id: "Intel/neural-chat-7b-v3-3", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(1024), max_total_tokens: Some(2048), waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "c36267e1761e", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }
tgi-gaudi-server | 2024-07-30T01:30:55.958880Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
tgi-gaudi-server | 2024-07-30T01:30:56.122105Z INFO text_generation_launcher: Model supports up to 32768 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=32818 --max-total-tokens=32768 --max-input-tokens=32767`.
tgi-gaudi-server | 2024-07-30T01:30:56.122117Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 1074
tgi-gaudi-server | 2024-07-30T01:30:56.122120Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
tgi-gaudi-server | 2024-07-30T01:30:56.122198Z INFO download: text_generation_launcher: Starting download process.
tgi-gaudi-server | 2024-07-30T01:30:58.612197Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
tgi-gaudi-server |
tgi-gaudi-server | 2024-07-30T01:30:59.024916Z INFO download: text_generation_launcher: Successfully downloaded weights.
tgi-gaudi-server | 2024-07-30T01:30:59.025177Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
tgi-gaudi-server | 2024-07-30T01:31:03.229349Z INFO text_generation_launcher: CLI SHARDED = False DTYPE = bfloat16
tgi-gaudi-server |
tgi-gaudi-server | 2024-07-30T01:31:09.033606Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
tgi-gaudi-server | 2024-07-30T01:31:12.336243Z INFO shard-manager: text_generation_launcher: Shard ready in 13.310250704s rank=0
tgi-gaudi-server | 2024-07-30T01:31:14.337458Z INFO text_generation_launcher: Starting Webserver
tgi-gaudi-server | 2024-07-30T01:31:14.353426Z INFO text_generation_router: router/src/main.rs:207: Using the Hugging Face API
tgi-gaudi-server | 2024-07-30T01:31:14.353471Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
tgi-gaudi-server | 2024-07-30T01:31:14.562604Z INFO text_generation_router: router/src/main.rs:500: Serving revision bdd31cf498d13782cc7497cba5896996ce429f91 of model Intel/neural-chat-7b-v3-3
tgi-gaudi-server | 2024-07-30T01:31:14.562627Z INFO text_generation_router: router/src/main.rs:282: Using config Some(Mistral)
tgi-gaudi-server | 2024-07-30T01:31:14.562638Z INFO text_generation_router: router/src/main.rs:294: Using the Hugging Face API to retrieve tokenizer config
tgi-gaudi-server | 2024-07-30T01:31:14.566701Z INFO text_generation_router: router/src/main.rs:343: Warming up model
tgi-gaudi-server | 2024-07-30T01:31:14.566722Z WARN text_generation_router: router/src/main.rs:358: Model does not support automatic max batch total tokens
tgi-gaudi-server | 2024-07-30T01:31:14.566724Z INFO text_generation_router: router/src/main.rs:380: Setting max batch total tokens to 16000
tgi-gaudi-server | 2024-07-30T01:31:14.566726Z INFO text_generation_router: router/src/main.rs:381: Connected
tgi-gaudi-server | 2024-07-30T01:31:14.566728Z WARN text_generation_router: router/src/main.rs:395: Invalid hostname, defaulting to 0.0.0.0
tgi-gaudi-server | 2024-07-30T01:34:08.253441Z ERROR health:prefill{id=18446744073709551615 size=1}:prefill{id=18446744073709551615 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: PAD_SEQUENCE_TO_MULTIPLE_OF cannot be higher than max_input_length
The Xeon version doesn't have this issue.
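For anyone hitting the same error: assuming PAD_SEQUENCE_TO_MULTIPLE_OF can be overridden through the tgi-gaudi container's environment, and that the compose service is named tgi-service as in the logs above (both assumptions), an untested workaround sketch would be to keep that value at or below max-input-length:

# untested sketch of a compose.yaml override; only the environment entry is new,
# all other fields of the existing tgi-service definition are omitted here
tgi-service:
  environment:
    # must not exceed --max-input-length (1024 in the launcher args above)
    PAD_SEQUENCE_TO_MULTIPLE_OF: 1024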
Thanks for bringing up this issue. We will try to reproduce it and look into it.
Is there a way to extend the timeout for shards to become ready beyond 10 minutes? I have a similar issue while running OPEA's ChatQnA deployed as kata-qemu-tdx (with TDX protection for pods), and it looks like in this case 10 minutes is not enough for the TGI service to wait for shards to be ready. Even if memory for the pod is set to 128GB, I can see that some 10-minute timeout occurs, with the logs below:
{"timestamp":"2024-08-28T13:57:27.484938Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-28T13:57:27.533891Z","level":"INFO","fields":{"message":"Terminating shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-08-28T13:57:27.533945Z","level":"INFO","fields":{"message":"Waiting for shard to gracefully shutdown"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-08-28T13:57:27.836640Z","level":"INFO","fields":{"message":"shard terminated"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Then the whole pod (which is running but unhealthy) is restarted again and again.
You may increase the failureThreshold of the corresponding chatqna-tgi pod's startupProbe, something like https://github.com/opea-project/GenAIExamples/blob/6f3e54a22a1800570eab0291b9325946e8f02288/ChatQnA/kubernetes/manifests/xeon/chatqna.yaml#L1148
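For reference, the startup window is roughly initialDelaySeconds + failureThreshold * periodSeconds, so with 5-second periods a longer window could look like the following sketch (values are illustrative, not the manifest's defaults):

# illustrative startupProbe for the chatqna-tgi container;
# 5s initial delay + 480 probes * 5s period gives roughly a 40-minute budget
startupProbe:
  failureThreshold: 480
  initialDelaySeconds: 5
  periodSeconds: 5
  tcpSocket:
    port: http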
Thanks a lot. I extended it to 40 minutes, but unfortunately shard preparation hasn't finished within this time when I deploy the TGI service as kata-qemu-tdx (with TDX protection). Any hint how to speed up shard preparation? More memory assigned to the service? I already used 32GB.
Is the model data already present so that the kata-qemu-tdx container can access it? Or does TGI try to download it from the network?
The TD VM (kata-qemu-tdx) pod is created without persistent storage, so while deploying a new TGI pod, it has to download the model from the network. Each instance of the TGI service will do this separately. BTW, to run TGI in a TD VM, OPEA v1.0 requires a patch that consists of:
   labels:
     app.kubernetes.io/name: tgi
     app.kubernetes.io/instance: chatqna
+  annotations:
+    io.katacontainers.config.runtime.create_container_timeout: "800"
 spec:
+  runtimeClassName: kata-qemu-tdx
   securityContext:
and additionally
   startupProbe:
-    failureThreshold: 120
+    failureThreshold: 240
     initialDelaySeconds: 5
     periodSeconds: 5
     tcpSocket:
       port: http
   resources:
-    {}
+    limits:
+      memory: "80Gi"
The TD VM (kata-qemu-tdx) pod is created without persistent storage, so while deploying a new TGI pod, it has to download the model from the network.
I assume TDX is used for security reasons? Besides the obvious slowdown, downloading data from the internet on every container startup is not really the way to work if one wants to be secure. Isn't there any way to provide a storage volume for those?
Yes, for security reasons. Persistent storage should be used to share the model data among multiple TGI replicas. Even a single TGI pod needs to download the model data (>32GB) from the network, which takes time even without TDX. This affects container creation, so a timeout needs to be specified to avoid the container restarting while the model data is downloading.
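As an illustration of that approach (claim name, access mode and size are assumptions, not taken from the ChatQnA manifests), a PersistentVolumeClaim mounted at TGI's /data cache lets replicas reuse an already-downloaded model:

# hypothetical PVC for a shared model cache
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tgi-model-cache
spec:
  accessModes: ["ReadWriteMany"]   # required if several TGI replicas mount it
  resources:
    requests:
      storage: 100Gi
# and in the tgi deployment's pod spec:
#   volumes:
#     - name: model-cache
#       persistentVolumeClaim:
#         claimName: tgi-model-cache
#   container volumeMounts:
#     - name: model-cache
#       mountPath: /data   # the huggingface_hub_cache path shown in the logs above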
@lianhao @eero-t @ksandowi
We have moved to v1.2, and there are no TGI Gaudi issues.
Is this issue still valid? Could we close it and open a new one if there is an issue in v1.2?
I agree that this can be closed. We already successfully implemented the ChatQnA deployment with TDX support enabled in #799.