Unable to start TGI with llama3-70b
System Info
We are using the official TGI Docker image, versions 1.4.4 and 2.0.1 (tried with both). The model is https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct
We are running on Google GKE with 4 vCPUs, 64 GB RAM and 2 L4 GPUs. I have tried running with eetq quantization and with no quantization at all.
These are the environment variables:
- name: MODEL_ID
value: meta-llama/Meta-Llama-3-70B-Instruct
- name: JSON_OUTPUT
value: 'true'
- name: MAX_TOTAL_TOKENS
value: '4096'
- name: MAX_INPUT_LENGTH
value: '2048'
- name: NUM_SHARD
value: '2'
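For reference, a rough standalone `docker run` equivalent of the pod spec above (outside GKE) would look something like the sketch below; the token and the volume path are placeholders, and `--quantize eetq` was only used in one of the attempts:

```bash
# Rough docker-run equivalent of the pod configuration above.
# <your-hf-token> is a placeholder; drop --quantize eetq for the unquantized attempt.
docker run --gpus all --shm-size 1g \
  -e HUGGING_FACE_HUB_TOKEN=<your-hf-token> \
  -v $PWD/data:/data \
  -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:2.0.1 \
  --model-id meta-llama/Meta-Llama-3-70B-Instruct \
  --num-shard 2 \
  --max-input-length 2048 \
  --max-total-tokens 4096 \
  --quantize eetq \
  --json-output
```

(`--shm-size 1g` follows the TGI README recommendation for sharded models, since NCCL uses shared memory for inter-shard communication.)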
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- I just start the pod with the official TGI image and the mentioned model
- After successfully downloading the model, TGI just keeps printing the following messages indefinitely:
{"timestamp":"2024-04-25T09:28:16.112934Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-04-25T09:28:16.300343Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
Expected behavior
I expect the server to start up, as all other models work in our environment.
Update: it does seem to load with 4 L4 GPUs and bitsandbytes quantization, but it takes around two hours of waiting (printing the "Waiting for shard to be ready..." message the whole time). What is happening during this time, and is there any way to speed it up?
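For what it's worth, a back-of-the-envelope estimate (assuming ~24 GB VRAM per L4) of why 2 GPUs are not enough: 70B parameters in fp16 need roughly 70e9 × 2 bytes ≈ 140 GB for the weights alone, versus 2 × 24 GB = 48 GB available; even at 8-bit (eetq/bitsandbytes) the weights are ~70 GB, which only fits once you have 4 × 24 GB = 96 GB. My assumption is that the long wait is the on-the-fly 8-bit quantization of the fp16 checkpoint at load time; loading an already-quantized checkpoint (e.g. an AWQ or GPTQ export with the matching `--quantize` value) would likely start much faster, though I haven't verified that here.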