Unable to start TGI with llama3-70b
System Info
We are using the official TGI Docker image, versions 1.4.4 and 2.0.1 (tried with both). The model is https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct
We are running on Google GKE with 4 vCPUs, 64 GB RAM and 2 L4 GPUs. I have tried running with eetq quantization and with no quantization at all.
These are the environment variables:
- name: MODEL_ID
value: meta-llama/Meta-Llama-3-70B-Instruct
- name: JSON_OUTPUT
value: 'true'
- name: MAX_TOTAL_TOKENS
value: '4096'
- name: MAX_INPUT_LENGTH
value: '2048'
- name: NUM_SHARD
value: '2'
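For reference, a rough standalone `docker run` equivalent of the pod spec above (outside GKE) would look something like the sketch below; the token and the volume path are placeholders, and `--quantize eetq` was only used in one of the attempts:

```bash
# Rough docker-run equivalent of the pod configuration above.
# <your-hf-token> is a placeholder; drop --quantize eetq for the unquantized attempt.
docker run --gpus all --shm-size 1g \
  -e HUGGING_FACE_HUB_TOKEN=<your-hf-token> \
  -v $PWD/data:/data \
  -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:2.0.1 \
  --model-id meta-llama/Meta-Llama-3-70B-Instruct \
  --num-shard 2 \
  --max-input-length 2048 \
  --max-total-tokens 4096 \
  --quantize eetq \
  --json-output
```

(`--shm-size 1g` follows the TGI README recommendation for sharded models, since NCCL uses shared memory for inter-shard communication.)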
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- I just start the pod with the official TGI image and the mentioned model
- After successfully downloading the model, TGI just keeps printing the following messages indefinitely:
{"timestamp":"2024-04-25T09:28:16.112934Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-04-25T09:28:16.300343Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
Expected behavior
I expect the server to start up, as all other models work in our environment.
Update: it does seem to load with 4 L4 GPUs and bitsandbytes quantization, but it takes around two hours of waiting (printing the "Waiting for shard to be ready..." message the whole time). What is happening during this time, and is there any way to speed it up?
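For what it's worth, a back-of-the-envelope estimate (assuming ~24 GB VRAM per L4) of why 2 GPUs are not enough: 70B parameters in fp16 need roughly 70e9 × 2 bytes ≈ 140 GB for the weights alone, versus 2 × 24 GB = 48 GB available; even at 8-bit (eetq/bitsandbytes) the weights are ~70 GB, which only fits once you have 4 × 24 GB = 96 GB. My assumption is that the long wait is the on-the-fly 8-bit quantization of the fp16 checkpoint at load time; loading an already-quantized checkpoint (e.g. an AWQ or GPTQ export with the matching `--quantize` value) would likely start much faster, though I haven't verified that here.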