text-generation-inference
Long waiting time for shard loading
System Info
ghcr.io/huggingface/text-generation-inference:sha-7766fee
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
text_generation_launcher: Args { model_id: "/data/Baichuan-13B-Base", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 4096, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 8093, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
text_generation_launcher: trust_remote_code is set. Trusting that model /data/Baichuan-13B-Base do not contain malicious code.
download: text_generation_launcher: Starting download process.
text_generation_launcher: Files are already present on the host. Skipping download.
download: text_generation_launcher: Successfully downloaded weights.
shard-manager: text_generation_launcher: Starting shard rank=0
shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
(the line above repeated roughly 30 times over about five minutes)
text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
shard-manager: text_generation_launcher: Shard ready in 313.631692845s rank=0
text_generation_launcher: Starting Webserver
router/src/main.rs:163: Could not find a fast tokenizer implementation for /data/Baichuan-13B-Base
router/src/main.rs:166: Rust input length validation and truncation is disabled
router/src/main.rs:191: no pipeline tag found for model /data/Baichuan-13B-Base
router/src/main.rs:210: Warming up model
router/src/main.rs:221: Model does not support automatic max batch total tokens
router/src/main.rs:243: Setting max batch total tokens to 16000
router/src/main.rs:244: Connected
Expected behavior
Shard loading usually takes only a few seconds. How can I troubleshoot this timeout-like behavior and see where the elapsed time is actually going?
Seeing this today as well, in an AMI I created yesterday when everything was still working.
Docker image: ghcr.io/huggingface/text-generation-inference:1.0.0
Model: meta-llama/Llama-2-7b-chat-hf
EC2 instance: g5.2xl
NVIDIA drivers: 535.86.10
CUDA compilation tools: release 11.5, V11.5.119, Build cuda_11.5.r11.5/compiler.30672275_0
router/src/main.rs:166: Rust input length validation and truncation is disabled
Note: this will hurt performance. The Rust router cannot count tokens, so it will always assume the longest possible prompt when scheduling.
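For context, here is a minimal sketch of what that token counting amounts to, using the Python binding of the same Rust `tokenizers` library the router builds on (the file path below is hypothetical):

```python
# A sketch, not the router's actual code: the router relies on the Rust
# `tokenizers` crate, and this Python binding loads the same tokenizer.json,
# so it approximates the per-request counting the router does.
from tokenizers import Tokenizer

# hypothetical path; tokenizer.json only exists if the repo ships a fast tokenizer
tok = Tokenizer.from_file("/data/some-model/tokenizer.json")

prompt = "An example prompt"
n_tokens = len(tok.encode(prompt).ids)
print(n_tokens)
# Without tokenizer.json this count is unavailable, so the router must
# schedule every request as if it used max_input_length tokens.
```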
The model you are loading appears to be a custom one, so the loading time is not really under our control. Loading can also be slow if you are using network-mounted disks (which are, by definition, slow to read).
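If a slow mount is suspected, a rough sketch like the following can measure raw read throughput of the weight files (the /data path matches the launcher args above; adjust to your setup):

```python
# A rough read-throughput check for the weight files; a slow network mount
# will show up here long before TGI finishes loading the shard.
import pathlib
import time

model_dir = pathlib.Path("/data/Baichuan-13B-Base")
weight_files = sorted(model_dir.glob("*.safetensors")) + sorted(model_dir.glob("*.bin"))

total_bytes = 0
start = time.monotonic()
for path in weight_files:
    with open(path, "rb") as fh:
        while chunk := fh.read(1 << 24):  # read in 16 MiB chunks
            total_bytes += len(chunk)
elapsed = time.monotonic() - start
print(f"read {total_bytes / 1e9:.1f} GB in {elapsed:.1f}s "
      f"({total_bytes / 1e9 / max(elapsed, 1e-9):.2f} GB/s)")
```

Note that a second run may be inflated by the OS page cache, so the first cold run is the meaningful one.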
That's helpful, I'll double-check the custom code. By the way, is the Rust router always disabled when custom code is used? If not, what requirements does the model repo need to meet?
No. The router just needs a Fast version of the tokenizer to be available (usually, having tokenizer.json present in the repo is enough). The router is written in Rust and has to count tokens without spinning up Python.
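A quick way to check whether a local repo provides a fast tokenizer, and to write out tokenizer.json when a conversion exists (a sketch; custom tokenizers like Baichuan's may have no fast implementation at all, in which case is_fast stays False):

```python
# A sketch for checking fast-tokenizer availability. For custom-code models
# the slow-to-fast conversion may simply not exist.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/data/Baichuan-13B-Base", trust_remote_code=True)
print(type(tok).__name__, "is_fast =", tok.is_fast)

if tok.is_fast:
    # Saving a fast tokenizer writes tokenizer.json, which the Rust router can load.
    tok.save_pretrained("/data/Baichuan-13B-Base")
```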
After running some tests, I can confirm @Narsil's observation about the Fast version of the tokenizer. The "loading timeout" arises because the router's automatic calculation fails, so it falls back to the non-fast tokenizer. If my assumption is correct, is there a way to force the model to load immediately and skip the waiting period?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.