text-generation-inference
Long waiting time for shard loading
System Info
ghcr.io/huggingface/text-generation-inference:sha-7766fee
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
text_generation_launcher: Args { model_id: "/data/Baichuan-13B-Base", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 4096, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 8093, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
text_generation_launcher: trust_remote_code is set. Trusting that model /data/Baichuan-13B-Base do not contain malicious code.
download: text_generation_launcher: Starting download process.
text_generation_launcher: Files are already present on the host. Skipping download.
download: text_generation_launcher: Successfully downloaded weights.
shard-manager: text_generation_launcher: Starting shard rank=0
shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
(the line above repeated roughly 30 times over about five minutes)
text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
shard-manager: text_generation_launcher: Shard ready in 313.631692845s rank=0
text_generation_launcher: Starting Webserver
router/src/main.rs:163: Could not find a fast tokenizer implementation for /data/Baichuan-13B-Base
router/src/main.rs:166: Rust input length validation and truncation is disabled
router/src/main.rs:191: no pipeline tag found for model /data/Baichuan-13B-Base
router/src/main.rs:210: Warming up model
router/src/main.rs:221: Model does not support automatic max batch total tokens
router/src/main.rs:243: Setting max batch total tokens to 16000
router/src/main.rs:244: Connected
Expected behavior
Shard loading usually takes only a few seconds. How can I troubleshoot this timeout-like behavior and see where the elapsed time is actually going?
Seeing this today as well, in an AMI I created yesterday when everything was still working.
Docker image: ghcr.io/huggingface/text-generation-inference:1.0.0
Model: meta-llama/Llama-2-7b-chat-hf
EC2 instance: g5.2xl
NVIDIA drivers: 535.86.10
CUDA compilation tools: release 11.5, V11.5.119, Build cuda_11.5.r11.5/compiler.30672275_0
router/src/main.rs:166: Rust input length validation and truncation is disabled
Note: this will hurt performance. The Rust router cannot count tokens, so it will always assume the longest possible prompt when scheduling.
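For context, here is a minimal sketch of what that token counting amounts to, using the Python binding of the same Rust `tokenizers` library the router builds on (the file path below is hypothetical):

```python
# A sketch, not the router's actual code: the router relies on the Rust
# `tokenizers` crate, and this Python binding loads the same tokenizer.json,
# so it approximates the per-request counting the router does.
from tokenizers import Tokenizer

# hypothetical path; tokenizer.json only exists if the repo ships a fast tokenizer
tok = Tokenizer.from_file("/data/some-model/tokenizer.json")

prompt = "An example prompt"
n_tokens = len(tok.encode(prompt).ids)
print(n_tokens)
# Without tokenizer.json this count is unavailable, so the router must
# schedule every request as if it used max_input_length tokens.
```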
The model you are loading appears to be a custom one, so the loading time is not really under our control. Loading can also be slow if you are using network-mounted disks (which are, by definition, slow to read).
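If a slow mount is suspected, a rough sketch like the following can measure raw read throughput of the weight files (the /data path matches the launcher args above; adjust to your setup):

```python
# A rough read-throughput check for the weight files; a slow network mount
# will show up here long before TGI finishes loading the shard.
import pathlib
import time

model_dir = pathlib.Path("/data/Baichuan-13B-Base")
weight_files = sorted(model_dir.glob("*.safetensors")) + sorted(model_dir.glob("*.bin"))

total_bytes = 0
start = time.monotonic()
for path in weight_files:
    with open(path, "rb") as fh:
        while chunk := fh.read(1 << 24):  # read in 16 MiB chunks
            total_bytes += len(chunk)
elapsed = time.monotonic() - start
print(f"read {total_bytes / 1e9:.1f} GB in {elapsed:.1f}s "
      f"({total_bytes / 1e9 / max(elapsed, 1e-9):.2f} GB/s)")
```

Note that a second run may be inflated by the OS page cache, so the first cold run is the meaningful one.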
That's helpful, I'll double-check the custom code. By the way, is the Rust router always disabled when custom code is used? If not, what requirements does the model repo need to meet?
No. The router just needs a Fast version of the tokenizer to be available (usually, having tokenizer.json present in the repo is enough). The router is written in Rust and has to count tokens without spinning up Python.
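A quick way to check whether a local repo provides a fast tokenizer, and to write out tokenizer.json when a conversion exists (a sketch; custom tokenizers like Baichuan's may have no fast implementation at all, in which case is_fast stays False):

```python
# A sketch for checking fast-tokenizer availability. For custom-code models
# the slow-to-fast conversion may simply not exist.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/data/Baichuan-13B-Base", trust_remote_code=True)
print(type(tok).__name__, "is_fast =", tok.is_fast)

if tok.is_fast:
    # Saving a fast tokenizer writes tokenizer.json, which the Rust router can load.
    tok.save_pretrained("/data/Baichuan-13B-Base")
```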
After running some tests, I can confirm @Narsil's observation about the Fast version of the tokenizer. The "loading timeout" arises because the router's automatic calculation fails, so it falls back to the non-fast tokenizer. If my assumption is correct, is there a way to force the model to load immediately and skip the waiting period?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.