
Cannot deploy Falcon-40B-instruct server because of low fixed timeout on startup

Open mzperix opened this issue 1 year ago • 14 comments

System Info

Problem

Using the 0.8.2 container with --model-id tiiuae/falcon-40b-instruct --num-shard 2 on runpod.io with 2xA100 80GB.

On startup it begins loading the two shards, but they time out after about 67 seconds (against the fixed 60-second timeout). This causes the whole process to restart, and the container gets stuck in a constant restart loop.

Suggestion

I found this part of the code, which appears to hard-code the timeout at 60 seconds during init. Could this be the problem?

Could this number perhaps be made an ENV var (with a fallback of 60 seconds, or maybe even higher)?
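
Something along the lines of the sketch below, for example. (This is only an illustration of the idea, not the actual TGI code; the NCCL_INIT_TIMEOUT variable name is made up, and the function merely mimics how a shard might initialise its process group.)

```python
# Rough sketch only -- not the real text-generation-server code.
# NCCL_INIT_TIMEOUT is a made-up variable name used for illustration.
import os
from datetime import timedelta

import torch
from torch.distributed import ProcessGroupNCCL


def initialize_torch_distributed(rank: int, world_size: int):
    # Fall back to the current 60 s default when the variable is not set.
    timeout = timedelta(seconds=int(os.getenv("NCCL_INIT_TIMEOUT", "60")))

    options = ProcessGroupNCCL.Options()
    options.is_high_priority_stream = True
    options._timeout = timeout

    # MASTER_ADDR / MASTER_PORT are assumed to already be set in the
    # environment, as the launcher does for each shard.
    torch.distributed.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        timeout=timeout,
        pg_options=options,
    )
    return torch.distributed.group.WORLD
```

That way the default behaviour stays the same, but deployments with slower interconnects could opt into a longer window.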

Output

2023-06-23T08:54:08.289822474Z 2023-06-23T08:54:08.289681Z  INFO text_generation_launcher: Args { model_id: "tiiuae/falcon-40b-instruct", revision: None, sharded: None, num_shard: Some(2), quantize: Some(Bitsandbytes), trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: false }
2023-06-23T08:54:08.289872253Z 2023-06-23T08:54:08.289704Z  INFO text_generation_launcher: Sharding model on 2 processes
2023-06-23T08:54:08.289882213Z 2023-06-23T08:54:08.289801Z  INFO text_generation_launcher: Starting download process.
2023-06-23T08:54:10.365830142Z 2023-06-23T08:54:10.365629Z  INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
2023-06-23T08:54:10.365864332Z 
2023-06-23T08:54:10.792246500Z 2023-06-23T08:54:10.792086Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-06-23T08:54:10.792395920Z 2023-06-23T08:54:10.792330Z  INFO text_generation_launcher: Starting shard 1
2023-06-23T08:54:10.792400230Z 2023-06-23T08:54:10.792334Z  INFO text_generation_launcher: Starting shard 0
2023-06-23T08:54:20.800596123Z 2023-06-23T08:54:20.800443Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-23T08:54:20.801631618Z 2023-06-23T08:54:20.801542Z  INFO text_generation_launcher: Waiting for shard 1 to be ready...
2023-06-23T08:54:30.807225193Z 2023-06-23T08:54:30.807040Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-23T08:54:30.809943610Z 2023-06-23T08:54:30.809854Z  INFO text_generation_launcher: Waiting for shard 1 to be ready...
2023-06-23T08:54:40.814816779Z 2023-06-23T08:54:40.814398Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-23T08:54:40.819149738Z 2023-06-23T08:54:40.818776Z  INFO text_generation_launcher: Waiting for shard 1 to be ready...
2023-06-23T08:54:50.822781354Z 2023-06-23T08:54:50.822611Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-23T08:54:50.826902264Z 2023-06-23T08:54:50.826831Z  INFO text_generation_launcher: Waiting for shard 1 to be ready...
2023-06-23T08:55:00.829537554Z 2023-06-23T08:55:00.829361Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-23T08:55:00.834434350Z 2023-06-23T08:55:00.834324Z  INFO text_generation_launcher: Waiting for shard 1 to be ready...
2023-06-23T08:55:10.836496773Z 2023-06-23T08:55:10.836365Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-23T08:55:10.841911817Z 2023-06-23T08:55:10.841828Z  INFO text_generation_launcher: Waiting for shard 1 to be ready...
2023-06-23T08:55:20.844219108Z 2023-06-23T08:55:20.844059Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-23T08:55:20.852337909Z 2023-06-23T08:55:20.852236Z  INFO text_generation_launcher: Waiting for shard 1 to be ready...
2023-06-23T08:55:25.045755469Z 2023-06-23T08:55:25.045575Z ERROR text_generation_launcher: Shard 0 failed to start:
2023-06-23T08:55:25.045794859Z You are using a model of type RefinedWeb to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
2023-06-23T08:55:25.045812799Z [E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 67443 milliseconds before timing out.
2023-06-23T08:55:25.045815159Z [E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
2023-06-23T08:55:25.045816199Z [E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
2023-06-23T08:55:25.045817899Z terminate called after throwing an instance of 'std::runtime_error'
2023-06-23T08:55:25.045819379Z   what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 67443 milliseconds before timing out.
2023-06-23T08:55:25.045822499Z 
2023-06-23T08:55:25.045823559Z 2023-06-23T08:55:25.045598Z  INFO text_generation_launcher: Shutting down shards
2023-06-23T08:55:25.119128939Z 2023-06-23T08:55:25.119042Z  INFO text_generation_launcher: Shard 1 terminated
2023-06-23T08:55:25.119164069Z Error: ShardCannotStart
2023-06-23T08:55:41.260562249Z 2023-06-23T08:55:41.260341Z  INFO text_generation_launcher: Args { model_id: "tiiuae/falcon-40b-instruct", revision: None, sharded: None, num_shard: Some(2), quantize: Some(Bitsandbytes), trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: false }
2023-06-23T08:55:41.260591119Z 2023-06-23T08:55:41.260374Z  INFO text_generation_launcher: Sharding model on 2 processes
2023-06-23T08:55:41.260594719Z 2023-06-23T08:55:41.260474Z  INFO text_generation_launcher: Starting download process.
2023-06-23T08:55:44.567652111Z 2023-06-23T08:55:44.567452Z  INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
2023-06-23T08:55:44.567685941Z 
2023-06-23T08:55:44.964043622Z 2023-06-23T08:55:44.963902Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-06-23T08:55:44.964142082Z 2023-06-23T08:55:44.964074Z  INFO text_generation_launcher: Starting shard 0
2023-06-23T08:55:44.964423880Z 2023-06-23T08:55:44.964368Z  INFO text_generation_launcher: Starting shard 1

Information

  • [X] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

  1. Create a 2xA100 80GB pod on runpod.io
  2. Start the 0.8.2 container with --model-id tiiuae/falcon-40b-instruct --num-shard 2

Expected behavior

I also tried loading the 7B model on the same pod with a single A100 and it worked. It took about 47 seconds to start, so I'd expect the 40B model to need considerably more time.

mzperix avatar Jun 23 '23 09:06 mzperix

I had the exact same problem. I am starting to believe it was an issue with RunPod, specifically the Canadian A100 80GB instances. My coworker was able to run the exact same configuration on 2x RunPod A100 80GB, and it worked fine.

I had also done the same thing a few weeks ago and it was working.

So it's either a transient TGI problem or a RunPod problem.

PatrickNercessian avatar Jun 23 '23 14:06 PatrickNercessian

It would be great if options._timeout = timedelta(seconds=60) could be set by the user at deployment time, with a default of 60 seconds.

maziyarpanahi avatar Jun 23 '23 16:06 maziyarpanahi

I've also experienced this, and reproduced it today with RunPod A100 80GB instances in the Romania region. Could it be a race condition due to slow connections between the GPUs, or a slower connection to the disk? When I try it with A100 80GB SXM GPUs on RunPod it works properly, but when I try it on non-SXM GPUs (also on RunPod) I get the same frozen-loading problem you've described.

ssmi153 avatar Jun 26 '23 09:06 ssmi153

Did you properly set --shm-size 1g?

Narsil avatar Jun 26 '23 09:06 Narsil

After a long back-and-forth I learned that it is not possible to set that argument on RunPod.

I will look for other providers, but in the meantime, is it at all possible that this fixed timeout may still be causing problems down the line?

mzperix avatar Jun 27 '23 13:06 mzperix

Not really. 60s for cross-GPU communication is already A LOT.

Here, allowing a longer timeout will not help, since the cards simply cannot communicate.
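
(For what it's worth, a minimal cross-GPU smoke test along these lines — purely illustrative, not part of TGI — can tell quickly whether two cards on a given pod can complete a single all-reduce at all:)

```python
# nccl_check.py -- minimal NCCL all-reduce smoke test (illustrative, not part of TGI).
# Run on the pod with: torchrun --nproc_per_node=2 nccl_check.py
import os
from datetime import timedelta

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # If this hangs or times out, the GPUs cannot talk to each other.
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=120))

    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)  # default op is SUM, so each rank should see the world size
    print(f"rank {rank}: all_reduce result = {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```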

Narsil avatar Jun 27 '23 14:06 Narsil

Was it a problem on RunPod? I guess something went wrong in their GPU clusters when requesting multiple GPUs, since it's NCCL on the backend.

kalvin1024 avatar Jun 27 '23 14:06 kalvin1024

I struggled to run falcon-7b on a Runpod server.

Did you properly set --shm-size 1g?

Does RunPod even let you set --shm-size? The server config form doesn't seem to let you run arbitrary docker run commands; it only gives you a few fields to configure (volume path, exposed ports). The next field, "Docker Command", corresponds to the COMMAND field after IMAGE in man docker run.

I set NCCL_SHM_DISABLE=1 in the environment variables menu, hoping that with just one GPU, everything would work.

(RunPod does not support Docker Compose, so there's no way to pass the setting that way either.)

b-adkins avatar Jun 28 '23 08:06 b-adkins

Any chance that this conversation can happen on RunPod instead of here?

OlivierDehaene avatar Jun 28 '23 09:06 OlivierDehaene

Coming here from RunPod to try to resolve the issue; we still have not pinned it down yet. It's not a --shm-size issue, as we set it well above 1g.

justinmerrell avatar Jul 10 '23 20:07 justinmerrell

@justinmerrell, thanks for looking into this! While you're here, where's the best place to raise other text-generation-inference x RunPod compatibility issues? Because the default RunPod containers don't include Nvidia NVCC, text-generation-inference is unable to compile its optimised CUDA kernels, so it doesn't run as fast as it otherwise could (though it's still extremely fast). Running the text-generation-inference docker container on RunPod is otherwise super easy and convenient, so it would be great to be able to get the most out of it.
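
(For anyone checking their own image, a quick and purely illustrative way to see whether kernel compilation is even possible inside a container is to look for nvcc:)

```python
# Illustrative diagnostic: can this container compile CUDA extensions at all?
import shutil

import torch

print("torch:", torch.__version__, "| CUDA runtime:", torch.version.cuda)
print("nvcc on PATH:", shutil.which("nvcc") or "not found")
```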

ssmi153 avatar Jul 11 '23 09:07 ssmi153

For RunPod-specific discussion, feel free to ping us on our Discord.

As for the original issue here, I am investigating whether it relates to IOMMU being enabled on the host machine. If that turns out to be the cause, I am surprised I do not see more comments mentioning BIOS settings.

justinmerrell avatar Jul 11 '23 14:07 justinmerrell

Runpod tech support said that by default every container has --shm-size set to 50% of the total available RAM.
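
(A quick, purely illustrative way to verify that from inside a running pod — NCCL falls back to shared memory for intra-node transport, so the size of /dev/shm is what matters:)

```python
# Illustrative check of the shared-memory segment that --shm-size controls.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**30:.1f} GiB, free={free / 2**30:.1f} GiB")
```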

askaydevs avatar Sep 22 '23 17:09 askaydevs

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar May 17 '24 01:05 github-actions[bot]