text-generation-inference
Configurable NCCL timeout
Feature request
Is there a way to make the NCCL timeout configurable as we often get timeout problems with the starcoder model?
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1288224, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 66470 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
https://github.com/huggingface/text-generation-inference/blob/5a1512c0253e759fb07142029127292d639ab117/server/text_generation_server/utils/dist.py#L53
Motivation
This would fix the timeout problems we are seeing with NCCL.
Your contribution
It is quite easy to make it configurable with an environment variable, e.g. NCCL_TIMEOUT. If that is OK, I can create a PR.
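A minimal sketch of what the change could look like (NCCL_TIMEOUT is the proposed variable name, not an existing one, and this is not the exact code in dist.py, just the shape of the change): read the timeout from the environment and fall back to the current 60 seconds.

```python
# Sketch of the proposed change (not the actual dist.py code): make the
# process-group timeout configurable via a NCCL_TIMEOUT env variable.
import os
from datetime import timedelta

import torch

# Fall back to 60 seconds, matching the Timeout(ms)=60000 in the log above.
timeout_s = int(os.getenv("NCCL_TIMEOUT", "60"))

torch.distributed.init_process_group(
    backend="nccl",
    world_size=int(os.getenv("WORLD_SIZE", "1")),
    rank=int(os.getenv("RANK", "0")),
    timeout=timedelta(seconds=timeout_s),
)
```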
Increasing the timeout will only make you crash later; it will not fix the issue. When do you have this issue?
Thanks @OlivierDehaene, it is quite random, with no specific input leading to it. The Docker container crashes after that.
Usually NCCL times out because one of the shards OOMs and the other shards end up waiting indefinitely for the OOMed shard. Can you check if this is what is happening in your case?
The solution then is to decrease max-batch-total-tokens.
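A minimal sketch of that failure mode (not TGI code; assumes two GPUs, torchrun, and a recent PyTorch where async NCCL error handling is on by default): rank 0 simulates a shard that OOMed and never reaches the collective, so rank 1's all_reduce never completes and the NCCL watchdog aborts it after the configured timeout, producing the same error as in the log above.

```python
# Minimal repro sketch of the hang described above (not TGI code).
# Run with: torchrun --nproc_per_node=2 repro.py  (requires 2 GPUs)
import os
import time
from datetime import timedelta

import torch
import torch.distributed as dist


def main() -> None:
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    torch.cuda.set_device(rank)
    # Short timeout so the watchdog fires quickly; the log above shows 60s.
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        timeout=timedelta(seconds=30),
    )

    x = torch.ones(1, device="cuda")
    if rank == 0:
        # Simulate a shard that OOMed / crashed and never joins the collective.
        time.sleep(120)
    else:
        # No matching all_reduce ever arrives from rank 0, so the NCCL
        # watchdog aborts this collective after ~30s with a
        # "Watchdog caught collective operation timeout" error.
        dist.all_reduce(x)
        torch.cuda.synchronize()


if __name__ == "__main__":
    main()
```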
I faced this issue today as well. After digging around for some time, I found that setting the environment variable NCCL_P2P_DISABLE=1 fixes the issue. Try it and see if it works for you.
Thank you so much, let me try it
I'm having the same issue, and I can't quite figure it out.
docker run --gpus '"device=0,1"' --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN -p 8000:8000 -v /mnt/machinelearning/:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id meta-llama/Llama-2-7b-chat-hf --sharded true
I'm a bit out of my depth here, but here's what I found so far:
Both (or all) GPUs reach 100% utilization and memory usage increases a bit, but never more than that, so it doesn't seem like an OOM issue.
It happens on our H100 PCIe; unfortunately I have nothing else to compare to. From what I can tell, P2P should be working fine and throughput is high, so disabling it neither solves the issue nor seems sensible for us. I've tried setting various specific NCCL_P2P_LEVEL values, with no success.
Model size seems to have no impact; as you can see, this happens even with models that should easily fit on a single GPU.
What type of server is it? How are the H100s inter-connected?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.