text-generation-inference
Configurable NCCL timeout
Feature request
Is there a way to make the NCCL timeout configurable as we often get timeout problems with the starcoder model?
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1288224, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 66470 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
https://github.com/huggingface/text-generation-inference/blob/5a1512c0253e759fb07142029127292d639ab117/server/text_generation_server/utils/dist.py#L53
Motivation
This would fix the timeout problems we are seeing with NCCL.
Your contribution
It is quite easy to make it configurable with an environment variable, e.g. NCCL_TIMEOUT. If that is OK, I can create a PR.
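A minimal sketch of what the change could look like (NCCL_TIMEOUT is the proposed variable name, not an existing one, and this is not the exact code in dist.py, just the shape of the change): read the timeout from the environment and fall back to the current 60 seconds.

```python
# Sketch of the proposed change (not the actual dist.py code): make the
# process-group timeout configurable via a NCCL_TIMEOUT env variable.
import os
from datetime import timedelta

import torch

# Fall back to 60 seconds, matching the Timeout(ms)=60000 in the log above.
timeout_s = int(os.getenv("NCCL_TIMEOUT", "60"))

torch.distributed.init_process_group(
    backend="nccl",
    world_size=int(os.getenv("WORLD_SIZE", "1")),
    rank=int(os.getenv("RANK", "0")),
    timeout=timedelta(seconds=timeout_s),
)
```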
Increasing the timeout will only make you crash later; it will not fix the issue. When do you have this issue?
Thanks @OlivierDehaene, it is quite random, with no specific input leading to it. The Docker container crashes after that.
Usually NCCL times out because one of the shards OOMs and the other shards end up waiting indefinitely for the OOMed shard. Can you check if this is what is happening in your case?
The solution then is to decrease max-batch-total-tokens.
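A minimal sketch of that failure mode (not TGI code; assumes two GPUs, torchrun, and a recent PyTorch where async NCCL error handling is on by default): rank 0 simulates a shard that OOMed and never reaches the collective, so rank 1's all_reduce never completes and the NCCL watchdog aborts it after the configured timeout, producing the same error as in the log above.

```python
# Minimal repro sketch of the hang described above (not TGI code).
# Run with: torchrun --nproc_per_node=2 repro.py  (requires 2 GPUs)
import os
import time
from datetime import timedelta

import torch
import torch.distributed as dist


def main() -> None:
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    torch.cuda.set_device(rank)
    # Short timeout so the watchdog fires quickly; the log above shows 60s.
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        timeout=timedelta(seconds=30),
    )

    x = torch.ones(1, device="cuda")
    if rank == 0:
        # Simulate a shard that OOMed / crashed and never joins the collective.
        time.sleep(120)
    else:
        # No matching all_reduce ever arrives from rank 0, so the NCCL
        # watchdog aborts this collective after ~30s with a
        # "Watchdog caught collective operation timeout" error.
        dist.all_reduce(x)
        torch.cuda.synchronize()


if __name__ == "__main__":
    main()
```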
I faced this issue today as well. After digging around for some time, I found that setting the environment variable NCCL_P2P_DISABLE=1 fixes the issue. Try it and see if it works for you.
Thank you so much, let me try it
I'm having the same issue, and I can't quite figure it out.
docker run --gpus '"device=0,1"' --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN -p 8000:8000 -v /mnt/machinelearning/:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id meta-llama/Llama-2-7b-chat-hf --sharded true
I'm a bit out of my depth here, but here's what I found so far:
Both (or all) GPUs reach 100% utilization and memory usage increases a bit, but never more than that, so it doesn't seem like an OOM issue.
It happens on our H100 PCIe; unfortunately I have nothing else to compare to. From what I can tell, P2P should be working fine and throughput is high, so disabling it neither solves the issue nor seems sensible for us. I've tried setting various specific NCCL_P2P_LEVEL values, with no success.
Model size seems to have no impact; as you can see, this happens even with models that should easily fit on a single GPU.
What type of server is it? How are the H100s inter-connected?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.