
Configurable NCCL timeout

Open tienthanhdhcn opened this issue 1 year ago • 7 comments

Feature request

Is there a way to make the NCCL timeout configurable, as we often get timeout problems with the starcoder model?

[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1288224, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 66470 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.

https://github.com/huggingface/text-generation-inference/blob/5a1512c0253e759fb07142029127292d639ab117/server/text_generation_server/utils/dist.py#L53

Motivation

To fix the NCCL timeout problem described above.

Your contribution

It is quite easy to make it configurable with an environment variable, e.g. NCCL_TIMEOUT. If that is OK, I can create a PR.
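For illustration, here is a minimal sketch of the kind of change proposed: read an optional NCCL_TIMEOUT environment variable (in seconds) and pass it to torch.distributed.init_process_group. The function name, structure, and 60-second default below are assumptions for the example, not the actual contents of dist.py.

import os
from datetime import timedelta

import torch

# Hypothetical: allow overriding the collective-operation timeout via
# NCCL_TIMEOUT (seconds); fall back to the 60-second default seen in the
# log above (Timeout(ms)=60000).
NCCL_TIMEOUT = int(os.getenv("NCCL_TIMEOUT", "60"))

def initialize_torch_distributed(rank: int, world_size: int):
    # torch.distributed.init_process_group accepts a timedelta timeout that
    # bounds how long collective operations (e.g. ALLREDUCE) may block.
    torch.distributed.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        timeout=timedelta(seconds=NCCL_TIMEOUT),
    )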

tienthanhdhcn avatar Jul 20 '23 02:07 tienthanhdhcn

Increasing the timeout will only make you crash later; it will not fix the underlying issue. When do you hit this issue?

OlivierDehaene avatar Jul 20 '23 06:07 OlivierDehaene

Thanks @OlivierDehaene, it is quite random, with no specific input leading to it. The Docker container crashes after that.

tienthanhdhcn avatar Jul 20 '23 07:07 tienthanhdhcn

Usually NCCL times out because one of the shards OOMs and the other shards end up waiting indefinitely for the OOMed shard. Can you check whether this is what is happening in your case? If so, the solution is to decrease max-batch-total-tokens.

OlivierDehaene avatar Jul 20 '23 07:07 OlivierDehaene

I faced this issue today as well. After digging around for some time, I found that setting the environment variable NCCL_P2P_DISABLE=1 fixes the issue. Try it and see if it works for you.

rishu931997 avatar Jul 21 '23 12:07 rishu931997

Thank you so much, let me try it

tienthanhdhcn avatar Jul 25 '23 05:07 tienthanhdhcn

I'm having the same issue, and I can't quite figure it out.

docker run --gpus '"device=0,1"' --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN -p 8000:8000 -v /mnt/machinelearning/:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id meta-llama/Llama-2-7b-chat-hf --sharded true

I'm a bit out of my depth here, but here's what I found so far:

Both (or all) GPUs reach 100% utilization and memory usage increases a bit, but never beyond that, so it doesn't seem like an OOM issue.


It happens on our H100 PCIe cards; unfortunately I have nothing else to compare against. From what I can tell, P2P should be working fine and throughput is high, so disabling it neither solves the issue nor seems sensible for us. I've also tried setting various specific NCCL_P2P_LEVEL values, with no success.

Model size seems to have no impact; as you can see, this happens even with models that should easily fit on a single GPU.

stefanobranco avatar Aug 30 '23 15:08 stefanobranco

What type of server is it? How are the H100s interconnected?

OlivierDehaene avatar Sep 06 '23 13:09 OlivierDehaene

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Apr 26 '24 01:04 github-actions[bot]