llama
Cannot run the 13B inference model. After loading the checkpoint, it just stops, and the GPUs are still occupied.

Sounds like it is indeed running the inference model. It will hang there until it actually completes its inference. If the GPUs are occupied, that sounds like it is doing its work. How long have you waited?
@hzy312 I am having the same problem. The script gets stuck when feeding the input into the first layer. Did you find a solution to this problem?
I waited many minutes, and eventually the program just exited.
Same problem!
It seems one potential cause is not having enough GPU memory? Even though 2x RTX 4090 and an RTX A5000, each with 24 GB, should be enough, I was only able to run it on an A6000 with 48 GB.
@andrewmlu
I don't think it has to do with the amount of VRAM; it seems to be something else related to the type of GPU (maybe the drivers?).
For me, the problem was resolved after switching from 2x NVIDIA A30 (24 GB) to smaller Tesla T4s (16 GB).
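If it helps narrow this down, here is a minimal diagnostic sketch (my own illustration, not part of the llama code; it assumes PyTorch with CUDA is installed) that asks PyTorch whether each pair of visible GPUs reports peer-to-peer access. If it does not, the hang may be an inter-GPU communication issue rather than a VRAM limit.

```python
# Diagnostic sketch: print whether PyTorch reports peer-to-peer (P2P) access
# between every pair of visible GPUs.
import torch

if __name__ == "__main__":
    num_gpus = torch.cuda.device_count()
    for i in range(num_gpus):
        for j in range(num_gpus):
            if i == j:
                continue
            ok = torch.cuda.can_device_access_peer(i, j)
            status = "available" if ok else "NOT available"
            print(f"GPU {i} -> GPU {j}: peer access {status}")
```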
I have solved the issue (for me). As far as I understand, it has to do with peer-to-peer (P2P) communication between the GPUs via NCCL. One can deactivate P2P by setting environment variables like so:
NCCL_P2P_DISABLE='1' NCCL_IB_DISABLE='1' torchrun --nproc_per_node 4 example.py [...]
Note that this will make the GPUs communicate via shared memory, so you will have to allocate (a lot) more RAM than in the standard setup. Also, inter-GPU communication might not be as fast as before.
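For reference, here is a minimal sketch (my own illustration, not from the official llama example) of setting the same NCCL variables from inside the entry script instead of on the command line. They must be set before NCCL is initialized, so they go at the very top of the file, before any distributed setup.

```python
# Sketch: disable NCCL P2P and InfiniBand transports from inside the script.
# Equivalent to prefixing the torchrun command with
# NCCL_P2P_DISABLE='1' NCCL_IB_DISABLE='1'.
import os

os.environ.setdefault("NCCL_P2P_DISABLE", "1")
os.environ.setdefault("NCCL_IB_DISABLE", "1")

import torch
import torch.distributed as dist

def init_distributed() -> None:
    # torchrun provides RANK, WORLD_SIZE and LOCAL_RANK in the environment,
    # so the default env:// rendezvous picks them up here.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

if __name__ == "__main__":
    init_distributed()
    print(f"rank {dist.get_rank()} / {dist.get_world_size()} initialized")
```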
References:
https://github.com/open-mmlab/mmdetection/issues/6534
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
I am having the same problem. With 2 or 4 GPUs, it just hangs after feeding the input to the model for inference.