llama
Cannot run the 13B inference model. After loading the checkpoint, it just stops, and the GPUs are still occupied.

Sounds like it is indeed running the inference model. It will hang there until it actually completes its inference. If the GPUs are occupied, that sounds like it is doing its work. How long have you waited?
@hzy312 I am having the same problem. The script gets stuck when feeding the input into the first layer. Did you find a solution to this problem?
I waited many minutes, and eventually the program just exited.
Same problem!
It seems one potential cause is not having enough GPU memory? Even though 2x RTX 4090 and an RTX A5000, each with 24 GB, should be enough, I was only able to run it on an A6000 with 48 GB.
@andrewmlu
I don't think it has to do with the amount of VRAM; it seems to be something else related to the type of GPU (maybe the drivers?).
For me, the problem was resolved after switching from 2x NVIDIA A30 (24 GB) to smaller Tesla T4s (16 GB).
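If it helps narrow this down, here is a minimal diagnostic sketch (my own illustration, not part of the llama code; it assumes PyTorch with CUDA is installed) that asks PyTorch whether each pair of visible GPUs reports peer-to-peer access. If it does not, the hang may be an inter-GPU communication issue rather than a VRAM limit.

```python
# Diagnostic sketch: print whether PyTorch reports peer-to-peer (P2P) access
# between every pair of visible GPUs.
import torch

if __name__ == "__main__":
    num_gpus = torch.cuda.device_count()
    for i in range(num_gpus):
        for j in range(num_gpus):
            if i == j:
                continue
            ok = torch.cuda.can_device_access_peer(i, j)
            status = "available" if ok else "NOT available"
            print(f"GPU {i} -> GPU {j}: peer access {status}")
```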
I have solved the issue (for me). As far as I understand, it has to do with peer-to-peer (P2P) communication between the GPUs via NCCL. One can deactivate P2P by setting environment variables like so:
NCCL_P2P_DISABLE='1' NCCL_IB_DISABLE='1' torchrun --nproc_per_node 4 example.py [...]
Note that this will make the GPUs communicate via shared memory, so you will have to allocate (a lot) more RAM than in the standard setup. Also, inter-GPU communication might not be as fast as before.
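For reference, here is a minimal sketch (my own illustration, not from the official llama example) of setting the same NCCL variables from inside the entry script instead of on the command line. They must be set before NCCL is initialized, so they go at the very top of the file, before any distributed setup.

```python
# Sketch: disable NCCL P2P and InfiniBand transports from inside the script.
# Equivalent to prefixing the torchrun command with
# NCCL_P2P_DISABLE='1' NCCL_IB_DISABLE='1'.
import os

os.environ.setdefault("NCCL_P2P_DISABLE", "1")
os.environ.setdefault("NCCL_IB_DISABLE", "1")

import torch
import torch.distributed as dist

def init_distributed() -> None:
    # torchrun provides RANK, WORLD_SIZE and LOCAL_RANK in the environment,
    # so the default env:// rendezvous picks them up here.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

if __name__ == "__main__":
    init_distributed()
    print(f"rank {dist.get_rank()} / {dist.get_world_size()} initialized")
```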
References:
https://github.com/open-mmlab/mmdetection/issues/6534
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
I am having the same problem. With 2 or 4 GPUs, it just hangs after feeding the input to the model for inference.