
Cannot run the 13B inference model. After loading the checkpoint, it just stops and the GPUs are still occupied.

Open • hzy312 opened this issue 1 year ago • 7 comments

Cannot run the 13B inference model. After loading the checkpoint, it just stops and the GPUs are still occupied.


hzy312 · Mar 03, 2023

Sounds like it is indeed running the inference model. It will hang there until it actually completes its inference. If the GPUs are occupied, that sounds like it's doing its work? How long have you waited?

Urammar · Mar 03, 2023

@hzy312 I am having the same problem. The script gets stuck on feeding the input into the first layer. Did you find a solution to this problem?

LucWeber · Mar 05, 2023

> Sounds like it is indeed running the inference model. It will hang there until it actually completes its inference. If the GPUs are occupied, that sounds like it's doing its work? How long have you waited?

Many minutes, and then the program just exits.

hzy312 · Mar 06, 2023

Same problem!

andrewmlu · Mar 06, 2023

It seems a potential reason is not having enough GPU memory? Even though 2x RTX 4090 and an RTX A5000, each with 24 GB, should be enough, I was only able to run it on an A6000 with 48 GB.

andrewmlu · Mar 07, 2023

@andrewmlu I don't think it has to do with the amount of VRAM, but rather something else related to the type of GPU (maybe the drivers?). For me, the problem was resolved after switching from 2x Nvidia A30 (24 GB) to smaller Tesla T4s (16 GB).

LucWeber · Mar 07, 2023

I have solved the issue (for me). As far as I understand, it has to do with peer-to-peer (P2P) communication between the GPUs via NCCL. You can deactivate P2P by setting environment variables like so:

NCCL_P2P_DISABLE='1' NCCL_IB_DISABLE='1' torchrun --nproc_per_node 4 example.py [...]

Note that this will make the GPUs communicate via shared memory, so you will have to allocate (a lot) more RAM than in the standard setup. Also, inter-GPU communication might not be as fast as before.
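
For completeness, here is a minimal sketch of the same workaround applied inside the script instead of on the command line (an assumption on my part: plain PyTorch distributed, and example.py is just the file name from the command above). The variables must be set before the NCCL process group is created, otherwise they have no effect:

# Hypothetical snippet at the top of example.py: disable NCCL peer-to-peer
# and InfiniBand transports before torch.distributed initializes NCCL.
import os

os.environ.setdefault("NCCL_P2P_DISABLE", "1")
os.environ.setdefault("NCCL_IB_DISABLE", "1")

import torch.distributed as dist

# torchrun supplies RANK, WORLD_SIZE and MASTER_ADDR, so the default env:// init works.
dist.init_process_group(backend="nccl")

The torchrun invocation itself stays the same; only the two NCCL_* prefixes on the command line become unnecessary.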

References:
- https://github.com/open-mmlab/mmdetection/issues/6534
- https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
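
If you want to confirm that P2P is the likely culprit before disabling it, a quick check is possible from PyTorch (a minimal sketch, assuming a CUDA build of PyTorch; report_p2p_access is just a made-up helper name):

# Print whether the driver reports peer access between every pair of visible GPUs.
import torch

def report_p2p_access() -> None:
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")

if __name__ == "__main__":
    report_p2p_access()

If a pair reports peer access yet collectives still hang, that matches the situation where NCCL_P2P_DISABLE=1 helped here.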

LucWeber · Mar 10, 2023

I am having the same problem. Using 2 or 4 GPUs, it just hangs after feeding the input to the model for inference.

chunhualiao · Sep 03, 2023