
16 GPUs per node

Open spcrobocar opened this issue 1 year ago • 4 comments

Hi, my system has 16 GPUs per node. However, if I run `torchrun --standalone --nproc_per_node=16 train.py config/train_gpt2.py`, the training crashes. How can I use all 16 GPUs?

spcrobocar avatar Jan 16 '24 04:01 spcrobocar
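For context on what `--nproc_per_node=16` asks for: torchrun spawns that many processes and hands each one a `LOCAL_RANK` environment variable, and nanoGPT's `train.py` binds each process to `cuda:LOCAL_RANK`, so all of `cuda:0` through `cuda:15` must actually be usable. A minimal sketch of that mapping (the helper name is mine, not from `train.py`):

```python
import os

def ddp_device_from_env(env=None):
    """Return the CUDA device a torchrun-spawned process should use.

    torchrun sets LOCAL_RANK to 0..nproc_per_node-1 for each process it
    launches; nanoGPT's train.py derives its device the same way, so
    --nproc_per_node=16 requires GPUs cuda:0 through cuda:15 to exist.
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return f"cuda:{local_rank}"

# Each of the 16 processes would see a different LOCAL_RANK:
print(ddp_device_from_env({"LOCAL_RANK": "0"}))   # cuda:0
print(ddp_device_from_env({"LOCAL_RANK": "15"}))  # cuda:15
```

If any of those 16 device indices is missing or unusable (e.g. a GPU on a failed PCIe link), the corresponding process will crash at startup.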

16 GPUs per node? Wouldn't you have 2 nodes of 8 GPUs each? Also, which GPUs are they?

VatsaDev avatar Jan 17 '24 19:01 VatsaDev

@spcrobocar you have an old mining rig with 16 GPUs connected over 1x PCIe, right?

a0s avatar Jan 18 '24 21:01 a0s

If your case matches what @a0s described, you may need to split the GPUs into smaller groups and launch each group as its own node. If that doesn't work, the hardware capacity (e.g. PCIe bandwidth) may be the limiting factor.

sofyanox12 avatar Jan 21 '24 14:01 sofyanox12
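One concrete (untested) reading of the "split into groups" suggestion: treat the single 16-GPU machine as two 8-GPU "nodes", launching one torchrun group per terminal. The rendezvous port 29500 and the even GPU split are arbitrary choices for illustration, not from this thread:

```shell
# Terminal 1: "node" 0 sees GPUs 0-7
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun \
  --nnodes=2 --node_rank=0 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=localhost:29500 \
  train.py config/train_gpt2.py

# Terminal 2: "node" 1 sees GPUs 8-15
CUDA_VISIBLE_DEVICES=8,9,10,11,12,13,14,15 torchrun \
  --nnodes=2 --node_rank=1 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=localhost:29500 \
  train.py config/train_gpt2.py
```

Note this only changes how processes are grouped for rendezvous; if the underlying issue is a bad GPU or starved 1x PCIe links, the crash may simply move rather than disappear.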

Are you able to use 2 GPUs?

WeileiZeng avatar Jun 13 '24 02:06 WeileiZeng
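For that smoke test, a scaled-down version of the original command (same flags, just 2 processes) isolates whether DDP works at all before scaling to 16:

```shell
# Sanity check: run nanoGPT's DDP training on 2 GPUs first
torchrun --standalone --nproc_per_node=2 train.py config/train_gpt2.py
```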