
How to start multi-gpus training in a single machine

Open kevinhuangxf opened this issue 11 months ago • 1 comment

Thanks for the excellent work!

I ran into a problem starting multi-GPU training. I have 8 GPUs, but each time I run the training command, only one GPU is used:


I am using this command:

python -m src.main +experiment=re10k data_loader.train.batch_size=14

Does this mean that even when training on a single node with multiple GPUs, I still need to use SLURM to launch multi-GPU training?

kevinhuangxf avatar Jan 26 '25 08:01 kevinhuangxf

Hi @kevinhuangxf, thanks for the kind words. Normally, the current setup should automatically use all available GPUs for training, so I'm not sure what is causing this issue. You could try explicitly specifying the training devices to use all GPUs by following the instructions here.
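As a sketch of what "explicitly specifying the training devices" could look like: assuming the project's Hydra config exposes a PyTorch Lightning-style `trainer` group (the override names below are assumptions; check the actual config files in the repo), you might try something like:

```shell
# NOTE: trainer.accelerator / trainer.devices / trainer.strategy are
# hypothetical Hydra override names -- verify them against the repo's configs.

# Ask Lightning to use every visible GPU with DDP:
python -m src.main +experiment=re10k \
    data_loader.train.batch_size=14 \
    trainer.accelerator=gpu \
    trainer.devices=-1 \
    trainer.strategy=ddp

# Alternatively, restrict which cards are visible at the environment level;
# this works regardless of the config layout:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m src.main +experiment=re10k \
    data_loader.train.batch_size=14
```

If only one GPU is still used, it is also worth checking what `torch.cuda.device_count()` reports inside the training environment, since a stale `CUDA_VISIBLE_DEVICES` or a CPU-only PyTorch build would both silently reduce the visible device count to one or zero.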

donydchen avatar Feb 18 '25 02:02 donydchen