YOLOv6

Multi-node, multi-GPU training issue

Open FL77N opened this issue 2 years ago • 5 comments

Hello, when training on multiple machines with multiple GPUs, I ran into the following error. Is there something I need to change? RuntimeError: Address already in use


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


FL77N avatar Jul 14 '22 11:07 FL77N

Add --master_port 30001 (or any other unused port) to the launch command, for example:

python -m torch.distributed.launch --nproc_per_node 8 --master_port 30002  tools/train.py ...
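
If the error persists after changing the port, it may help to confirm that the chosen port is actually free on the master node before launching. The following is a minimal sketch using only the Python standard library; the helper names port_is_free and find_free_port are hypothetical and not part of YOLOv6 or PyTorch. The check is inherently racy (another process can grab the port between the check and the launch), and in a multi-node run every node must be given the same port, so it should be run on the master node.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use": something else holds the port.
            return False
        return True


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    candidate = 30002  # the port used in the example command above
    port = candidate if port_is_free(candidate) else find_free_port()
    print(f"--master_port {port}")
```

The printed flag can then be pasted into the launch command. If the candidate port is reported as busy, a leftover process from an earlier, crashed training run is one possible cause.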

mtjhl avatar Jul 14 '22 12:07 mtjhl

I tried this, but ran into the same problem.

FL77N avatar Jul 14 '22 13:07 FL77N

I can train with a single GPU.

FL77N avatar Jul 14 '22 13:07 FL77N

My environment is torch 1.8.1+cuda90.cudnn7.6.5, Python 3.6. Could that be affecting this?

FL77N avatar Jul 15 '22 01:07 FL77N

That generally shouldn't matter. Could you share the nvidia-smi output and a screenshot of the full error?

shensheng272 avatar Jul 29 '22 04:07 shensheng272