
Multi-GPU training problem

Mobu59 opened this issue 2 years ago • 9 comments

When I use multiple GPUs for training, it always gets stuck here. Please help me! The messages are as follows:

training args are: Namespace(batch_size=4, check_images=False, check_labels=False, conf_file='configs/yolov6_tiny_head_det.py', data_path='data/head_det.yaml', device='4,5,6,7', dist_url='tcp://127.0.0.1:8888', epochs=400, gpu_count=0, img_size=640, local_rank=0, name='exp', noval=False, output_dir='./runs/train', rank=0, workers=8, world_size=4)

Using 4 GPU for training... Initializing process group...

After that it just hangs for a long time and never proceeds.
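
For context, distributed training for this kind of script is normally launched with one process per GPU through torch.distributed; a minimal sketch of such a launch is below (the tools/train.py path and the exact flag spellings are assumptions mirroring the Namespace fields above, not taken from this issue):

```
# Sketch only: one process per GPU via torch.distributed.launch.
# Flag names (--batch, --conf, --data, --device) are assumptions based on the Namespace fields.
python -m torch.distributed.launch --nproc_per_node 4 \
    tools/train.py \
    --batch 4 \
    --conf configs/yolov6_tiny_head_det.py \
    --data data/head_det.yaml \
    --device 4,5,6,7
```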

Mobu59 avatar Jun 30 '22 07:06 Mobu59

Thanks for your attention. We will try our best to solve your problem, but more concrete information is needed to reproduce it. The issue you describe depends heavily on your hardware, so please provide more specific error information.

meituan-gengyifei avatar Jun 30 '22 07:06 meituan-gengyifei

I ran into the same problem; I can run it with a single GPU. (screenshots attached)

Guan-LinHe avatar Jun 30 '22 07:06 Guan-LinHe

@GuanLinHu Same here: I can train with a single GPU, but when I use multiple GPUs I hit the problem above!

Mobu59 avatar Jun 30 '22 07:06 Mobu59

@Mobu59 I tried the approach that the author of YOLOv5 said should work, but it did not help. (screenshot attached)

Guan-LinHe avatar Jun 30 '22 07:06 Guan-LinHe

@Mobu59 @GuanLinHu Sorry, I haven't been able to reproduce your problem yet. If there is nothing wrong with the data, it may be caused by multi-threading and a deadlock. You could try OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python train.py, or set --workers 0, to check whether the problem still exists (see the sketch below).
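
Spelled out as commands, the two workarounds above would look roughly like this (a sketch only; the --workers flag is taken from the Namespace in the first post and its exact spelling may differ):

```
# Limit OpenMP/MKL to a single thread each to rule out a threading deadlock
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python train.py

# Or disable DataLoader worker processes entirely
python train.py --workers 0
```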

meituan-gengyifei avatar Jun 30 '22 08:06 meituan-gengyifei

What's wrong in my case? I can't use multi-GPU training either, and I already added NCCL_P2P_LEVEL=0 before python train.py. (screenshot attached)
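
If NCCL_P2P_LEVEL=0 on its own does not help, enabling NCCL's own logging is a common way to see where the process group setup stalls; a sketch (train.py arguments omitted):

```
# NCCL_DEBUG=INFO prints NCCL initialization details to the console;
# NCCL_P2P_DISABLE=1 turns off the peer-to-peer transport entirely.
NCCL_DEBUG=INFO NCCL_P2P_DISABLE=1 python train.py
```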

RooKichenn avatar Jun 30 '22 10:06 RooKichenn

I also ran into this problem. After changing tcp:// to "env://", training works normally.
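
A sketch of what that change looks like at launch time, assuming the dist_url value from the Namespace above is exposed as a --dist_url flag (the flag spelling is an assumption):

```
# env:// makes init_process_group read MASTER_ADDR, MASTER_PORT, RANK and
# WORLD_SIZE from environment variables, which torch.distributed.launch sets.
python -m torch.distributed.launch --nproc_per_node 4 train.py --dist_url env://
```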

Caius-Lu avatar Jul 13 '22 07:07 Caius-Lu

It is probably an issue with the torch version; I am using 1.9.1.
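
If the hang really is version-related, pinning torch to the version reported to work is a quick check (1.9.1 here; the matching torchvision version is an assumption):

```
pip install torch==1.9.1 torchvision==0.10.1
```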

Caius-Lu avatar Jul 13 '22 07:07 Caius-Lu

I have the problem of not being able to train the model on multiple GPUs on Windows. Has anyone managed multi-GPU training there?

anhuong98 avatar Jul 21 '22 13:07 anhuong98