YOLOv6
Multi-GPU training problem
When I use multiple GPUs for training, it always gets stuck here. Please help me! The messages are as follows:
training args are: Namespace(batch_size=4, check_images=False, check_labels=False, conf_file='configs/yolov6_tiny_head_det.py', data_path='data/head_det.yaml', device='4,5,6,7', dist_url='tcp://127.0.0.1:8888', epochs=400, gpu_count=0, img_size=640, local_rank=0, name='exp', noval=False, output_dir='./runs/train', rank=0, workers=8, world_size=4)
Using 4 GPU for training... Initializing process group...
After that it hangs for a long time and never proceeds.
Thanks for your attention. We will try our best to solve your problem, but more concrete information is necessary to reproduce it. The problem you mentioned depends heavily on your hardware, so please provide more specific error information.
I ran into the same problem; I can run training with a single GPU.
@GuanLinHu Same here. I can train with a single GPU, but when I use multiple GPUs I hit the problem above!
@Mobu59 I tried the method that the author of YOLOv5 said should work, but it did not help.
@Mobu59 @GuanLinHu Sorry, I haven't been able to reproduce your problem yet. If there is nothing wrong with the data, it may be caused by multi-threading and a deadlock. You could try OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python train.py, or set --workers 0, and check whether the problem still exists.
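For reference, here is a minimal sketch (with placeholder data, not the YOLOv6 data pipeline) of what that suggestion amounts to: force single-threaded math libraries before they are imported, and load batches in the main process, which is the equivalent of --workers 0:

```python
# Minimal sketch, not the YOLOv6 loader: rule out DataLoader-worker deadlocks
# by limiting library threads and using zero worker processes.
import os
os.environ["OMP_NUM_THREADS"] = "1"   # same effect as OMP_NUM_THREADS=1 on the command line
os.environ["MKL_NUM_THREADS"] = "1"   # same effect as MKL_NUM_THREADS=1

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(8, 3, 640, 640))       # dummy images
loader = DataLoader(dataset, batch_size=4, num_workers=0)  # num_workers=0 == --workers 0
for (batch,) in loader:
    pass  # a real training step would go here
```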
What could be wrong in my case? I can't use multi-GPU training either, and I already added NCCL_P2P_LEVEL=0 before python train.py.
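In case it helps with debugging, here is a hypothetical wrapper script (not part of YOLOv6) that exports the NCCL variables in the parent process, plus NCCL_DEBUG=INFO so the stall at least prints NCCL setup logs, before handing off to train.py:

```python
# Hypothetical wrapper, assuming train.py is in the current directory:
# export NCCL settings so every spawned training process inherits them.
import os
import subprocess
import sys

env = dict(os.environ)
env.setdefault("NCCL_P2P_LEVEL", "0")  # disable GPU peer-to-peer, as in the comment above
env.setdefault("NCCL_DEBUG", "INFO")   # print NCCL setup logs to see where it stalls

# Forward any extra command-line arguments to train.py unchanged.
subprocess.run([sys.executable, "train.py", *sys.argv[1:]], env=env, check=True)
```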
I also ran into this problem. After changing tcp:// to "env://", training works normally.
It is probably a torch version issue; I am using version 1.9.1.
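For anyone hitting the same hang, here is a minimal sketch of the "env://" style of initialization (not the YOLOv6 code path), assuming the script is launched with torchrun or torch.distributed.launch --use_env, which export MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and LOCAL_RANK. This is what replacing tcp://127.0.0.1:8888 with "env://" amounts to:

```python
# Minimal sketch: initialize the process group from environment variables
# instead of a hard-coded tcp:// address.
import os
import torch
import torch.distributed as dist

def init_distributed():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher
    torch.cuda.set_device(local_rank)
    # "env://" reads MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE from the environment.
    dist.init_process_group(backend="nccl", init_method="env://")
    return local_rank

if __name__ == "__main__":
    local_rank = init_distributed()
    print(f"rank {dist.get_rank()} / {dist.get_world_size()} ready on GPU {local_rank}")
    dist.destroy_process_group()
```

Launched with something like torchrun --nproc_per_node 4 this_script.py, each rank should print its line; if that works, the rendezvous itself is fine and the hang is elsewhere.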
I had the problem of not being able to train the model on multiple GPUs on Windows. Has anyone trained the model on multiple GPUs?