multi gpu still stuck

Open Amanda-Barbara opened this issue 5 years ago • 11 comments

Hi, I tried your newest version of MobileNet-YOLO to train with multiple GPUs, but the GPUs still hang and training stops at this step:

I0724 04:24:32.355298 8003 solver.cpp:203] Creating test net (#0) specified by test_net file: models/yolov3/head_mobilenet_yolov3_lite_test.prototxt

Do you have any idea? Thanks @eric612

Amanda-Barbara, Jul 24 '19 08:07

Try changing the batch size in the test prototxt.

eric612, Aug 03 '19 02:08

I have the same problem. Could you tell me what batch size you changed it to?

solomon-ma, Sep 27 '19 15:09

I think a batch size of 1 cannot be split across GPUs in the test phase, so you can disable the test phase and start training.

eric612, Oct 03 '19 05:10

I tried batch size = 1, but it also got stuck.

I found that your project is forked from caffe-ssd, which also gets stuck with multi-GPU training. But I tried the Caffe source code from BVLC, and it can run on multiple GPUs with NCCL. I also tried the Caffe written by yjxiong, which uses OpenMPI for the multi-GPU work.

I'll try to use the BVLC code to rewrite caffe-mobilenet-yolo. Could you help me if I run into problems?

solomon-ma, Oct 03 '19 05:10

Unfortunately, I don't have a multi-GPU computer or environment :(

So it is really hard for me to help. Maybe you can look at this issue: https://github.com/eric612/MobileNet-YOLO/issues/28

eric612, Oct 03 '19 05:10

@solomon-ma @Amanda-Barbara, I also encountered the same problem. I changed the batch_size to 4 (the same as my number of GPUs), but it still stopped at "Creating test net (#0) specified by test_net file". Have you solved this problem? If so, can you tell me how?

TccccD, Oct 14 '19 08:10

Hi guys, I also have the same problem, even when I use the MNIST example:

./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt --gpu 0,1

jerryho-quanta, Oct 18 '19 02:10

> I think a batch size of 1 cannot be split across GPUs in the test phase, so you can disable the test phase and start training.

Thanks for your great work! Yes, you are right: training on multiple GPUs works after disabling the testing phase. But I am still confused about why it stops in the testing phase, even when I set the testing batch size to 4 (I am using 2 GPUs).

RamatovInomjon, Dec 04 '19 02:12

Refer to this issue: https://github.com/eric612/MobileNet-YOLO/issues/198

eric612, Dec 04 '19 02:12

Hi, could you tell me the steps to disable the test phase? Thanks.

guagua11, Feb 06 '20 00:02

@guagua11 Remove these lines from the solver: https://github.com/eric612/MobileNet-YOLO/blob/master/models/mobilenetv2_voc/yolo_lite/solver.prototxt#L2-L4

eric612, Feb 06 '20 01:02
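
For anyone who wants to script the workaround above, here is a minimal sketch in Python. It assumes the lines to remove are the solver's test-related fields (`test_net`, `test_iter`, `test_interval`, and optionally `test_initialization`, which are fields of Caffe's `SolverParameter`). The exact contents of lines 2-4 in the linked solver are not shown in this thread, so check your own file first. The helper name `strip_test_phase` and the sample values are made up for illustration. Instead of deleting the lines, it comments them out with `#` so the change is easy to revert.

```python
import re

# Solver fields that configure the test phase (assumed to be what the
# linked lines contain; verify against your own solver.prototxt).
TEST_FIELDS = ("test_net", "test_iter", "test_interval", "test_initialization")

# Match a line that starts with one of the fields followed by a colon.
_TEST_LINE = re.compile(r"^\s*(%s)\s*:" % "|".join(TEST_FIELDS))


def strip_test_phase(solver_text: str) -> str:
    """Return the solver text with test-related lines commented out."""
    out = []
    for line in solver_text.splitlines():
        if _TEST_LINE.match(line):
            out.append("# " + line)  # keep the line, but disable it
        else:
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical solver contents, for illustration only.
    sample = (
        'train_net: "train.prototxt"\n'
        'test_net: "test.prototxt"\n'
        "test_iter: 100\n"
        "test_interval: 1000\n"
        "base_lr: 0.0005\n"
    )
    print(strip_test_phase(sample), end="")
```

To apply it, read your solver file, pass the text through `strip_test_phase`, write the result to a new file, and point `--solver` at that file. This only skips testing during training; it does not explain or fix the hang itself.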