Training stuck at iteration 0
I'm trying to train the models on PASCAL with two 1080 Tis, and I've changed 'gpus' to "0,1". However, training gets stuck at iteration 0, with GPU usage at 100% on one GPU and 0% on the other. How can I solve this issue? Thanks.
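For context, the only change I made is the GPU list in the ssd_pascal.py-style training script. A rough sketch of the relevant settings is below (variable names follow the SSD example script and may differ slightly in other copies); the per-GPU batch size is derived from the GPU list:

import math

# Sketch of the multi-GPU settings in an ssd_pascal.py-style script (assumed names).
gpus = "0,1"                      # changed from the single-GPU default to use both 1080 Tis
gpulist = gpus.split(",")
num_gpus = len(gpulist)

batch_size = 32                   # total mini-batch, split across the GPUs
batch_size_per_device = batch_size
if num_gpus > 0:
    # With batch_size = 32 and two GPUs this gives 16 per device, which matches
    # the "output data size: 16,3,300,300" line in the log below.
    batch_size_per_device = int(math.ceil(float(batch_size) / num_gpus))

# The script then launches the caffe binary with the same GPU list,
# e.g. caffe train --solver=... --weights=... --gpu=0,1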
It is loading an existing model snapshot and doing evaluation.
But this happens even if I start from scratch, and it stays stuck forever as long as I use two GPUs. The log looks like this:
I0610 06:55:34.222574 887 solver.cpp:75] Solver scaffolding done.
I0610 06:55:34.227442 887 caffe.cpp:155] Finetuning from models/VGGNet/VGG_ILSVRC_16_layers_fc_reduced.caffemodel
I0610 06:55:34.257673 887 upgrade_proto.cpp:67] Attempting to upgrade input file specified using deprecated input fields: models/VGGNet/VGG_ILSVRC_16_layers_fc_reduced.caffemodel
I0610 06:55:34.257701 887 upgrade_proto.cpp:70] Successfully upgraded file specified using deprecated input fields.
W0610 06:55:34.257705 887 upgrade_proto.cpp:72] Note that future Caffe releases will only support input layers and not input fields.
I0610 06:55:34.269161 887 net.cpp:761] Ignoring source layer drop6
I0610 06:55:34.269793 887 net.cpp:761] Ignoring source layer drop7
I0610 06:55:34.269801 887 net.cpp:761] Ignoring source layer fc8
I0610 06:55:34.269804 887 net.cpp:761] Ignoring source layer prob
I0610 06:55:34.300992 887 upgrade_proto.cpp:67] Attempting to upgrade input file specified using deprecated input fields: models/VGGNet/VGG_ILSVRC_16_layers_fc_reduced.caffemodel
I0610 06:55:34.301017 887 upgrade_proto.cpp:70] Successfully upgraded file specified using deprecated input fields.
W0610 06:55:34.301020 887 upgrade_proto.cpp:72] Note that future Caffe releases will only support input layers and not input fields.
I0610 06:55:34.312408 887 net.cpp:761] Ignoring source layer drop6
I0610 06:55:34.313042 887 net.cpp:761] Ignoring source layer drop7
I0610 06:55:34.313051 887 net.cpp:761] Ignoring source layer fc8
I0610 06:55:34.313053 887 net.cpp:761] Ignoring source layer prob
I0610 06:55:34.332238 887 parallel.cpp:392] GPUs pairs 0:1
I0610 06:55:34.524313 887 annotated_data_layer.cpp:62] output data size: 16,3,300,300
I0610 06:55:35.410040 887 parallel.cpp:425] Starting Optimization
I0610 06:55:35.410105 887 solver.cpp:294] Solving VGG_coco_SSD_300x300_train
I0610 06:55:35.410150 887 solver.cpp:295] Learning Rate Policy: multistep
It actually stops here, and terminating the training with Ctrl-C does not kill the caffe process as it usually would. I have to kill the process manually in order to free the GPU that is stuck at 100% utilization.
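For anyone hitting the same thing, this is a rough sketch of how I clear the stuck process so the busy GPU becomes usable again (it assumes the command line contains "caffe train"; check the PID with nvidia-smi first if unsure):

import os
import signal
import subprocess

# Force-kill a hung caffe training process; Ctrl-C (SIGINT) is ignored once it deadlocks.
try:
    pids = subprocess.check_output(["pgrep", "-f", "caffe train"]).decode().split()
except subprocess.CalledProcessError:
    pids = []  # no matching process found

for pid in pids:
    os.kill(int(pid), signal.SIGKILL)  # SIGKILL, since SIGTERM/SIGINT no longer work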
Reporting the same issue. Running on 4 P100 GPUs with Ubuntu 16.04, this also happens to me exactly as described above.
I'm hitting this issue too and don't know how to solve it. Can you give some help? @pkdogcom @weiliu89 @ghostcow
Try my fork: https://github.com/ghostcow/caffe