
Training stuck at iteration 0

Open pkdogcom opened this issue 7 years ago • 5 comments

I'm trying to train the models on PASCAL VOC with two 1080 Ti cards, and I've changed 'gpus' to "0,1". However, training gets stuck at iteration 0: GPU usage on one GPU is 100% and on the other is 0%. How can I solve this issue? Thanks.
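
For reference, the change described above is typically made in the SSD training script. This is only a minimal sketch, assuming the stock examples/ssd/ssd_pascal.py; the variable names follow that script:

gpus = "0,1"                # comma-separated device ids, passed to caffe's --gpu flag
gpulist = gpus.split(",")   # e.g. ["0", "1"]
num_gpus = len(gpulist)     # the script derives the per-GPU batch size from this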

pkdogcom avatar May 30 '17 02:05 pkdogcom

It is loading an existing model snapshot and doing evaluation.

weiliu89 avatar Jun 08 '17 17:06 weiliu89

But this happens even if I start from scratch, and it gets stuck forever as long as I use two GPUs. The log looks like this:

I0610 06:55:34.222574   887 solver.cpp:75] Solver scaffolding done.
I0610 06:55:34.227442   887 caffe.cpp:155] Finetuning from models/VGGNet/VGG_ILSVRC_16_layers_fc_reduced.caffemodel
I0610 06:55:34.257673   887 upgrade_proto.cpp:67] Attempting to upgrade input file specified using deprecated input fields: models/VGGNet/VGG_ILSVRC_16_layers_fc_reduced.caffemodel
I0610 06:55:34.257701   887 upgrade_proto.cpp:70] Successfully upgraded file specified using deprecated input fields.
W0610 06:55:34.257705   887 upgrade_proto.cpp:72] Note that future Caffe releases will only support input layers and not input fields.
I0610 06:55:34.269161   887 net.cpp:761] Ignoring source layer drop6
I0610 06:55:34.269793   887 net.cpp:761] Ignoring source layer drop7
I0610 06:55:34.269801   887 net.cpp:761] Ignoring source layer fc8
I0610 06:55:34.269804   887 net.cpp:761] Ignoring source layer prob
I0610 06:55:34.300992   887 upgrade_proto.cpp:67] Attempting to upgrade input file specified using deprecated input fields: models/VGGNet/VGG_ILSVRC_16_layers_fc_reduced.caffemodel
I0610 06:55:34.301017   887 upgrade_proto.cpp:70] Successfully upgraded file specified using deprecated input fields.
W0610 06:55:34.301020   887 upgrade_proto.cpp:72] Note that future Caffe releases will only support input layers and not input fields.
I0610 06:55:34.312408   887 net.cpp:761] Ignoring source layer drop6
I0610 06:55:34.313042   887 net.cpp:761] Ignoring source layer drop7
I0610 06:55:34.313051   887 net.cpp:761] Ignoring source layer fc8
I0610 06:55:34.313053   887 net.cpp:761] Ignoring source layer prob
I0610 06:55:34.332238   887 parallel.cpp:392] GPUs pairs 0:1
I0610 06:55:34.524313   887 annotated_data_layer.cpp:62] output data size: 16,3,300,300
I0610 06:55:35.410040   887 parallel.cpp:425] Starting Optimization
I0610 06:55:35.410105   887 solver.cpp:294] Solving VGG_coco_SSD_300x300_train
I0610 06:55:35.410150   887 solver.cpp:295] Learning Rate Policy: multistep

It actually stops here, and terminating the training with ctrl-c does not kill the caffe process as it usually would. I have to kill the process manually in order to free the GPU that is stuck at 100% utilization.
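
In case it helps anyone hitting the same hang, here is a minimal sketch of the manual cleanup, assuming you have already found the stuck PID (for example from nvidia-smi or ps). Because the hung process no longer responds to the SIGINT sent by ctrl-c, it has to be sent SIGKILL instead:

import os, signal

def force_kill(pid):
    # SIGKILL cannot be caught or ignored, unlike the SIGINT from ctrl-c
    os.kill(pid, signal.SIGKILL)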

pkdogcom avatar Jun 10 '17 06:06 pkdogcom

Reporting the same issue. Running on 4 P100 GPUs, Ubuntu 16.04, this happens to me exactly as described above.

ghostcow avatar Aug 15 '17 17:08 ghostcow

I am hitting this issue too and don't know how to solve it. Can you give some help? @pkdogcom @weiliu89 @ghostcow

ycjcy avatar Sep 23 '19 08:09 ycjcy

Try my fork: https://github.com/ghostcow/caffe

ghostcow avatar Sep 23 '19 09:09 ghostcow