
How to use new snapshotting?

xksteven opened this issue 9 years ago · 16 comments

fast-rcnn doesn't take a --snapshot argument, so I'm not sure how to use a snapshot.

I'm asking because /models/VGG16/solver.prototxt contains this comment: "We disable standard caffe solver snapshotting and implement our own snapshot"

Thanks

xksteven avatar Jul 10 '15 16:07 xksteven

It's in ./lib/fast_rcnn/config.py

WilsonWangTHU avatar Aug 05 '15 13:08 WilsonWangTHU

In that file I can change the interval between snapshots and the snapshot infix, but there's nothing about resuming from a snapshot during training.

Would I just change the snapshot number in the solver.prototxt to reference the current snapshot?

xksteven avatar Aug 06 '15 05:08 xksteven

@xksteven I guess you would like to do validation during training? I'm not sure whether that's supported by the current fast-rcnn version, as all the forward work is driven from the Python side and I don't think there is a testing function during training for now. I'm afraid you might need to revise the code yourself.

WilsonWangTHU avatar Aug 07 '15 11:08 WilsonWangTHU

@WilsonWangTHU You know how in caffe you can provide the snapshot option, such as -snapshot=model_iter_xxx.solverstate, to restart training from that point? Normally in caffe the solverstate and the caffemodel (saved as model_iter_xxx.caffemodel) are both in the same directory, but with fast-rcnn I only see the caffemodel saved in output/default/imdb_trainval. I'd like to be able to restart training using the weights stored there.

I'm running it on a cluster with a time limit, and it kills my process at certain intervals. I just want to be able to restart training from that snapshot.

xksteven avatar Aug 07 '15 15:08 xksteven

I have the same problem.

kyuusaku avatar Sep 23 '15 01:09 kyuusaku

How do you restart training from a snapshot? Can anyone provide some tips? Thanks.

kyuusaku avatar Sep 23 '15 01:09 kyuusaku

@kyuusaku @xksteven I have met the same problem. Did you guys find an effective solution? Thanks

IdiosyncraticDragon avatar Oct 14 '15 07:10 IdiosyncraticDragon

Make the following modifications and you will be able to use the --snapshot argument

In tools/train_net.py

    def parse_args():
        """
        Parse input arguments
        """
        parser = argparse.ArgumentParser(description='Train a Fast R-CNN network')
        parser.add_argument('--gpu', dest='gpu_id',
                            help='GPU device id to use [0]',
                            default=0, type=int)
        parser.add_argument('--solver', dest='solver',
                            help='solver prototxt',
                            default=None, type=str)
        parser.add_argument('--iters', dest='max_iters',
                            help='number of iterations to train',
                            default=40000, type=int)
        parser.add_argument('--weights', dest='pretrained_model',
                            help='initialize with pretrained model weights',
                            default=None, type=str)
        parser.add_argument('--snapshot', dest='previous_state',
                            help='initialize with previous state',
                            default=None, type=str) 
        parser.add_argument('--cfg', dest='cfg_file',
                            help='optional config file',
                            default=None, type=str)
        parser.add_argument('--imdb', dest='imdb_name',
                            help='dataset to train on',
                            default='voc_2007_trainval', type=str)
        parser.add_argument('--rand', dest='randomize',
                            help='randomize (do not use a fixed seed)',
                            action='store_true')
        parser.add_argument('--set', dest='set_cfgs',
                            help='set config keys', default=None,
                            nargs=argparse.REMAINDER)

In lib/fast_rcnn/train.py

        class SolverWrapper(object):
            """A simple wrapper around Caffe's solver.
            This wrapper gives us control over the snapshotting process, which we
            use to unnormalize the learned bounding-box regression weights.
            """
            def __init__(self, solver_prototxt, roidb, output_dir,
                         pretrained_model=None, previous_state=None):
                """Initialize the SolverWrapper."""
                self.output_dir = output_dir
                print 'Computing bounding-box regression targets...'
                self.bbox_means, self.bbox_stds = \
                        rdl_roidb.add_bbox_regression_targets(roidb)
                print 'done'
                self.solver = caffe.SGDSolver(solver_prototxt)
                if pretrained_model is not None:
                    print ('Loading pretrained model '
                           'weights from {:s}').format(pretrained_model)
                    self.solver.net.copy_from(pretrained_model)
                elif previous_state is not None:
                    print ('Restoring state '
                           'from {:s}').format(previous_state)
                    self.solver.restore(previous_state)
                self.solver_param = caffe_pb2.SolverParameter()
                with open(solver_prototxt, 'rt') as f:
                    pb2.text_format.Merge(f.read(), self.solver_param)
                self.solver.net.layers[0].set_roidb(roidb)
[...]
    def train_net(solver_prototxt, roidb, output_dir,
                  pretrained_model=None, max_iters=40000, previous_state=None):
        """Train a Fast R-CNN network."""
        sw = SolverWrapper(solver_prototxt, roidb, output_dir,
                           pretrained_model=pretrained_model,
                           previous_state=previous_state)
        print 'Solving...'
        sw.train_model(max_iters)
        print 'done solving'

lynetcha avatar Nov 15 '15 19:11 lynetcha

Thanks for the code, but how do you save the solverstate during Fast R-CNN training? It looks like the method Solver::SnapshotSolverState isn't exported to pycaffe...

chrert avatar Jan 19 '16 11:01 chrert

Did you change "snapshot: 0" to "snapshot: 10000" in your solver.prototxt? That allows you to save the state at iteration 10000 for example.
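
For reference, the relevant solver.prototxt fields might look like the following (the prefix value is illustrative; fast-rcnn ships with snapshot: 0 precisely because it snapshots from the Python side):

```
# hypothetical excerpt from models/VGG16/solver.prototxt
snapshot: 10000                    # let Caffe write a .solverstate every 10000 iterations
snapshot_prefix: "vgg16_fast_rcnn"
```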

lynetcha avatar Jan 21 '16 02:01 lynetcha

Ah, thanks! Didn't think of that...

chrert avatar Jan 21 '16 14:01 chrert

@lynetcha, one more modification:

In tools/train_net.py

    output_dir = get_output_dir(imdb)
    print 'Output will be saved to `{:s}`'.format(output_dir)

    train_net(args.solver, roidb, output_dir,
              pretrained_model=args.pretrained_model,
              max_iters=args.max_iters, previous_state=args.previous_state)

Also remember to omit the --weights param when resuming.

smichalowski avatar Mar 11 '16 23:03 smichalowski

hi @po0ya

What if I don't save the extra file for the last-layer weights? Would the mAP be bad after retraining?

twmht avatar Aug 27 '16 16:08 twmht

Hello @twmht

Basically it'll mess up the whole network if you want to continue training. The network is trained to regress zero-mean, unit-variance bbox targets. For convenience at test time, the weights and biases of the last layer are scaled by the std and shifted by the mean; if that had not been done, the predictions would have to be scaled and shifted manually. But those saved weights are not the ones learned by backprop, so retraining from them would be meaningless for the network.

EDIT: Add these lines at the end of the SolverWrapper constructor (__init__):

        net = self.solver.net
        found = False
        for k in net.params.keys():
            if 'bbox_pred' in k:
                bbox_pred = k
                found = True
                print('[#] Renormalizing the final layer back')
                net.params[bbox_pred][0].data[4:, :] = \
                    (net.params[bbox_pred][0].data[4:, :] /
                     self.bbox_stds[4:, np.newaxis])
                net.params[bbox_pred][1].data[4:] = \
                    (net.params[bbox_pred][1].data -
                     self.bbox_means)[4:] / self.bbox_stds[4:]
        if not found:
            print('Warning: layer "bbox_pred" not found')
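
The round trip can be sketched with plain numpy (names are illustrative; in fast-rcnn the unnormalization itself happens inside SolverWrapper's snapshot method): unnormalizing for test time and renormalizing before resuming are exact inverses of each other.

```python
import numpy as np

def unnormalize(w, b, means, stds):
    """What snapshotting does: fold the target normalization into the
    layer so raw network outputs are usable directly at test time."""
    return w * stds[:, np.newaxis], b * stds + means

def renormalize(w, b, means, stds):
    """Inverse transform: recover the weights the solver actually learned."""
    return w / stds[:, np.newaxis], (b - means) / stds

# toy bbox_pred layer: 8 outputs (2 classes x 4 targets), 5 inputs
rng = np.random.RandomState(0)
w = rng.randn(8, 5)
b = rng.randn(8)
means = rng.randn(8)
stds = rng.rand(8) + 0.5

w_saved, b_saved = unnormalize(w, b, means, stds)   # what gets written to disk
w2, b2 = renormalize(w_saved, b_saved, means, stds) # recovered learned params
```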

po0ya avatar Aug 29 '16 15:08 po0ya

@po0ya But aren't the weights (*.caffemodel) saved by the default solver already normalized? They were never unnormalized, because that caffemodel was not saved through the provided snapshot functionality. So the produced *.solverstate is linked to a *.caffemodel that was not produced by the faster r-cnn snapshot function. With the resuming functionality you get two versions of the caffemodel: the one written by the default solver snapshot, and the one written by the snapshot function in faster r-cnn, whose weights are unnormalized before saving. So I guess the renormalization is not needed.

ds2268 avatar Nov 17 '17 07:11 ds2268

In SolverWrapper's snapshot function, the net params are first unnormalized, saved, and then restored to the normalized version. So which version a Caffe snapshot captures depends on when it is called.

I didn't dig into the Caffe code, but I think disabling snapshotting in solver.prototxt and manually calling solver.snapshot() gives better control over exactly which version is snapshotted.

Actually, I looked into the log and found that the Caffe snapshot is called before the snapshot in SolverWrapper. Diffing the param files shows that the Caffe snapshot indeed saves a different (normalized) version than SolverWrapper does. Manually invoking solver.snapshot() produced an identical .caffemodel.

So we can resume from the .solverstate safely, without unnormalizing the parameters, when using the Caffe snapshot. But this produces two versions of .caffemodel; it's up to you which version of the parameters to keep.
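
A toy numpy sketch of that ordering (all values illustrative): the built-in Caffe snapshot fires first and captures the normalized weights, then SolverWrapper unnormalizes, writes its own .caffemodel, and restores the normalized weights in memory, which is why the .solverstate can be resumed as-is.

```python
import numpy as np

w = np.array([1.0, 2.0])            # learned (normalized) bbox_pred weights
stds = np.array([0.1, 0.2])         # per-target stds (toy values)

caffe_model = w.copy()              # 1) built-in Caffe snapshot: normalized version

# 2) SolverWrapper snapshot: unnormalize, save its own copy, then restore
orig = w.copy()
w = w * stds                        # fold the stds into the weights
fast_rcnn_model = w.copy()          # fast-rcnn's .caffemodel: unnormalized version
w = orig                            # restore; training resumes from normalized
```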

misssprite avatar May 23 '18 10:05 misssprite