Dynamic-Network-Surgery-Caffe-Reimplementation
Caffe re-implementation of Dynamic Network Surgery (GPU only; cuDNN not supported yet).
Official repo: https://github.com/yiwenguo/Dynamic-Network-Surgery
Main Differences:
- We do not prune the bias term.
- We make the selection of hyper-parameters clearer and more intuitive.
- We re-organize the code, so you can monitor the change of weight sparsity in convolution layers and fully-connected layers during training.
- We re-write the original convolution layer and inner-product layer instead of creating new layer classes, which makes it easier to reuse existing .prototxt files without modifying the layer types.
How to use?
The same as the original Caffe framework.
$ make all -j8 # USE_NCCL=1 make all -j8 for multi-GPU support
$ ./build/tools/caffe train --weights /ModelPath/Ur.caffemodel --solver /SolverPath/solver.prototxt -gpu 0
$ ./build/tools/caffe test --weights /ModelPath/Ur.caffemodel --model /StructPath/train_val.prototxt -gpu 0 -iterations 100
# Please note:
# The CPU version is not supported yet, but you may find it quite easy to rewrite conv_layer.cpp and inner_product_layer.cpp from the corresponding .cu files.
You may load a pre-trained caffemodel into this framework to fine-tune (highly recommended) or re-train from the beginning (remember to set the threshold in train_val.prototxt, which is described below).
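After fine-tuning, you may want to check how sparse the pruned layers actually became. Below is a minimal pycaffe sketch (assuming the Python bindings are built; the paths are the same placeholders as above, and the layer names must match your train_val.prototxt). It reads the weight-mask blob when the layer exposes one and otherwise falls back to counting exact zeros in the weights:

import caffe
import numpy as np

caffe.set_mode_gpu()

# Placeholder paths -- point these at your own prototxt and pruned caffemodel.
net = caffe.Net('/StructPath/train_val.prototxt',
                '/ModelPath/Ur.caffemodel', caffe.TEST)

for name in ('conv1', 'fc6'):                      # layers of interest
    blobs = net.params[name]
    if len(blobs) > 2:                             # weight, bias, weight_mask
        sparsity = np.mean(blobs[-1].data == 0)    # zero mask entries = pruned weights
    else:
        sparsity = np.mean(blobs[0].data == 0)     # fall back to exact zeros in the weights
    print('%s: %.2f%% pruned' % (name, 100 * sparsity))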
Usage Example:
Pre-trained caffemodel: AlexNet with BN (https://github.com/HolmesShuan/AlexNet-BN-Caffemodel-on-ImageNet)
Sparse (50%) convolution layers should outperform the full-precision baseline.
Pruned Layer
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  # weight_mask param
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 96
    kernel_size: 11
    stride: 4
    pad: 2
    threshold: 0.6 ## based on the 68-95-99.7 rule [default 0.6]
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
    # weight_mask_filler { ## optional
    #   type: "constant"
    #   value: 1 ## the default is already set to 1 in caffe.proto
    # }
  }
}
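The third param block above belongs to the weight mask that dynamic network surgery keeps alongside the weights (lr_mult: 0 and decay_mult: 0, so the solver never updates it directly; the layer flips mask entries itself during surgery). Conceptually, the forward and backward passes use the masked weights. A toy numpy sketch of that idea, with a simplified single-threshold pruning criterion and made-up shapes (illustrative only, not this repo's CUDA code):

import numpy as np

# Toy shapes: 96 filters of size 3x11x11, as in the conv1 example above.
W = np.random.normal(0.0, 0.01, size=(96, 3, 11, 11)).astype(np.float32)  # stored weights
mask = np.ones_like(W)                   # weight_mask blob, initialized to 1
mask[np.abs(W) < 0.6 * W.std()] = 0      # surgery prunes small weights (simplified criterion)

W_eff = W * mask                         # the layer computes with the masked weights
print('pruned: %.1f%% of conv1 weights' % (100 * (1 - mask.mean())))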
Dense Layer
layer {
  name: "fc6"
  type: "InnerProduct"
  bottom: "pool5"
  top: "fc6"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 4096
    sparsity_term: false ## default is true; set to false to keep this layer dense (no pruning)
    weight_filler {
      type: "gaussian"
      std: 0.005
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
}
solver.prototxt
net: "models/bvlc_alexnet/train_val.prototxt"
base_lr: 0.001
lr_policy: "multistep"
gamma: 0.1
stepvalue: 84000
display: 20
max_iter: 162000
momentum: 0.9
weight_decay: 0.00005
snapshot: 6000
snapshot_prefix: "models/bvlc_alexnet/alexnet-BN"
solver_mode: GPU
surgery_iter_gamma: 0.0001 ## [default 1e-4] Probability(do surgery) = (1+gamma*iter)^-power
surgery_iter_power: 1 ## [default 1]
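The last two options control how often surgery (pruning and splicing) is applied as training goes on, following the formula in the comment above. A quick sketch of how that probability decays with the values shown (gamma = 1e-4, power = 1):

# Probability(do surgery) = (1 + gamma * iter) ** (-power), per the comment above.
gamma, power = 1e-4, 1

for it in (0, 10000, 50000, 84000, 162000):
    p = (1.0 + gamma * it) ** (-power)
    print('iter %6d: surgery probability %.3f' % (it, p))
# iter      0: surgery probability 1.000
# iter  10000: surgery probability 0.500
# iter  50000: surgery probability 0.167
# iter  84000: surgery probability 0.106
# iter 162000: surgery probability 0.058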
Tips
- The selection of the threshold is pretty tricky; it may differ a lot between layers.
- If you encounter the vanishing gradient problem, adjust gamma and power in solver.prototxt. If multiple attempts fail, you may reduce the threshold based on the 68-95-99.7 rule.
Threshold | Sparsity
---|---
0.674 | 50%
0.994 | 68%
1.281 | 80%
1.644 | 90%
1.959 | 95%
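The table is just the 68-95-99.7 rule made explicit: assuming the threshold is measured in standard deviations of a roughly zero-mean Gaussian weight distribution (as the gaussian weight_filler suggests), a threshold of T prunes the fraction of weights with |w| < T * sigma, i.e. erf(T / sqrt(2)). A quick sanity check of the table under that assumption:

from math import erf, sqrt

# Expected pruned fraction of a zero-mean Gaussian layer for a threshold of T std-devs.
for T in (0.674, 0.994, 1.281, 1.644, 1.959):
    sparsity = erf(T / sqrt(2.0))   # P(|w| < T * sigma)
    print('threshold %.3f -> ~%.0f%% sparsity' % (T, 100 * sparsity))
# threshold 0.674 -> ~50% sparsity
# threshold 0.994 -> ~68% sparsity
# threshold 1.281 -> ~80% sparsity
# threshold 1.644 -> ~90% sparsity
# threshold 1.959 -> ~95% sparsity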
Citation
The basic idea comes from:
@inproceedings{guo2016dynamic,
  title     = {Dynamic Network Surgery for Efficient DNNs},
  author    = {Guo, Yiwen and Yao, Anbang and Chen, Yurong},
  booktitle = {Advances in Neural Information Processing Systems (NIPS)},
  year      = {2016}
}
And it is based on the Caffe framework:
@article{jia2014caffe,
  author  = {Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, Trevor},
  journal = {arXiv preprint arXiv:1408.5093},
  title   = {Caffe: Convolutional Architecture for Fast Feature Embedding},
  year    = {2014}
}