Incremental-Network-Quantization

Cannot replicate the results of the INQ paper + Suggestion to optimize the speed of quantization

Open vassiliad opened this issue 6 years ago • 5 comments

First of all, I would like to thank you for sharing your code; I deeply appreciate that you made the implementation of your algorithm available.

Unfortunately, I have run into some issues when trying to replicate the results of your paper. I think I have followed all of the fine details of your INQ implementation, but I cannot reach the expected level of classification performance.

Would you mind having a look at the steps that I follow to replicate your methodology? Note that, where appropriate, I give the exact values that I use for Alexnet, since this is the example included with your code.

  • Partition the parameters of the model into "quantizable" and "non-quantizable"
    • Quantizable parameters are the weights, non-quantizable ones are the bias terms and batch-norm parameters
  • Decide the accumulated portions of the quantization campaign (e.g. for Alexnet we'd follow the partition scheme campaign = [0.3, 0.6, 0.8, 1.0]).
    • These values determine the portion of the weights which shall be quantized and subsequently "frozen".
  • Set up the solver parameters (e.g. for Alexnet we set the following):
    • base learning rate = 0.003
    • batch_size = 256
    • weight_decay = 0.0005
    • momentum = 0.9
    • step_gamma = 0.2
    • step_size = 15000
      • in the case of Alexnet, with its batch_size of 256 (~5004 steps per epoch), we reduce the learning rate ONLY ONCE, namely during the very last epoch, in which we finetune just the bias terms since all of the weights are by then frozen in their quantized form
        • Is this in line with Step 3 of your algorithm as stated in your paper? (it reads "Reset the base learning rate and the learning policy")
  • At the beginning of each epoch of the campaign I calculate the maximum absolute element of each weight matrix to compute the quantization offset (n1 of blob.cpp). Subsequently, I quantize the appropriate top X% of the weights. These quantized values are then frozen for the entirety of the finetuning for this epoch (a short sketch of my implementation follows this list).
    • in the case of Alexnet X is one of [30, 60, 80, 100], following the campaign quantization scheme defined above. For this step I use your algorithm from blob.cpp.
    • You may speed up the quantization process by using ldexp(1, i) instead of pow(2, i); ldexp is much faster (at least on my machine/compiler configuration) and functionally equivalent for this particular case.
    • Similar to your code, I assume that the maximum value does not change, so the corresponding n1 offset remains the same too. As such, I do not need to store n1, and I expect the frozen parts of the weights to be quantized in exactly the same way in all subsequent epochs of the campaign.
    • When an epoch terminates I save the state of the solver so that the momentum history is preserved. I expect this is what happens when the checkpoint is restored from disk.
    • Finally, I noticed that in your https://github.com/Zhouaojun/Incremental-Network-Quantization/blob/master/examples/INQ/alexnet/solver.prototxt file you specify max_iter as 63000. In the steps listed above I assumed that you don't run each phase of the campaign for 63k mini-batch updates, but rather that Caffe stops each such phase after exactly 1 epoch (in the case of Alexnet that is about 5004 mini-batch updates, based on the size of the ImageNet training set and the batch size that you picked).
  • During finetuning I shuffle the input data using a shuffling buffer of 5000 images. The original list of images/labels is also shuffled.
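
To make the quantization step above concrete, here is a minimal, self-contained C++ sketch of what I do per layer. This is my own reading of the paper and of blob.cpp; the function name, the bit-width argument and the rounding boundaries are my assumptions, not your exact code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// inq_quantize: quantize the largest `portion` of the weights (by magnitude)
// to powers of two and mark them as frozen. Sketch only.
void inq_quantize(std::vector<float>& w, std::vector<bool>& frozen,
                  float portion, int bits) {
  // n1 from the largest absolute weight s, as I read the paper: n1 = floor(log2(4s/3)).
  float s = 0.f;
  for (float v : w) s = std::max(s, std::fabs(v));
  const int n1 = static_cast<int>(std::floor(std::log2(4.f * s / 3.f)));
  // Smallest exponent in the level set {0, ±2^n2, ..., ±2^n1}; my reading of
  // the paper's n2 = n1 + 1 - 2^(b-1)/2.
  const int n2 = n1 + 1 - (1 << (bits - 1)) / 2;

  // Rank indices by weight magnitude and take the accumulated top `portion`.
  std::vector<size_t> idx(w.size());
  for (size_t i = 0; i < idx.size(); ++i) idx[i] = i;
  std::sort(idx.begin(), idx.end(),
            [&](size_t a, size_t b) { return std::fabs(w[a]) > std::fabs(w[b]); });
  const size_t cut = static_cast<size_t>(portion * w.size());

  for (size_t k = 0; k < cut; ++k) {
    const size_t i = idx[k];
    const float a = std::fabs(w[i]);
    float q = 0.f;  // weights below the smallest level are rounded to zero
    for (int p = n2; p <= n1; ++p) {
      // ldexp(1.f, p) == 2^p, but is much faster than pow(2, p) here.
      const float beta = std::ldexp(1.f, p);
      if (a >= 0.75f * beta && a < 1.5f * beta) { q = beta; break; }
    }
    w[i] = (w[i] < 0.f) ? -q : q;
    frozen[i] = true;  // excluded from weight updates from now on
  }
}

int main() {
  std::vector<float> w = {0.9f, -0.31f, 0.07f, 0.02f, -0.65f, 0.18f};
  std::vector<bool> frozen(w.size(), false);
  inq_quantize(w, frozen, 0.5f, 5);  // quantize the top 50% with a 5-bit code
  for (float v : w) std::printf("% .4f ", v);
  std::printf("\n");
  return 0;
}
```

I then use the frozen mask to skip the weight updates for those entries during the subsequent finetuning.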

After finetuning for 4 epochs (20016 mini-batch updates in total) following the campaign scheme, I find that the quantized version of my Alexnet model has worse classification performance than the original one: its top-5/top-1 errors are about 2-3% higher.

Have I missed something in your algorithm, or is this just a matter of finding the right solver hyperparameters and campaign partition scheme? I should note that my Alexnet model has a slightly different original validation accuracy on Imagenet than the one used in your paper, but it has the exact same architecture.

vassiliad avatar May 30 '18 07:05 vassiliad

@vassiliad thanks for your comments and suggestion. For the first step, what max iteration do you use? The max iteration should be large enough for the learning rate to decay to 1e-5. Also, can you share the accuracy after the first retraining step?

AojunZhou avatar Jun 01 '18 06:06 AojunZhou

Hi @Zhouaojun, thank you for helping me out with this issue.

I included the information you requested in my earlier post, but I will restate it here in a more concise form. So, for each step of the process I finetune for 5004 iterations (in my initial post, "mini-batch update" is a synonym for what Caffe defines as an iteration). It takes 5004 iterations to process one full epoch of Imagenet because 256*5004 ~= 1281167, which is the number of images in the Imagenet training set. Because 256 does not evenly divide 1281167, the very last iteration (the 5004th) involves fewer than 256 images.

So in total I finetune for 20016 iterations to match the 4 epochs specified in the INQ paper. Is your intention to finetune for 63000 iterations for every step of the process, which would bring the total number of iterations to 4*63000 = 252000 (i.e. roughly 50 epochs)?
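
Just to make the bookkeeping explicit, here is the arithmetic in code form (a throwaway sketch, using the ~5004 full mini-batches per Imagenet epoch from above):

```cpp
// Throwaway sketch of the iteration bookkeeping discussed above.
#include <cstdio>

int main() {
  const long train_images = 1281167;  // size of the Imagenet training set
  const long batch_size = 256;
  const long iters_per_epoch = train_images / batch_size;  // 5004 full mini-batches
  const long campaign_steps = 4;      // accumulated portions 0.3 / 0.6 / 0.8 / 1.0

  // My interpretation: one epoch per campaign step.
  std::printf("1 epoch per step    : %ld iterations in total\n",
              campaign_steps * iters_per_epoch);                 // 20016
  // The alternative reading of solver.prototxt: 63000 iterations per step.
  std::printf("63000 iters per step: %ld iterations (~%ld epochs) in total\n",
              campaign_steps * 63000L,
              campaign_steps * 63000L / iters_per_epoch);        // 252000, ~50
  return 0;
}
```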

Finally, I do apply L2 regularization with a weight decay of 0.0005, following examples/INQ/alexnet/solver.prototxt.

vassiliad avatar Jun 01 '18 07:06 vassiliad

@vassiliad you should finetune for more iterations, until the learning rate decays to 0.00001, and you can speed up your training process with step_size = 2000 (~= 1/3 epoch).
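
For example, the per-step solver settings could look roughly like this (a sketch only, reusing the hyperparameters already quoted in this thread, not a copy of the released file; max_iter is just an illustrative value, chosen because with gamma = 0.2 the learning rate reaches 0.003 * 0.2^4 ~= 5e-6, i.e. below 1e-5, by iteration 8000 when stepsize = 2000):

```
# Sketch of per-step solver settings (illustrative values only)
base_lr: 0.003        # starting learning rate
lr_policy: "step"
gamma: 0.2            # multiply the lr by 0.2 every `stepsize` iterations
stepsize: 2000        # ~1/3 of an Alexnet epoch (Caffe's field name is `stepsize`)
max_iter: 10000       # enough for the lr to fall below 1e-5 (4 decays by iteration 8000)
momentum: 0.9
weight_decay: 0.0005  # L2 regularization
```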

AojunZhou avatar Jun 04 '18 13:06 AojunZhou

@Zhouaojun so I should run each phase until the learning rate reaches 0.00001. In other words, for the case of Alexnet I should wait until the learning rate has been multiplied by gamma 2 times. In terms of epochs this amounts to 9 full epochs for each step of the process (which can be obtained by setting max_iterations equal to 45000).

Is this the configuration that you used in the paper (instead of 63000)?

vassiliad avatar Jun 04 '18 14:06 vassiliad

Hi @AojunZhou, I am also trying to reproduce the results for the ResNet-18 network with 2-bit weights, but I am not able to achieve the accuracy that you reported, i.e. ~66%. Can you please help me with the hyperparameters that you used, e.g. base learning rate and step size? Thank you!

saqibjaved1 avatar Feb 17 '21 00:02 saqibjaved1