
Difficulty in reproducing NIPS'17 results for AlexNet using PyTorch code

Open adpatil2 opened this issue 6 years ago • 9 comments

Hi Shangyu,

Thanks a lot for sharing PyTorch code for applying LOBS on various ImageNet CNNs.

I was able to run the code after a couple of minor error/syntax corrections required by the Python version difference (2.x vs 3.x).
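(These were roughly the usual Python 2 to 3 changes; an illustrative sketch, not an exact diff of the repo:)

```python
# Illustrative Python 2 -> 3 fixes (not an exact diff of the repo):
print('pruning layer')            # Python 2: print 'pruning layer'
for i in range(10):               # Python 2: xrange(10)
    pass
nnz, total = 3, 8
ratio = nnz / float(total)        # integer division semantics differ in Python 2
```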

I kept almost all of the default settings. I pruned AlexNet successfully and validated it on the entire ImageNet validation set. However, I could not reproduce the numbers published in your NIPS'17 paper for AlexNet.

In the NIPS'17 paper, at 11% CR, AlexNet achieves a top-1 error of 50.04% and a top-5 error of 26.87% without retraining. However, when I ran your PyTorch code, the resulting AlexNet only achieved a top-1 error of 70.37% and a top-5 error of 45.97%, without retraining. These error rates are much higher than the numbers reported in the paper. Kindly find below the terminal output from the validate-AlexNet.py script:

```
[adpatil2@csl-420-07 ImageNet]$ python validate-AlexNet.py
Overall compression rate (nnz/total): 0.127041
==> Preparing data..
/data/L-OBS/PyTorch/ImageNet/utils.py:135: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
  input_var = torch.autograd.Variable(input, volatile=True)
/data/L-OBS/PyTorch/ImageNet/utils.py:136: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
  target_var = torch.autograd.Variable(target, volatile=True)
/data/L-OBS/PyTorch/ImageNet/utils.py:144: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  losses.update(loss.data[0], input.size(0))
Test: [0/1000]    Loss 4.3182 (4.3182)  Prec@1 38.000 (38.000)  Prec@5 60.000 (60.000)
Test: [200/1000]  Loss 4.3394 (4.2516)  Prec@1 24.000 (29.144)  Prec@5 46.000 (53.652)
Test: [400/1000]  Loss 4.3978 (4.2476)  Prec@1 30.000 (29.631)  Prec@5 54.000 (54.354)
Test: [600/1000]  Loss 4.5474 (4.2397)  Prec@1 18.000 (29.601)  Prec@5 46.000 (54.326)
Test: [800/1000]  Loss 4.6500 (4.2459)  Prec@1 18.000 (29.630)  Prec@5 44.000 (54.102)
 * Prec@1 29.630  Prec@5 54.030
```
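(For completeness, the deprecation warnings above are easy to silence and, as far as I can tell, should not affect accuracy. A sketch of the change, assuming the loop in utils.py is a standard validation loop; the names below just stand in for the repo's objects:)

```python
import torch

# Sketch of the modern equivalents for the lines flagged in utils.py
# (model, criterion, loader and losses are stand-ins for the repo's objects):
def validate(model, criterion, loader, losses):
    model.eval()
    with torch.no_grad():                               # replaces Variable(..., volatile=True)
        for input, target in loader:
            output = model(input)
            loss = criterion(output, target)
            losses.update(loss.item(), input.size(0))   # replaces loss.data[0]
```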

Will you please help me understand why we are seeing such a difference? Am I running your code incorrectly? Did you carry out any additional fine-tuning/processing to achieve the published NIPS'17 results? Is that fine-tuning code not part of the PyTorch code released here?

Please let me know what you think. I am looking forward to hearing from you.

Ameya

adpatil2 avatar Nov 09 '18 02:11 adpatil2

Hi @adpatil234 ,

Thanks for using our code and for your report. We think it might be a problem of class imbalance in the Hessian generation. Since this is a reproduction of the original experiment code, there may be something we missed. We are trying to fix this and will update you once we are clear. @XinDongol
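For reference, one way to make the Hessian samples class-balanced would look roughly like this (a sketch with torchvision, not the exact code in this repo; the dataset path is a placeholder):

```python
import random
from collections import defaultdict
from torch.utils.data import Subset
from torchvision import datasets, transforms

# Sketch: draw the same number of images per class for Hessian estimation,
# instead of taking the first N samples (which may be class-imbalanced).
dataset = datasets.ImageFolder('/path/to/imagenet/train',     # placeholder path
                               transform=transforms.Compose([
                                   transforms.Resize(256),
                                   transforms.CenterCrop(224),
                                   transforms.ToTensor(),
                               ]))

per_class = 2  # e.g. 2 images x 1000 classes = 2000 Hessian samples
indices_by_class = defaultdict(list)
for idx, (_, label) in enumerate(dataset.samples):
    indices_by_class[label].append(idx)

balanced_indices = []
for label, idxs in indices_by_class.items():
    balanced_indices.extend(random.sample(idxs, min(per_class, len(idxs))))

hessian_dataset = Subset(dataset, balanced_indices)
```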

Best regards, Shangyu

csyhhu avatar Nov 09 '18 02:11 csyhhu

Hi Shangyu,

Thank you very much for your prompt response!

It is good to know that you are considering class imbalance in the Hessian generation as one possible cause.

Please do let me know once you get a good insight about this issue.

Thanks again for your help. I really appreciate it.

Ameya

adpatil2 avatar Nov 09 '18 03:11 adpatil2

Hi @csyhhu,

I have another question about this issue.

In your NIPS'17 implementation, did you compute the inverse Hessians exactly once, or did you recompute them after pruning a certain fraction, say >50%?

Because of the way you estimate the inverse Hessians, they are a strong function of the activation distribution of the current network. Once the network is pruned significantly (say >50-75%), the activation distribution has changed substantially. This is evident from the Batch Norm parameter adjustment that your code requires.
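As I understand the layer-wise formulation, the Hessian for layer $l$ is built directly from that layer's input activations,

$$H^{l} \approx \frac{1}{n}\sum_{j=1}^{n} x_j^{l-1}\,\big(x_j^{l-1}\big)^{\top},$$

so $(H^{l})^{-1}$ is tied to whatever activation distribution the current network produces.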

If the activation distribution has changed significantly, the inverse Hessians estimated for the dense network are no longer a good approximation of the inverse Hessians of the pruned network. In that case, it would be better to recompute the inverse Hessians after pruning a certain fraction, before proceeding with further aggressive pruning.
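In pseudocode, what I have in mind is roughly the following; `compute_inverse_hessian` and `prune_layer_to_cr` are hypothetical helpers standing in for your routines:

```python
# Hypothetical sketch of iterative prune-and-recompute; the two helper
# callables are made-up names, not functions from this repo.
def iterative_lobs(model, prunable_layers, hessian_loader,
                   compute_inverse_hessian, prune_layer_to_cr,
                   target_cr=0.11, step=0.5):
    current_cr = 1.0
    while current_cr > target_cr:
        round_cr = max(target_cr, current_cr * step)
        for name in prunable_layers:
            # Recompute on the *current* (already pruned) network each round,
            # so the inverse Hessian reflects the new activation distribution.
            inv_hessian = compute_inverse_hessian(model, name, hessian_loader)
            prune_layer_to_cr(model, name, inv_hessian, round_cr)
        current_cr = round_cr
    return model
```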

Your current PyTorch code (I think) computes Hessians only once. I was wondering if you had recomputed Hessians (as described above) to obtain NIPS'17 results. That would explain the above error rate discrepancy that we are currently observing.

Please let me know what you think.

Looking forward to hearing from you.

Best regards, Ameya

adpatil2 avatar Nov 11 '18 02:11 adpatil2

Hi @adpatil234 ,

We compute the Hessian only once, both in the paper and in the code.

You are right about the change in activation distribution after pruning. In our paper, we assume (Eq. 4) that the pruned activations are similar to the original activations, which allows the first term to be eliminated. As you observe, this assumption is increasingly violated as pruning proceeds. However, we could still achieve good results in that regime (as reported in NIPS 2017).

Your idea of regenerating the Hessian matrix should be a solution. However, due to the high cost of computing Hessians, we compute them only once, both in our original implementation (which produced the NIPS'17 results) and in this one.

Because the two implementations use different frameworks and were written by different people, we are still investigating the cause. We will let you know once we find out.

Best regards, Shangyu

csyhhu avatar Nov 11 '18 02:11 csyhhu

Hi Shangyu,

Thank you for your comment. It is good to have it confirmed that you compute Hessians only once.

Do you have any update on reproducing the NIPS'17 AlexNet results? I also ran the ResNet-18 code released as part of the PyTorch code. It returns NaN values for accuracy during validation of the pruned model, so there seems to be a bug in that code as well. Do you see clean output when you run it at your end? Kindly find the terminal output for ResNet-18 pruning below. Please let me know if I am missing something.

[attached screenshot: resnet20error]

Interestingly, I am not able to reproduce your NIPS'17 LeNet-5 numbers (L-OBS without retraining, 7% CR entry) either, especially under the constraint that the Hessians are computed only once.

Would you please provide any MNIST/CIFAR-10 code that reproduces at least one entry in the NIPS'17 table? I understand that debugging and running ImageNet networks takes time, but an official reproduction of the MNIST/CIFAR-10 results and the associated code would also be very helpful.

Thank you very much for your help.

Looking forward to hearing from you.

Best regards, Ameya

adpatil2 avatar Nov 21 '18 02:11 adpatil2

Hi @adpatil234 , the NaN is caused by the Hessian. For some layers, the Woodbury method should be used to generate the inverse Hessian; otherwise it produces NaN in the pruned weights. I have specified the layers that require the Woodbury method for VGG. Maybe I missed checking this for ResNet-18; I will check which layers should use Woodbury.
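For reference, the recursive (Woodbury / Sherman-Morrison style) construction of the inverse Hessian looks roughly like the sketch below; it avoids explicitly inverting a rank-deficient H, which is where the NaN comes from. This is only an illustration in NumPy, not the exact code in this repo:

```python
import numpy as np

def recursive_inverse_hessian(X, alpha=1e-6):
    """Sketch: build H^{-1} for H = (1/n) X X^T by rank-1 Woodbury updates,
    starting from H_0^{-1} = I / alpha, instead of inverting H directly.

    X: (d, n) array whose columns are the layer's input activations.
    """
    d, n = X.shape
    H_inv = np.eye(d) / alpha
    for j in range(n):
        x = X[:, j:j + 1]                     # (d, 1) column
        Hx = H_inv @ x
        H_inv -= (Hx @ Hx.T) / (n + float(x.T @ Hx))
    return H_inv
```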

The other bugs are due to the PyTorch version; I will try to fix them.

I think the TensorFlow code can reproduce LeNet-300-100. For MNIST/CIFAR-10, I will try to provide examples. Thanks for your advice.

Best regards, Shangyu

csyhhu avatar Nov 21 '18 03:11 csyhhu

Hi @csyhhu,

I was wondering if there have been any updates related to this thread, since I have encountered similar issues when trying to reproduce the results with the PyTorch version.

Best regards, Dan

dalistarh avatar Aug 15 '19 15:08 dalistarh

Hi @dalistarh , can you tell me more about your issues? For example, the performance after pruning and after fine-tuning. Also, you can refer to the dev branch, where I have implemented some other models and datasets.

csyhhu avatar Aug 16 '19 06:08 csyhhu

Hi,

I have managed to run the ResNet18/ImageNet example via the PyTorch code on the main branch.

I get the following:

fc.weight CR: 0.611482
fc.bias CR: 0.450000
Prune weights used: 2/2
Overall compression rate (nnz/total): 0.611168
[2019-08-16 04:16:11.803439] Begin adjust finish
Train: [0/500]    Loss nan (nan)  Prec@1 0.000 (0.000)  Prec@5 0.000 (0.000)
Train: [100/500]  Loss nan (nan)  Prec@1 0.000 (0.000)  Prec@5 0.000 (0.000)
Train: [200/500]  Loss nan (nan)  Prec@1 0.781 (0.004)  Prec@5 3.906 (0.019)
Train: [300/500]  Loss nan (nan)  Prec@1 0.000 (0.003)  Prec@5 0.000 (0.013)
Train: [400/500]  Loss nan (nan)  Prec@1 0.000 (0.002)  Prec@5 0.000 (0.010)
Train: [500/500]  Loss nan (nan)  Prec@1 0.000 (0.002)  Prec@5 0.000 (0.008)
[2019-08-16 04:37:22.213821] Adjust finish. Now saving parameters
==> Preparing data..
Test: [0/1000]  Loss nan (nan)  Prec@1 0.000 (0.000)    Prec@5 0.000 (0.000)
Test: [20/1000] Loss nan (nan)  Prec@1 0.000 (0.000)    Prec@5 0.000 (0.000)
Test: [40/1000] Loss nan (nan)  Prec@1 0.000 (0.000)    Prec@5 0.000 (0.000)
Test: [60/1000] Loss nan (nan)  Prec@1 0.000 (0.000)    Prec@5 0.000 (0.000)
Test: [80/1000] Loss nan (nan)  Prec@1 0.000 (0.000)    Prec@5 0.000 (0.000)
Test: [100/1000]        Loss nan (nan)  Prec@1 0.000 (0.000)    Prec@5 0.000 (0.000)
 * Prec@1 0.000 Prec@5 0.000
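
For what it is worth, a quick check like the one below could help locate which pruned tensors already contain NaN/Inf before the adjustment step (the checkpoint path is hypothetical, not a file from this repo):

```python
import torch

# Sketch: report which saved tensors already contain NaN/Inf
# ('pruned_resnet18.pth' is a hypothetical path).
state_dict = torch.load('pruned_resnet18.pth', map_location='cpu')
for name, tensor in state_dict.items():
    if isinstance(tensor, torch.Tensor) and tensor.is_floating_point():
        if not torch.isfinite(tensor).all():
            print(name, 'contains NaN/Inf')
```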

dalistarh avatar Aug 16 '19 13:08 dalistarh