
Optimize Backward Time Complexity to O(MK)

Open mfs6174 opened this issue 9 years ago • 113 comments

In the original implementation, the time complexity of the backward pass of the center loss layer is O(MK+NM). It is very slow when training with a large number of classes, since the running time of the backward pass grows with the number of classes (N). Unfortunately, this is the common case when training face recognition models (e.g. 750k unique persons).

This pull request rewrites the backward code, optimizing the time complexity to O(MK) with O(N) additional space. Because M (batch size) << N and K (feature length) << N usually hold for face recognition problems, this change improves training speed significantly.
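As an illustration of the idea only (the actual PR is C++/CUDA inside the Caffe layer; this NumPy sketch and its function names are my own), the trick is to accumulate the center gradients over the M batch samples plus an O(N) count table, instead of scanning all N classes:

```python
import numpy as np

def center_backward_naive(x, labels, centers):
    """O(MK + NM): the center update visits every one of the N classes."""
    bottom_diff = x - centers[labels]                 # dL/dx_i = x_i - c_{y_i}
    center_diff = np.zeros_like(centers)
    for j in range(centers.shape[0]):                 # N iterations, even for absent classes
        mask = labels == j
        cnt = int(mask.sum())
        center_diff[j] = (cnt * centers[j] - x[mask].sum(axis=0)) / (cnt + 1.0)
    return bottom_diff, center_diff

def center_backward_fast(x, labels, centers):
    """O(MK) time with O(N) extra space: only classes present in the batch are touched."""
    bottom_diff = x - centers[labels]
    center_diff = np.zeros_like(centers)
    counts = np.bincount(labels, minlength=centers.shape[0])  # the O(N) side table
    for i, y in enumerate(labels):                    # M iterations, O(K) work each
        center_diff[y] += centers[y] - x[i]
    present = counts > 0
    center_diff[present] /= (counts[present] + 1.0)[:, None]
    return bottom_diff, center_diff
```

Both functions produce identical gradients; only the loop structure differs, which is why the speedup grows with N.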

For a GoogLeNet v2 model trained on Everphoto's 750k-unique-person dataset, on a single Nvidia GTX Titan X with batch size 24 and iter_size = 5, the average backward iteration times for the different cases are:

  1. Softmax only: 230ms
  2. Softmax + Center loss, original implementation: 3485ms, center loss layer: 3332ms
  3. Softmax + Center loss, implementation in this PR: 235.6ms, center loss layer: 5.4ms

That is a more than 600x speedup for the center loss layer.

For the paper author's "mnist_example", running on a single GTX Titan X, the training time of the original implementation vs. this PR is 4min20s vs. 3min50s. So even when training on a small dataset with only 10 classes, there is still some improvement.

The PR also fixes the code style to pass Caffe's lint test (make lint).

mfs6174 avatar Oct 21 '16 11:10 mfs6174

@mfs6174, have you reproduced the results on LFW or MegaFace? I trained the model on CASIA and tested on LFW, but it didn't work well (EER ~96.5%).

jiangxuehan avatar Oct 24 '16 00:10 jiangxuehan

Hi, @jiangxuehan

I am still working on that with both paper author's code and my code.

I have only tested my PR's code with the MNIST toy example. Starting from the same snapshot and training data (no shuffling during training), my code produced exactly the same center_diff values and nearly the same test result as the author's code.

Which code did you train the model with, the paper author's code or my PR's code? If you can reproduce the result with the paper author's code but not with mine, I will check my code again. If you cannot reproduce the result with the paper author's code either, I will discuss it further with you when I finish my own experiments on reproducing the LFW result.

mfs6174 avatar Oct 24 '16 09:10 mfs6174

Hi @jiangxuehan and @mfs6174, I used MTCNN to get the 5-point landmarks and used his provided model. The accuracy on LFW view 2 is just 96.55 +/- 0.229129. If I use his provided features, the accuracy is 98.98 +/- 0.186685. If I use another method to get the 5-point landmarks with his provided model, the result is 98.75 +/- 0.194754. If I train from scratch using my 5-point landmarks, the result is 98.47 +/- 0.211986. I hope this gives you some guidance. Besides, I found that their setup cannot use dropout, as dropout makes the loss go to NaN. Do you have the same problem when dropout is applied?

Would you mind verifying two points? 1. Is the lambda in the paper the "loss_weight"? If so, where can I set alpha?

chichan01 avatar Oct 24 '16 10:10 chichan01

Hi, @mfs6174 @chichan01

To mfs6174: I used the paper author's code. Further discussion after you finish your experiments would be appreciated. (By the way, your optimized code is correct and faster.)

To chichan01: Your result seems reasonable. Did you use the same code/model/data as the author's? Have you made any changes other than the landmarks? I have tried using dropout for the 512-d fc layer, and the loss did not produce NaN values. I think the center loss acts as a regularization term, so it is more reasonable to compare softmax+dropout vs. softmax+center_loss. Have you tried softmax+dropout?

jiangxuehan avatar Oct 24 '16 11:10 jiangxuehan

Hi @jiangxuehan, I am only trying to reproduce their result at the moment, so I did not change anything. I did try to apply center loss to another architecture, but it seems to require tuning of loss_weight. Anyway, I will try to compare it with dropout.

chichan01 avatar Oct 24 '16 11:10 chichan01

@chichan01: Could you please send me your training log files (both the 98.47 run and the NaN run)? I want to compare the loss curves; they may provide some useful information. My email is [email protected]. Thanks.

jiangxuehan avatar Oct 24 '16 12:10 jiangxuehan

Hi, @jiangxuehan @chichan01

Regarding reproducing the face results, I also have some questions. When training the network, did you use the author's prototxt directly (center loss parameters: lr_mult = 1, decay_mult = 2, and loss_weight = 0.008), or did you change it following the description in the paper (where the parameters should be lr_mult = 5, so that alpha = 0.1 * 5 = 0.5, and loss_weight = 0.003)?
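For concreteness, the two variants being compared would look roughly like the following layer definition (field names follow Caffe's conventions and the caffe-face fork's center loss layer; the blob names and num_output value here are placeholders, not taken from the repo):

```
layer {
  name: "center_loss"
  type: "CenterLoss"
  bottom: "fc5"          # feature layer (placeholder name)
  bottom: "label"
  top: "center_loss"
  param {
    lr_mult: 1           # scales the center update; effective alpha ~ lr_mult * lr
    decay_mult: 2
  }
  center_loss_param {
    num_output: 10572    # number of training identities (placeholder)
    center_filler { type: "constant" value: 0 }
  }
  loss_weight: 0.008     # lambda in the paper
}
```

The question above is whether to ship this with lr_mult = 1 / loss_weight = 0.008 as in the released prototxt, or lr_mult = 5 / loss_weight = 0.003 as the paper suggests.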

mfs6174 avatar Oct 24 '16 17:10 mfs6174

Hi @jiangxuehan and @mfs6174, I directly used the author's prototxt, since their network in the prototxt is not the same as in their paper; therefore, I think following the description in the paper may not be right. Of course, if you test it based on their description, please let me know the result. Also, thank you for telling me how to get alpha. By the way, I only used a subset of the original CASIA-WebFace (not the cleaned version), with only 10,549 subjects, chosen for non-overlap with IJCB. Another point I would like to highlight is that all of my results use cosine distance without applying PCA. Perhaps applying PCA would improve the performance. However, the question is: which dataset did they use to train the PCA (LFW or CASIA)? To @jiangxuehan: I am sorry I am unable to give you the log files, as they were corrupted. Perhaps I can send them a few days later, if you do not mind.

By the way, were you able to reproduce their MegaFace result?

chichan01 avatar Oct 25 '16 00:10 chichan01

Hi @mfs6174 @chichan01 As mfs6174 mentioned, center = center - lr_mult * lr * d(center), so alpha = lr_mult * lr. With lr decreasing from 0.1 to 0.01/0.001, we would theoretically have to set lr_mult to 5/50/500 to keep alpha fixed. Besides, weight_decay should be set to 0. Another option is to rewrite the Backward function as center = center - alpha * d(center). If @chichan01 can reproduce the result with this PR, maybe alpha is not so important? (I have not done any experiments on alpha; the above is just my own understanding. Could @ydwen explain alpha for us? Thanks.)

To chichan01: I am still working on LFW, and running into some obstacles getting to ~99% accuracy. How about your experiment on MegaFace?
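The alpha = lr_mult * lr identity above can be checked in a few lines (a sketch under the assumption of plain SGD with momentum and weight decay off, which is when Caffe's parameter-blob update reduces to a scaled gradient step):

```python
def center_update(center, center_grad, lr, lr_mult):
    # Caffe applies the solver learning rate scaled by the blob's lr_mult,
    # so this step equals the paper's  c <- c - alpha * d(c)
    # with alpha = lr_mult * lr (momentum and weight decay assumed off).
    return center - (lr_mult * lr) * center_grad

# Keeping alpha = 0.5 fixed while the solver decays lr would require a
# growing lr_mult, which a static prototxt cannot express:
for lr, mult in [(0.1, 5), (0.01, 50), (0.001, 500)]:
    assert abs(mult * lr - 0.5) < 1e-12
```

This is why the paper's fixed-alpha scheme and Caffe's lr_mult scheme only coincide at one point in the learning-rate schedule.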

jiangxuehan avatar Oct 25 '16 01:10 jiangxuehan


@jiangxuehan @chichan01 @mfs6174 Sorry for replying this late; I have been quite busy these days. Thanks for the commit, @mfs6174, I will check it ASAP.

Here I try to answer some of your questions. If anything is not clear, please feel free to let me know.

Common issues:

Dropout: if you want to combine dropout with center loss, the current code may not support it. Concretely, if some of the elements of x are dropped, the corresponding elements of the center should be dropped as well; if not, the loss becomes unstable.
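The masking fix ydwen describes can be sketched as follows (my own NumPy illustration, not code from the repo): apply the sample's dropout mask to the difference, which zeroes the same coordinates of the center.

```python
import numpy as np

def center_loss_with_dropout(x, labels, centers, keep_mask):
    # keep_mask is the 0/1 dropout mask already applied to x. Masking the
    # difference drops the same coordinates of c_{y_i}; without this, the
    # dropped coordinates of the center enter ||x_i - c_{y_i}||^2 at full
    # magnitude and the loss becomes unstable.
    diff = (x - centers[labels]) * keep_mask
    return 0.5 * (diff ** 2).sum() / x.shape[0]
```

With an all-ones mask this reduces to the plain center loss, so the change only affects the dropout case.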

Alpha: At the beginning, we used the update strategy from the paper. It performed well in our experiments (deepidnet & mixed dataset). When refactoring the code, we found that implementing the alpha update in Caffe is not elegant and becomes complicated with multi-GPU. Finally, we tried lr and weight_decay in Caffe and found this very convenient; it works pretty well (achieving nearly the same performance as before). So you can try modifying lr and weight_decay for different alphas.

Network architecture: At the beginning we used the deepidnet, so we had to implement a local convolution layer. We are not going to release our implementation, since it is quite inelegant (^_^). Most importantly, the local convolution layer is complicated and inefficient in both time (~8h for a 28-layer ResNet vs. ~14h for a 6-layer deepidnet) and space (~100M for the 28-layer ResNet vs. ~200M for the 6-layer deepidnet). Therefore, we use a modified ResNet as our network. It performs better with fewer parameters and less training time.

To @chichan01: I guess the patches you used are not the same as in the demo or as used for the given model. Please double-check the positions of the eyes, nose, and mouth corners in the cropped face, and use our provided template (5-point landmarks) given in the demo.

To @chichan01: the loss weight is related to the total number of classes, i.e. num_output in fc6. Generally speaking, the more classes, the smaller the loss weight.

To @jiangxuehan: please provide more details of your experiment (96.5% EER); otherwise I can't give you any hints.

ydwen avatar Oct 27 '16 02:10 ydwen

Hi people, I have tried to train ydwen's prototxt from scratch using my own dataset of about 1.5M images of 13,650 subjects (each image resized to 112 pixels high, 96 pixels wide). However, the center loss is increasing (the softmax loss decreases fine). It decreases in the beginning, but after iteration 800 it starts to increase and never decreases again. The values (without the 0.008 multiplier), at the start and then every 100 iterations, look like this: 20, 3, 11, 3, 4, 3, 4, 8, 10, 12, 24, 24, 23, 30, 27, 35, 40, 42, 46, 58, .... I just stopped training then. What could be wrong with my settings or dataset? I already have other models (with softmax loss) that converged on the same dataset. Is center loss very sensitive to imbalance in the dataset?

kkirtac avatar Oct 27 '16 06:10 kkirtac

@chichan01

If I use other way to get 5pts landmarks and use his provide model, result is 98.75 +/-0.194754.

Could you tell me which method you used to get the 5-point landmarks, and how you use these 5 points? Just as in the author's demo code?

ZHAIXINGZHAIYUE avatar Oct 27 '16 12:10 ZHAIXINGZHAIYUE

@kkirtac There is nothing wrong with the center loss increasing during training. Just make sure the total loss (softmax loss + λ * center loss) is decreasing.

ydwen avatar Oct 27 '16 14:10 ydwen

Hi, guys,

I have nearly reproduced the paper author's LFW result with both the author's code and my code. The result on LFW is nearly 98.9% with the cleaned CASIA-WebFace dataset, using the author's network and solver prototxt. The only change is the step size of the learning rate decay, which improves loss stability during training.

With my Everphoto 750k-unique-person dataset, I found that modifying the usage and training scheme of the center loss can lead to possibly better results and much faster convergence, with more stable loss values, when training on a dataset with a very large number of classes. I will consider releasing the details after further experiments.

mfs6174 avatar Oct 28 '16 04:10 mfs6174

@ydwen Due to the increasing behavior of my center loss, my total loss fluctuates. With a proper configuration, the center loss should not be increasing, I think. How about your trainings? Do you see similarly increasing center loss values?

kkirtac avatar Oct 28 '16 05:10 kkirtac

@mfs6174, the performance of your trained model is very close to ydwen's features. If you apply PCA, you may approach 99.27%, as I followed their paper and projected ydwen's features into PCA space before computing the cosine angle.

@ydwen, @ZHAIXINGZHAIYUE and @twinsyssy1018, thank you for your responses; your comments cover basically all of my questions. Regarding the landmarks, I did refer to the 5-point landmarks you provided in extractDeepFeature.m for Jennifer_Aniston_0016.jpg (the provided template) and found that MTCNN cannot produce the same 5-point locations. Therefore, I use TCDCN (http://mmlab.ie.cuhk.edu.hk/projects/TCDCN.html) for extraction.

@ydwen, regarding dropout: it seems your model A in the paper did not apply dropout. I found that applying dropout with softmax does improve performance, and the performance of model A can come close to model C without PCA. I also found that model A suffers from overfitting if you increase the number of iterations, while model C does not. Another observation is that softmax with center loss works well in PCA space, but softmax (with or without dropout) does not.

Anyway, I will try to reproduce it on MegaFace in the coming week(s), as LFW is saturated and cannot show the real differences.
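The PCA-then-cosine scoring discussed above can be sketched as follows (my own NumPy illustration; which set to fit the PCA on, LFW or CASIA, is exactly the open question in this thread):

```python
import numpy as np

def fit_pca(feats, dim):
    # Fit PCA on a gallery of deep features: center them, then take the top
    # principal directions from the SVD of the centered matrix.
    mean = feats.mean(axis=0)
    _, _, vt = np.linalg.svd(feats - mean, full_matrices=False)
    return mean, vt[:dim].T          # columns of proj are principal axes

def cosine_score(f1, f2, mean, proj):
    # Project both features into PCA space, then compare by cosine angle.
    p1, p2 = (f1 - mean) @ proj, (f2 - mean) @ proj
    denom = np.linalg.norm(p1) * np.linalg.norm(p2) + 1e-12
    return float(p1 @ p2 / denom)
```

A pair is then accepted as the same identity when the score exceeds a threshold tuned on the verification protocol.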

chichan01 avatar Oct 28 '16 09:10 chichan01

@mfs6174 Hi, where can I get the cleaned CASIA dataset? I cannot find the cleaned version on the Internet.




yaxiongchi avatar Oct 28 '16 14:10 yaxiongchi

@twinsyssy1018, hi there, where can I get the cleaned CASIA dataset?



At 2016-10-28 18:45:52, "twinsyssy1018" [email protected] wrote:

@mfs6174, I trained on the cleaned CASIA dataset and reached 98.4% in the end. Waiting for the details.


yaxiongchi avatar Oct 30 '16 08:10 yaxiongchi

@chichan01 where exactly do you put the dropout?

kkirtac avatar Oct 31 '16 06:10 kkirtac

@chichan01, I also have a question: which dataset did you use to train the PCA? I used LFW to train the PCA and got 99.0%, not as good as your 99.27%.

twinsyssy1018 avatar Oct 31 '16 09:10 twinsyssy1018

@yaxiongchi try this: http://pan.baidu.com/s/1kUdRRJT password: 3zbb. I cannot promise it is the right version, but it is the one I use.

twinsyssy1018 avatar Oct 31 '16 09:10 twinsyssy1018

@ydwen

Hi, my dataset has about 4M images of 80,000 subjects. When I set loss_weight: 0.008 or loss_weight: 0.0005, the softmax_loss cannot decrease. When I set loss_weight: 0.0001, I get the training log below; the softmax_loss is decreasing, but the center_loss is still very high. I wonder when I can finish the training?

I1101 08:16:01.237038 6772 sgd_solver.cpp:106] Iteration 322000, lr = 1e-005
I1101 08:17:48.097226 6772 solver.cpp:228] Iteration 322100, loss = 0.640129
I1101 08:17:48.097226 6772 solver.cpp:244]     Train net output #0: center_loss = 4113.34 (* 0.0001 = 0.411334 loss)
I1101 08:17:48.097226 6772 solver.cpp:244]     Train net output #1: softmax_loss = 0.228794 (* 1 = 0.228794 loss)
I1101 08:17:48.112826 6772 sgd_solver.cpp:106] Iteration 322100, lr = 1e-005
I1101 08:19:34.614213 6772 solver.cpp:228] Iteration 322200, loss = 0.639561
I1101 08:19:34.614213 6772 solver.cpp:244]     Train net output #0: center_loss = 4207.33 (* 0.0001 = 0.420733 loss)
I1101 08:19:34.614213 6772 solver.cpp:244]     Train net output #1: softmax_loss = 0.218827 (* 1 = 0.218827 loss)
I1101 08:19:34.614213 6772 sgd_solver.cpp:106] Iteration 322200, lr = 1e-005
I1101 08:21:21.521200 6772 solver.cpp:228] Iteration 322300, loss = 0.529362
I1101 08:21:21.521200 6772 solver.cpp:244]     Train net output #0: center_loss = 3991.51 (* 0.0001 = 0.399151 loss)
I1101 08:21:21.521200 6772 solver.cpp:244]     Train net output #1: softmax_loss = 0.130211 (* 1 = 0.130211 loss)
I1101 08:21:21.521200 6772 sgd_solver.cpp:106] Iteration 322300, lr = 1e-005
I1101 08:23:08.209789 6772 solver.cpp:228] Iteration 322400, loss = 0.548019
I1101 08:23:08.209789 6772 solver.cpp:244]     Train net output #0: center_loss = 3857.7 (* 0.0001 = 0.38577 loss)
I1101 08:23:08.209789 6772 solver.cpp:244]     Train net output #1: softmax_loss = 0.162249 (* 1 = 0.162249 loss)

zjchuyp avatar Nov 01 '16 00:11 zjchuyp

@twinsyssy1018 Hi, training with the author's network I can also only reach 98.45% at best, about the same as you. Have you found the root cause yet? Which has the bigger impact, the alignment method or the network parameters?

gzp001015 avatar Nov 01 '16 01:11 gzp001015

@mfs6174 Please share your advice: how exactly should I train? I can only reach 98.45% at best, and so far I have only tuned the loss_weight.

gzp001015 avatar Nov 01 '16 01:11 gzp001015

@twinsyssy1018 would you mind giving me a link to the LFW dataset? I can't find it, as the links on the Internet are broken. Thanks.

duanLH avatar Nov 03 '16 02:11 duanLH

@ydwen, I want to change the λ; where should I do that?

duanLH avatar Nov 03 '16 03:11 duanLH

@zjchuyp I encountered the same problem as you when using a larger dataset. I intend to train for another 400,000 iterations.

westpilgrim avatar Nov 03 '16 06:11 westpilgrim

@westpilgrim @zjchuyp For large datasets, some modifications are needed to improve convergence. I have successfully trained a model with center loss on a large dataset (750k unique persons). The details will be released after further experiments. I suggest you try some normalization before feeding the feature vector into the softmax/center loss.
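One common form of such normalization (an assumption on my part; mfs6174's exact recipe is not released here) is to L2-normalize the feature and rescale it before the loss layers, which keeps the feature magnitudes bounded regardless of the number of classes:

```python
import numpy as np

def scale_l2_normalize(feats, scale=16.0, eps=1e-12):
    # Project each feature vector onto a hypersphere of radius `scale`.
    # The scale value is a hyperparameter to tune, not a recommendation.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return scale * feats / (norms + eps)
```

In a Caffe pipeline this would correspond to a normalization layer inserted between the feature layer and the softmax/center loss layers.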

mfs6174 avatar Nov 03 '16 06:11 mfs6174

@gzp001015 Hi, once the model finishes training, how do you test its accuracy?

getengqing avatar Nov 03 '16 09:11 getengqing