
Optimal hyperparameter selection for triplet loss training

Open varun-parthasarathy opened this issue 5 years ago • 59 comments

Hi @davidsandberg! Thanks for your work on this repo!

When training with triplet loss using the recommended hyperparameters in the wiki, what kind of results were obtained? It'd be great if I could take a look at those results.

I'm also curious about what kind of performance others have gotten using triplet loss on VGGFace2. I can't seem to find the optimal hyperparameters that come even remotely close to the classifier model.

varun-parthasarathy avatar Apr 26 '19 07:04 varun-parthasarathy

Same question! I find the accuracy and val rate are far lower than what the paper reported.

douxiaotian avatar May 01 '19 21:05 douxiaotian

@douxiaotian what kind of results did you get? I'm currently trying out a cyclical learning rate with SGD, but it'll take a while to finish training. I'm planning to try out some other optimizers (like AdamW) instead once it's done. It might give better results that way.

varun-parthasarathy avatar May 03 '19 13:05 varun-parthasarathy

I got an accuracy of around 80% and a val rate lower than 10%. I think the problem might be lack of data; I'm not sure how Google got such a huge amount of data.

douxiaotian avatar May 04 '19 00:05 douxiaotian

Yeah, I think so too. I'm currently downloading the Deepglint dataset (Cleaned MS-Celeb + Asian Celeb; ~7 million images, ~180,000 identities) - my previous experiment with SGD failed miserably. I'll try to train using a CPU cluster; maybe I'll be able to increase the batch size that way.

varun-parthasarathy avatar May 04 '19 17:05 varun-parthasarathy

Hi, I recently tried to train the model with the triplet loss script, using only the CASIA-WebFace dataset (the cleaned version) for training, but validation on LFW comes out very low: around 10%~18% accuracy after more than 350 epochs. I did the same alignment for both datasets following the description in the wiki, and I tried two different optimizers, RMSProp and Adam; both give low validation on LFW while I get around 80%~88% accuracy on the training set. Does anyone have an idea how to solve this problem? Is it possible to fine-tune using the pretrained model with the softmax loss?

kifaw avatar May 07 '19 14:05 kifaw

I used these params to fine-tune and they worked great; the model actually overfitted, and I had to go back and pick an earlier epoch.

--keep_probability 0.6 \
--weight_decay 5e-4 \
--people_per_batch 720 \
--images_per_person 5 \
--batch_size 210 \
--learning_rate 0.002 \
--learning_rate_decay_factor 0.98 \
--learning_rate_decay_epochs 10 \
--optimizer MOM \
--embedding_size 512
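
For reference, these flags go to this repo's train_tripletloss.py. Here is a minimal sketch of the arithmetic they imply, assuming the repo's triplet batching, where each selection round embeds people_per_batch × images_per_person images and every triplet consumes three of them; the variable names just mirror the flags:

```python
# Sketch of how the sampling flags above interact (assumption: facenet samples
# people_per_batch * images_per_person images per triplet-selection round,
# then trains on batches of batch_size images).
people_per_batch = 720
images_per_person = 5
batch_size = 210

images_per_round = people_per_batch * images_per_person  # 3600 images embedded per selection round
assert batch_size % 3 == 0, "each triplet is (anchor, positive, negative)"
triplets_per_step = batch_size // 3                      # 70 triplets per training step
print(images_per_round, triplets_per_step)
```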

xlphs avatar May 07 '19 15:05 xlphs

@xlphs Did you train using the triplet loss script or the softmax script? And how did you come to choose those specific hyperparameters?

kifaw avatar May 07 '19 16:05 kifaw

@kifaw The triplet loss script, of course. I went through a bunch of old issues on triplet loss and ended up with those.

xlphs avatar May 07 '19 16:05 xlphs

@xlphs your results seem promising! Just to clarify, what dataset did you finetune on? Also, have you tried training from scratch at any point?

varun-parthasarathy avatar May 07 '19 16:05 varun-parthasarathy

@Var-ji I merged folders from VGGFace2 and the Deepglint Asian celebrity dataset. I tried training from scratch but it didn't work, so I ended up taking the softmax model and fine-tuning it with ArcFace loss and finally triplet loss. Triplet loss is easy to overfit, and I forgot to remove the overlap VGGFace2 has with LFW, so my accuracy of 99.7% on LFW is not that reliable. Then again, you're only down to a dozen or so failing pairs at that point, so IMHO it's not worthwhile to pursue higher LFW accuracy; better to try some other datasets. (Triplet loss gave something like a 0.1% accuracy increase after ArcFace loss.)

xlphs avatar May 07 '19 17:05 xlphs

@xlphs Thank you for sharing what you've done, I appreciate it! So it seems you used the pretrained model provided in this repo, is that right? And just to clarify: after removing the overlap a dataset has with LFW, how does validation on LFW work when none of the LFW identities were in the training set?

kifaw avatar May 07 '19 17:05 kifaw

@kifaw the idea of validation is to see how well the model generalizes to data it hasn't seen before, so if the overlap is still present, the results will be biased, since the model has, in fact, seen that data before. Because of this, validation accuracy will be higher than it should be.
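
For what it's worth, checking for overlap is straightforward when both datasets are laid out one folder per identity (as this repo's aligned datasets are). A minimal sketch, assuming both datasets use the same identity-naming convention (the paths are hypothetical):

```python
import os

# Hypothetical paths; both directories are assumed to contain one
# sub-directory per identity, using the same naming convention.
train_dir = "/data/train_aligned"
lfw_dir = "/data/lfw_aligned"

train_ids = set(os.listdir(train_dir))
lfw_ids = set(os.listdir(lfw_dir))

overlap = sorted(train_ids & lfw_ids)
print(f"{len(overlap)} identities overlap with LFW")  # remove these before training
```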

varun-parthasarathy avatar May 07 '19 17:05 varun-parthasarathy

I ran a learning rate range test a while back; the results are interesting. [Plot: loss vs. learning rate from the range test] Does this mean larger learning rates would perform well? Can someone clarify this? This was run with a batch size of 120 on VGGFace2, with people_per_batch=90, images_per_person=40, and SGDW with Nesterov momentum as the optimizer.

I also wanted to point out something I realized: the FaceNet paper only selects random semi-hard triplets for training, while the default method in this repo's code selects both semi-hard and hard triplets. Is it possible that this is what's leading to poor convergence?
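
To make the distinction concrete, here is a small numpy sketch of the two negative-selection rules for a single (anchor, positive) pair; the distances and margin are made-up numbers:

```python
import numpy as np

alpha = 0.2           # triplet margin
pos_dist_sqr = 0.5    # squared anchor-positive distance
neg_dists_sqr = np.array([0.3, 0.55, 0.65, 0.9])  # squared anchor-negative distances

# Default rule in this repo: any negative inside the margin, which also
# admits "hard" negatives that are closer than the positive (0.3 here).
default_sel = np.where(neg_dists_sqr - pos_dist_sqr < alpha)[0]

# FaceNet-paper rule: semi-hard only, i.e. farther than the positive but
# still inside the margin (0.55 and 0.65 here).
semi_hard_sel = np.where(np.logical_and(neg_dists_sqr - pos_dist_sqr < alpha,
                                        pos_dist_sqr < neg_dists_sqr))[0]
print(default_sel, semi_hard_sel)  # [0 1 2] vs. [1 2]
```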

varun-parthasarathy avatar May 07 '19 17:05 varun-parthasarathy

@Var-ji Thanks for clarifying. So when using the LFW dataset for validation, it compares identities within LFW itself, is that what it does? As for the plot you showed above, I find it strange to get better performance with a high learning rate, since I've always understood the learning rate should be a small value. Is there any explanation for that? Thanks again for your replies!

kifaw avatar May 07 '19 21:05 kifaw

@kifaw that's something I unfortunately don't understand myself. I'm running some more range tests right now using the FaceNet triplet selection method, but I find it strange that the learning rate can seemingly be increased up to 2 without causing too much fluctuation in training. One possibility is that because we keep selecting new triplets as we train, the loss values keep decreasing even as they become noisier and noisier.

varun-parthasarathy avatar May 08 '19 12:05 varun-parthasarathy

I guess there were some issues with the range test (I didn't run it for long enough). I ran it for about 20000 steps and got a more reasonable range of 0.075 to 0.4. I also got a chance to ask one of the authors of the paper about the hyperparameter settings, and he said that while he can't give any generic settings, the learning rate for triplet loss is generally always higher than what would be used for a softmax-based classifier.

varun-parthasarathy avatar May 16 '19 16:05 varun-parthasarathy

@Var-ji what do you mean by the range test? I tried training with triplet loss three times using the train_tripletloss.py script on the CASIA-WebFace dataset for more than 300 epochs, but it doesn't give any results in validation. Have you gotten any good results lately?

neklom avatar May 19 '19 21:05 neklom

@neklom the range test essentially involves slowly increasing the learning rate over time, while tracking loss vs. learning rate. At a certain value of the learning rate, loss falls drastically and then levels off. The range of learning rate values for which this drop is seen is the optimal range for training your network. I haven't gotten any good results lately. While I can get accuracies of 90+%, validation generally is about 20% at best.
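
For anyone who wants to try it, here is a self-contained sketch of the range-test loop on a toy quadratic objective; in practice train_step would be one optimizer step of the actual network:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10)

def train_step(lr):
    # One noisy gradient step on the toy loss ||w||^2; stands in for a
    # real training step on the network.
    grad = 2 * w + rng.normal(scale=0.1, size=w.shape)
    w[:] -= lr * grad
    return float(np.sum(w ** 2))

lr_min, lr_max, n_steps = 1e-5, 2.0, 500
lrs = lr_min * (lr_max / lr_min) ** np.linspace(0.0, 1.0, n_steps)  # exponential sweep

losses = []
for lr in lrs:
    losses.append(train_step(lr))
    if losses[-1] > 4 * min(losses):  # stop once the loss clearly diverges
        break
# Plot losses against lrs on a log x-axis; the steep-descent region
# brackets the usable learning-rate range.
```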

varun-parthasarathy avatar May 20 '19 14:05 varun-parthasarathy

@Var-ji thanks for the explanation. I'm facing the same issue with training: I get more than 86% accuracy on the training set using CASIA-WebFace (aligned as described in the wiki), and about the same on LFW, but only ~11% to ~18% on validation, and I don't understand what the problem is. I hope someone can help us with this!

neklom avatar May 21 '19 00:05 neklom

@Var-ji That validation rate can't be right. I've trained many times, and the val rate is always slightly below accuracy once accuracy goes above 0.9; if accuracy is 0.99, the val rate might be 0.98. You should examine the actual values of your embeddings; perhaps they became extremely small.
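
A quick way to follow that advice: dump a batch of embeddings during evaluation and look at their statistics. A minimal sketch (the array shape and the dummy data are assumptions; a collapsed model shows near-zero spread in pairwise distances):

```python
import numpy as np

def embedding_stats(emb):
    # emb: [n, d] array of embeddings for one evaluation batch.
    norms = np.linalg.norm(emb, axis=1)
    dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    iu = np.triu_indices(len(emb), k=1)  # unique pairs only
    print(f"norms:          mean={norms.mean():.4f}  min={norms.min():.4f}")
    print(f"pairwise dists: mean={dists[iu].mean():.4f}  std={dists[iu].std():.4f}")

# A healthy L2-normalized model has norms of exactly 1 and a wide spread of
# pairwise distances; collapsed embeddings show a distance std near 0.
embedding_stats(np.random.default_rng(0).normal(size=(32, 128)))  # dummy data
```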

xlphs avatar May 21 '19 02:05 xlphs

@xlphs From my experience, training seems to become unstable once accuracy crosses 0.9: the validation rate starts fluctuating wildly between 0.2 and 0.5. I generally stop training at that point; however, it's possible the gradient just becomes quite bumpy near the optimal minima. I'll try continuing training beyond this point and see what happens.

varun-parthasarathy avatar May 21 '19 09:05 varun-parthasarathy

@xlphs did you get those results with the parameters you provided above, and using ArcFace loss too? And what about alignment, how did you align the data? Can you please share your final code with us? I've been trying to solve this problem for weeks. Thank you!

neklom avatar May 21 '19 11:05 neklom

Here are the TensorBoard LFW graphs from the ArcFace loss run; I don't have logs for the others anymore, but triplet loss looks similar enough. I use the training scripts from this repo; the code is good enough, especially train_tripletloss.py. Notice that it doesn't use tricks like random flipping or mean subtraction to boost LFW accuracy.

[Screenshot: TensorBoard LFW accuracy graphs, Jan 28 2019]

xlphs avatar May 21 '19 16:05 xlphs

Thank you @xlphs

neklom avatar May 28 '19 01:05 neklom

Training from scratch with triplet loss gives an accuracy of about 92.5% (similar to OpenFace), while the validation rate tends to vary between 35% and 40%, even after 800k iterations. I guess with small batch sizes this is the maximum accuracy that can be reached.

varun-parthasarathy avatar May 30 '19 08:05 varun-parthasarathy

@Var-ji As I remember from the VGGFace2 paper, they trained their model from scratch with softmax loss first and then fine-tuned it with triplet loss. Would that be effective? I've also read that triplet loss needs training for many epochs, more like 1000 or more, and that that's why it doesn't reach good accuracy on the validation set; could that be the case? And what do you mean by iterations here?

kifaw avatar Jun 01 '19 03:06 kifaw

From what I've read so far, training with softmax and then fine-tuning with triplet loss can be very effective; the problem is that when you have a large number of classes, training with softmax becomes problematic. If you train with softmax on VGGFace2 and then fine-tune on your own dataset, it should be fine, although I haven't tested this yet.

From my experiments, I think increasing the embedding size can boost triplet loss performance. While the paper showed decreasing performance with increasing embedding size, I think there's a trade-off between dataset size and embedding dimensionality: when the dataset is small, it's better to capture a larger number of partially relevant features than a small number of highly relevant ones. If you want to use a small embedding size, you'd have to train for much longer, but that brings the risk of over-fitting.

varun-parthasarathy avatar Jun 01 '19 03:06 varun-parthasarathy

I'll try using a larger embedding size; maybe it will be effective, as you said. But OpenFace used a 128D embedding vector on CASIA-WebFace and still got good accuracy, so could it be a matter of the architecture used?

kifaw avatar Jun 01 '19 12:06 kifaw

I don't really think so - I was able to replicate the OpenFace results (as I mentioned 2 days ago) using a 128D embedding; however, it did take nearly 800,000 iterations to reach that point. I'm currently training with a 512D embedding, and it reached 91% accuracy in only 60,000 iterations and can be expected to improve further from there. I would recommend using a cyclic learning rate, as it allows you to explore the loss surface and find potentially better minima.
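
For reference, here is a minimal sketch of the triangular cyclical learning rate schedule (Smith's CLR); the bounds reuse the 0.075-0.4 window from the earlier range test, and the cycle length is an arbitrary assumption:

```python
def triangular_clr(step, lr_low=0.075, lr_high=0.4, cycle_steps=10000):
    # Linearly ramp from lr_low up to lr_high and back down over one cycle.
    pos = (step % cycle_steps) / cycle_steps   # position within the cycle, [0, 1)
    scale = 1.0 - abs(2.0 * pos - 1.0)         # 0 -> 1 -> 0 across the cycle
    return lr_low + (lr_high - lr_low) * scale

# e.g. feed this into the learning-rate placeholder at each training step:
# lr = triangular_clr(global_step)
```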

varun-parthasarathy avatar Jun 01 '19 12:06 varun-parthasarathy

I'll try your suggestions, thank you very much !

kifaw avatar Jun 02 '19 13:06 kifaw