C3D-tensorflow
Performance difference compared to the paper
Hi
Thanks for providing the tensorflow version of C3D. You mention Top-1 accuracy of 72.6% on validation, while they report 85.2% in the paper (on test). How come the performance is so much lower? Is there such a big difference between validation and test? Or can it be explained with other reasons?
Hi @gyglim, I am actually not sure about this. The paper uses 3 nets with a linear SVM (liblinear) as the classifier, so maybe that is why the results differ.
Hi @hx173149
Certainly the 3 nets help a bit, but I can't imagine it to be 12%. Using linear SVM instead of fine-tuning should only decrease performance. Have you ever reached out to them to ask for details about the training procedure, splits and such that they use?
@gyglim I also can't believe that 3 nets can make a 10~12% difference... Actually, when I ran the original caffe implementation on my own machine (note that I used a batch size of 16 rather than 30 and kept the learning rate the same), I got 78.3% accuracy. Also, if you check the C3D User Guide provided by the author, it claims that you will get 80.1% accuracy after fine-tuning.
Another issue here: I also ran my own tensorflow implementation (not @hx173149's one; it gave very similar results to the original caffe implementation when training UCF-101 from scratch!!!) and I also got 72.9%. So I'm trying to investigate the issue right now. @hx173149 Which layers did you copy from the pre-trained model when you started fine-tuning? Also, what were your learning rate, batch size, step size, etc.?
That is strange. There was an issue with the initial pycaffe wrapper that the mean was off, see https://github.com/facebook/C3D/pull/59
might be that?
What I did to ensure that the output is identical (for the lasagne version https://github.com/Lasagne/Recipes/pull/41), is:
- run the caffe version to extract the input and the outputs of the later layers, then run the same input through my implementation and compare the outputs layer by layer.
It's hard to know what makes the difference otherwise. Might also be in the pooling, through different padding, or many other things.
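A minimal sketch of that layer-by-layer check, in case it helps. It assumes you have already dumped the activations of both implementations for the same input as flat lists keyed by layer name. The layer names, the values, and the tolerance below are made up for illustration; they are not taken from either code base.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat lists."""
    if len(a) != len(b):
        raise ValueError("size mismatch: %d vs %d" % (len(a), len(b)))
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def first_divergent_layer(reference, candidate, layer_order, tol=1e-4):
    """Return (layer_name, diff) for the first layer whose outputs differ
    by more than tol, or None if all listed layers agree."""
    for name in layer_order:
        diff = max_abs_diff(reference[name], candidate[name])
        if diff > tol:
            return name, diff
    return None


# Toy activations (hypothetical values): the port matches up to conv1a
# and starts to drift at pool1, e.g. because of different padding.
caffe_out = {"conv1a": [0.10, 0.20, 0.30], "pool1": [0.30, 0.25], "fc6": [1.00, 2.00]}
port_out = {"conv1a": [0.10, 0.20, 0.30], "pool1": [0.30, 0.40], "fc6": [1.10, 2.30]}

result = first_divergent_layer(caffe_out, port_out, ["conv1a", "pool1", "fc6"])
print(result)
```

Walking the layers in forward order matters here: the first layer that exceeds the tolerance is the one to inspect, since everything after it inherits the error.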
Cheers, Michael
I believe the way they extract the features is the key. The paper mentions that features are extracted from clips with an overlap of 8 frames and then averaged across the whole video, giving a single feature vector of dimension 4096 per video. These features are then fed to a linear SVM. Thus the performance boost (82%) comes, in my opinion, from the temporal aggregation (averaging).
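A rough sketch of that aggregation step, to make it concrete. The 16-frame clip length is my assumption (only the 8-frame overlap is stated above), the helper names are hypothetical, and the toy features are 2-dimensional instead of 4096. The SVM training itself is left out.

```python
def clip_starts(num_frames, clip_len=16, stride=8):
    """Start indices of clips of clip_len frames taken every stride frames.
    With clip_len=16 and stride=8, consecutive clips overlap by 8 frames."""
    return list(range(0, num_frames - clip_len + 1, stride))


def average_features(clip_features):
    """Average per-clip feature vectors into one video-level vector."""
    if not clip_features:
        raise ValueError("need at least one clip feature vector")
    n = len(clip_features)
    dim = len(clip_features[0])
    return [sum(f[i] for f in clip_features) / n for i in range(dim)]


# A 40-frame video yields clips starting at these frames.
starts = clip_starts(40)

# Toy per-clip features (hypothetical values standing in for network outputs).
video_descriptor = average_features([[1.0, 2.0], [3.0, 4.0]])
print(starts, video_descriptor)
```

The resulting per-video vector is what would be passed to the linear SVM, one vector per video rather than one per clip.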
By the way @hx173149 could you please provide the SPORTS1M model without the finetuning on UCF101?
Best, Luis.
Hey guys, the SPORTS1M model without the finetuning on UCF101 is now available, and I have more robust results in the new PR; you can check it on my GitHub. @gyglim @LuisTbx @ardasnck