
Worse performance in AoA model

Open zhangxuying1004 opened this issue 4 years ago • 24 comments

Hi @ruotianluo. Thanks for your codebase. I trained the AoA model using your code and your parameter configs, but the performance I get is worse than the paper reports. In the XE training stage I use ./configs/aoa.yml and get 1.120 CIDEr; in the self-critical stage I use ./configs/aoa_nsc.yml and only get 1.253 CIDEr (the paper reports 1.298). Could you explain what happened? Thank you.
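For context, the two stages mentioned above are typically launched like this in self-critical.pytorch (the entry point and flags follow the repo's README; the --id values here are my own placeholders, and the checkpoint-copy step may be needed before the self-critical stage as the README describes):

```shell
# XE (cross-entropy) training with the AoA config
python tools/train.py --cfg configs/aoa.yml --id aoa

# Self-critical training, warm-started from the XE checkpoint
python tools/train.py --cfg configs/aoa_nsc.yml --id aoa_nsc
```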

zhangxuying1004 avatar Sep 10 '20 03:09 zhangxuying1004

I actually found that too. As I remember, the difference seems to be caused by this commit: https://github.com/ruotianluo/self-critical.pytorch/commit/30a0e7b4b572b1a48d64d2f7f3574493fd3c7d56.

If you check out the commit just before it, the result should be close.

However, I don't know why. Since it didn't affect other models such as the Transformer, I gave up debugging it.
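A minimal way to try the rollback suggested above, assuming a fresh clone (the hash is the commit linked above; the trailing caret selects its parent, i.e. the state just before the change):

```shell
git clone https://github.com/ruotianluo/self-critical.pytorch.git
cd self-critical.pytorch
# check out the parent of the suspect commit
git checkout 30a0e7b4b572b1a48d64d2f7f3574493fd3c7d56^
```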

ruotianluo avatar Sep 10 '20 03:09 ruotianluo

@ruotianluo I have tested the performance many times based on AoA's original code. However, I found that I couldn't get close to the CIDEr score they mention in the paper (118.3 vs. 119.8). Can you give me some advice?

upccpu avatar Sep 12 '20 12:09 upccpu

I forget whether I ever got 119.8. I mostly looked at the self-critical performance. You may want to ask the authors.

ruotianluo avatar Sep 12 '20 14:09 ruotianluo

@ruotianluo Thank you for your reply. My English is not good, so I switched to Chinese. I used the source code he provided (keeping the same parameters as the paper) and also tested the results after SCST, but the results were not as good as in his paper (1.5 to 2 points lower). I left the author a message on GitHub but he has not replied. Since you are an authority in this area, I wanted to ask whether you could suggest some possible causes. Many thanks for this project; it has helped me a great deal over the years.

upccpu avatar Sep 12 '20 14:09 upccpu

How many times have you run it? Let me check my earlier runs.

ruotianluo avatar Sep 12 '20 14:09 ruotianluo

Your result is about the same as what is reported in his repo.

ruotianluo avatar Sep 12 '20 14:09 ruotianluo

@ruotianluo I ran his source code 4 times; the best I got was around 128.3 (with beam size set to 3). I have now added some of my own work on top of his source code and can reach 129.6, but that is still not as good as his paper's result. So I can't bring myself to write up my paper; it feels sloppy to do so.

upccpu avatar Sep 12 '20 14:09 upccpu

@ruotianluo The results in his paper are quite a bit higher than those on GitHub, up to 129.8. The explanation he gave others was random initialization, but I don't believe that could account for such a gap.

upccpu avatar Sep 12 '20 14:09 upccpu

Did the paper use the same schedule as the repo? Maybe he trained longer for the paper??

ruotianluo avatar Sep 12 '20 15:09 ruotianluo

@ruotianluo I have read the AoA paper many times over and am fairly sure my schedule matches his, with the same training length. Only two things are unclear: first, the paper does not state the beam size, and I set it to 3; second, I am not sure whether he selected the checkpoint that performs best on the test split, or only evaluated the val-best checkpoint on test.

upccpu avatar Sep 13 '20 01:09 upccpu

I don't know either.

ruotianluo avatar Sep 14 '20 20:09 ruotianluo

@upccpu 1. The beam size is 2. 2. Make sure your vocabulary size is 10369 (which differs from self-critical.pytorch's default); with those settings you should be able to get close to the numbers in the repo. The repo's code went through a reorganization, so its numbers differ somewhat from the paper: some scores went down (METEOR, CIDEr) while others went up (BLEU, ROUGE, SPICE). If your experiments are built on the repo's code, I suggest comparing against the repo's numbers.
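As a quick sanity check on the vocabulary setting, one could read the size back from the preprocessed info json. This sketch assumes the self-critical.pytorch layout, where the json produced by preprocessing carries an 'ix_to_word' mapping (the key name and the data/cocotalk.json path are assumptions from that repo's conventions; adjust if yours differ):

```python
import json

def vocab_size(info_json_path):
    """Return the vocabulary size stored in a prepro info json.

    Assumes the json carries an 'ix_to_word' dict mapping word indices
    to words, as in self-critical.pytorch's preprocessing output.
    """
    with open(info_json_path) as f:
        info = json.load(f)
    return len(info['ix_to_word'])

# e.g. vocab_size('data/cocotalk.json') should come out as 10369
# for the AoA setting described above
```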

husthuaan avatar Sep 22 '20 01:09 husthuaan

@husthuaan Thank you for your reply. My settings were kept consistent with the paper (beam size = 2, vocabulary = 10369). After several experiments I did indeed get results similar to what you describe. One more question: would beam size 3 or 5 give better results?

upccpu avatar Sep 22 '20 01:09 upccpu

@upccpu In my experiments, beam size = 2 was better.

husthuaan avatar Sep 22 '20 01:09 husthuaan

@husthuaan OK, understood! Your paper is excellent; my group asked me to follow it, and some issues came up along the way. Thanks for your reply!

upccpu avatar Sep 22 '20 02:09 upccpu

@upccpu You're welcome, and sorry for not replying sooner. If you have more questions later, feel free to continue the discussion.

husthuaan avatar Sep 22 '20 02:09 husthuaan

@husthuaan ok!

upccpu avatar Sep 22 '20 02:09 upccpu

@upccpu Hi, have you solved this problem? I made some improvements to the AoA model in @ruotianluo's code. In XE training the results are slightly better than the original repo, but with RL optimization the results are very poor; B-1 tops out at 79.8. What parameter settings did you use for RL optimization?

WeitaoJ avatar Oct 29 '20 13:10 WeitaoJ


I also see this issue. After RL, AoA achieves a CIDEr of 125.5, and with XE it achieves 115.1.

qingzwang avatar Feb 03 '21 06:02 qingzwang

@qingzwang I just can't find the bug, ugh. Better to use the author's code... It seems to be something I changed in https://github.com/ruotianluo/self-critical.pytorch/commit/30a0e7b4b572b1a48d64d2f7f3574493fd3c7d56, but I never figured out exactly where I went wrong.

ruotianluo avatar Feb 03 '21 06:02 ruotianluo


The author's code is also based on your earlier self-critical code; I didn't find many differences. This is hard.

qingzwang avatar Feb 03 '21 06:02 qingzwang

Could you take a look for me? I may be blind to it after staring at my own code. I am fairly sure that commit is the cause; you can try it yourself: roll back to just before that commit and the results come out right.

ruotianluo avatar Feb 03 '21 06:02 ruotianluo


Let me compare the two versions first.

qingzwang avatar Feb 03 '21 06:02 qingzwang

@ruotianluo @upccpu @zhangxuying1004 I have tested both the publicly available code released by the author and this repository. Both reach a CIDEr of 118.0, but neither achieves the 119.8 reported in the paper.


aoanet refers to the publicly available code released by the author, and aoanet2 refers to this repository. The settings are the same as in the AoA paper. Beam size = 2 gives the best results, and both codebases reach a CIDEr of 118.0, but there is a gap between these results and those in the AoA paper. By the way, I select the model that performs best on Karpathy's validation split (5000 images) and evaluate it on Karpathy's test split (5000 images).

I guess that to reach a CIDEr of 119+ you may have to select the model that performs best on the test split, but you should NOT do that if you want a fair comparison.
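The val-based selection described above amounts to a one-liner; the checkpoint names and scores below are hypothetical, standing in for per-checkpoint CIDEr measured on the Karpathy validation split:

```python
def select_checkpoint(val_cider):
    """Pick the checkpoint with the best CIDEr on the validation split.

    val_cider maps a checkpoint identifier to its validation CIDEr.
    Selecting on val (never on test) keeps comparisons fair.
    """
    return max(val_cider, key=val_cider.get)

scores = {'iter_90k': 1.281, 'iter_100k': 1.287, 'iter_110k': 1.284}
best = select_checkpoint(scores)  # -> 'iter_100k'
```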

qingzwang avatar Feb 08 '21 08:02 qingzwang