
the caption is incomplete

Open whqwill opened this issue 3 years ago • 22 comments

I use https://github.com/peteanderson80/bottom-up-attention/ for feature extraction on my own images and then run the image captioning model, but the resulting captions are incomplete.

e.g. image

caption: "a view of a city with a building in the"

image

caption: "a view of a city with a view of a river and a"

image

caption: "a woman in a yellow dress walking on a"

It seems the result is truncated.

whqwill avatar Mar 25 '21 09:03 whqwill

I also did the same thing on the COCO dataset, and some results have the same problem:

000000000090.jpg image

a field with a tree and a cow grazing in

000000000128.jpg image

an elephant standing next to a box on a

000000000180.jpg image

a black bear walking in the grass next to a

whqwill avatar Mar 25 '21 09:03 whqwill

Did you solve this problem?

batooooo avatar Apr 01 '21 08:04 batooooo

I have the same problem

syyyyyw avatar May 11 '21 14:05 syyyyyw

I have the same problem. How many ROI feature instances did you use as input? I used 36.

vongkhmer avatar May 29 '21 04:05 vongkhmer

Update: I changed the number of ROIs to 100 and got a complete caption, but it is less accurate. I don't know what the recommended number of ROIs is.

This is the caption when I used 36 ROIs:

'a', 'man', 'is', 'riding', 'a', 'horse', 'in', 'a', ' ', ' ', ' ', ' ', ' '

This is the caption when I used 100 ROIs:

'two', 'people', 'riding', 'horses', 'in', 'a', 'group', 'of', 'people'

input

vongkhmer avatar May 29 '21 04:05 vongkhmer

Did you use their pretrained model, or did you re-train it on your own? I faced the same problem, but I think their model works correctly only if you use exactly the same model and settings for feature extraction (there are several in the bottom-up-attention repo), since it was trained with them. If you just increase the number of ROIs you also add noise to the image representation, so I think it's normal that the caption is less accurate.

In the following days I'm planning to train the model with my own feature extraction models and see what happens. I'll keep you updated.

eugeniotonanzi avatar Jun 04 '21 09:06 eugeniotonanzi

Can you please tell me how to get the captions from this image captioning model? I couldn't find any caption output in my folder.

Doris1887 avatar Jun 13 '21 08:06 Doris1887

Hello,

I trained a model with the default parameters and also noticed the same issue. The pretrained model that is available from the link in the description of the repo seems to also produce incomplete captions. I did some digging and I believe that the issue stems from the implementation of optimization with the self-critical loss. According to the authors of the self-critical loss paper (Appendix, section E):

One detail that was crucial to optimizing CIDEr to produce better models was to include the EOS tag as a word. When the EOS word was omitted, trivial sentence fragments such as “with a” and “and a” were dominating the metric gains, despite the `gaming' counter-measures (sentence length and precision clipping) that are included in CIDEr-D [13], which is what we optimized. Including the EOS tag substantially lowers the reward allocated to incomplete sentences, and completely resolved this issue. Another more obvious detail that is important is to associate the reward for the sentence with the first EOS encountered. Omitting the reward from the first EOS fails to reward sentence completion which leads to run-on, and rewarding any words that follow the first EOS token is inconsistent with the decoding procedure.

In my case, I noticed that all incomplete captions were missing a reference to a noun (possibly with an adjective), just like the above examples. From my understanding, it appears that the model is reluctant to produce that noun, and the learnt policy indicates that it is better to generate an incomplete caption and receive the adjusted reward than to make a 'risky' prediction. The solution, just like the authors said, was simply to include the EOS token in both candidate and reference captions.

I simply defined the add_eos boolean variable to distinguish between decoding during RL optimization and decoding during evaluation, and modified the loop inside tokenizer.py:

  # create dictionary for tokenized captions
  for k, line in zip(image_id, lines):
      if k not in tokenized_corpus:
          tokenized_corpus[k] = []
      tokenized_caption = ' '.join([w for w in line.rstrip().split(' ') \
                                    if w not in cls.punctuations])
      if add_eos:
          tokenized_caption += " {}".format(cls.eos_token)
      tokenized_corpus[k].append(tokenized_caption)
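
To make the mechanics concrete, here is a self-contained toy version of that loop. ToyTokenizer, the tiny punctuation set, and the example captions are made up for illustration, and the real PTBTokenizer in evaluation/tokenizer.py additionally runs the Stanford PTB tokenizer, which the toy skips. The point is just that add_eos is switched on when tokenizing candidate and reference captions for the SCST reward, and left at its default (False) for ordinary metric evaluation:

  # Self-contained toy version of the loop above (illustration only; the real
  # PTBTokenizer also shells out to the Stanford PTB tokenizer).
  class ToyTokenizer:
      punctuations = {',', '.', '!', '?', ';', ':'}
      eos_token = '<eos>'

      @classmethod
      def tokenize(cls, corpus, add_eos=False):
          # corpus: dict mapping an image id to a list of caption strings
          tokenized_corpus = {}
          for k, captions in corpus.items():
              tokenized_corpus.setdefault(k, [])
              for line in captions:
                  tokenized_caption = ' '.join(w for w in line.rstrip().split(' ')
                                               if w not in cls.punctuations)
                  if add_eos:
                      tokenized_caption += " {}".format(cls.eos_token)
                  tokenized_corpus[k].append(tokenized_caption)
          return tokenized_corpus

  # add_eos=True only when tokenizing candidates and references for the SCST reward;
  # the default add_eos=False keeps ordinary metric evaluation unchanged.
  print(ToyTokenizer.tokenize({0: ['a man riding a horse in a .']}, add_eos=True))
  # {0: ['a man riding a horse in a <eos>']}
  print(ToyTokenizer.tokenize({0: ['a man riding a horse in a field .']}))
  # {0: ['a man riding a horse in a field']}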

gpantaz avatar Jul 05 '21 08:07 gpantaz

I use https://github.com/peteanderson80/bottom-up-attention/ for feature extraction on my own images and then run the image captioning model, but the resulting captions are incomplete. [...] It seems the result is truncated.

Hello, how do you get the image together with its caption? I mean, running 'test.py' I only get a series of scores as output. I can find captions in the variable 'gts', but they are not matched with images.

Nonmy avatar Sep 25 '21 14:09 Nonmy

I have the same problem. How many ROI feature instances did you use as input? I used 36.

Hello, I would like to ask: which .py file contains the code that uses the model to output a caption for a single image, and how do I use it?

z972778371 avatar Mar 05 '22 01:03 z972778371

I use https://github.com/peteanderson80/bottom-up-attention/ for feature extraction on my own images and then run the image captioning model, but the resulting captions are incomplete. [...] It seems the result is truncated.

Hello, I would like to ask which Python file contains the code that outputs the caption for an image with a trained model, and how to use it.

z972778371 avatar Mar 05 '22 01:03 z972778371

I trained a model with the default parameters and also noticed the same issue. [...] I simply defined the add_eos boolean variable to distinguish between decoding during RL optimization and decoding during evaluation, and modified the loop inside tokenizer.py.

I got "AttributeError: type object 'PTBTokenizer' has no attribute 'eos_token'". How can i get 'eos_token'? what else needs to be changed?

DogWealth avatar Mar 21 '22 09:03 DogWealth

I got "AttributeError: type object 'PTBTokenizer' has no attribute 'eos_token'". How can i get 'eos_token'? what else needs to be changed?

Hey, you will need to define the eos_token in the definition of the PTBTokenizer. I think that the default eos token used in train.py is '<eos>': https://github.com/aimagelab/meshed-memory-transformer/blob/e0fe3fae68091970407e82e5b907cbc423f25df2/train.py#L164

So simply add eos_token = "<eos>" and you should be good to go
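
In other words, a minimal sketch of the addition (assuming the tokenize loop from my earlier comment lives in a plain class in evaluation/tokenizer.py; everything else in the class stays as it is):

  # evaluation/tokenizer.py (sketch): add the class attribute next to the existing
  # ones, so that cls.eos_token resolves inside the tokenize loop shown above.
  class PTBTokenizer(object):
      # ... existing attributes such as punctuations stay unchanged ...
      eos_token = '<eos>'  # must match the eos_token passed to the TextField in train.py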

gpantaz avatar Mar 21 '22 09:03 gpantaz

Thanks! It's helpful!

DogWealth avatar Apr 11 '22 07:04 DogWealth

@gpantaz Hello, I see that this loop is used in the DLCT model, but the captions are still incomplete. Are your results complete? Thanks!

Baixiaobai201619707 avatar Jul 18 '22 07:07 Baixiaobai201619707

@gpantaz Hello, I see that this loop is used in the DLCT model, but the captions are still incomplete. Are your results complete? Thanks!

Hello, sadly I am not familiar with the DLCT model. I had the same issue with incomplete captions after SCST training, but the above fix worked for me. Does the DLCT model use the same evaluation code? Maybe it uses a different end-of-sequence token?

gpantaz avatar Jul 18 '22 07:07 gpantaz

Sorry, I missed some code: 'add_eos' is not used in the DLCT model; it lacks the if statement (if add_eos: tokenized_caption += " {}".format(cls.eos_token)). Thanks!

Baixiaobai201619707 avatar Jul 18 '22 08:07 Baixiaobai201619707

Hello, may I ask how the variable 'add_eos' is actually defined? Also, is the eos_token variable in the picture above the same one? Thanks!

Baixiaobai201619707 avatar Jul 18 '22 13:07 Baixiaobai201619707

@gpantaz I have a doubt: how do you define 'add_eos'? I also got the error "AttributeError: type object 'PTBTokenizer' has no attribute 'eos_token'", and I don't understand how to solve it from the answers above. Thanks!

YairCCastillo avatar Aug 22 '22 01:08 YairCCastillo

@gpantaz Excuse me, I want to reproduce the visualization results, but I cannot find the corresponding code in this repo. Can you please tell me how to do it?

buproof avatar Nov 11 '22 04:11 buproof

Can you please tell me how to get the captions from this image captioning model? I couldn't find any caption output in my folder. Have you solved it yet?

krl24 avatar Apr 24 '23 07:04 krl24

I simply defined the add_eos boolean variable to distinguish between decoding during RL optimization and decoding during evaluation, and modified the loop inside tokenizer.py. [...]

Great insight! I'm curious about when to set add_eos=True or False. Thanks for answering!

SydCS avatar Aug 19 '23 04:08 SydCS