
How to get a fine-tuned image-captioning Hugging Face-version OFA model?

Open PhoebusSi opened this issue 1 year ago • 19 comments

How could I get the fine-tuned image-captioning OFA model in the Hugging Face version, i.e. the one that topped the MSCOCO Image Caption Leaderboard?

PhoebusSi avatar Jul 19 '22 13:07 PhoebusSi

Or can we obtain the results reported in the paper (which topped the COCO Leaderboard) by using the current OFA-huge (released on Hugging Face at https://huggingface.co/OFA-Sys/OFA-huge) without finetuning?

PhoebusSi avatar Jul 19 '22 13:07 PhoebusSi

Sorry for my late response. To reach a similar score on the leaderboard, you should use the finetuned checkpoint. We have already released the huge ckpt of captioning for the original code, and we'll soon release the one for HF transformers, I guess this week.

JustinLin610 avatar Jul 25 '22 02:07 JustinLin610

Is it possible to load the finetuned checkpoints using the huggingface model definition (e.g. from transformers import OFAModel)? The uploaded files only include the model checkpoint, not the model config. I tried copying over the config from OFA-base, but got some warnings about uninitialized layers and the generated captions are gibberish. Thanks!

steve-marmalade avatar Aug 15 '22 15:08 steve-marmalade

No, the model definition is not the same as HF's; you basically start from a randomly initialised network.

NohTow avatar Aug 24 '22 08:08 NohTow

Hi @JustinLin610, congrats on the awesome work.

and we'll soon release the one for HF transformers,

Do you have any update on this?

@NohTow:

No, the model definition is not the same as HF's,

So does it mean that they are artifacts of incompatible architectures, or is it possible to map parameter names from the Fairseq version to the HF Transformers version?

I'd like to help with it if it's possible.

monatis avatar Sep 09 '22 16:09 monatis

I think it should be possible to translate the fairseq version to an HF one (since I guess that is what they did for the other regular versions), but I am not sure, nor do I know how to do this.

NohTow avatar Sep 09 '22 17:09 NohTow

OK, I quickly had a look at it, and for the base-sized checkpoint ~100 parameters out of 965 have different names. I'll try to make a conversion tomorrow --it's definitely not a job for a Friday night 😬

monatis avatar Sep 09 '22 19:09 monatis

I managed to convert the weights from the Fairseq version to a Transformers-compatible one. Here's a PR on the HF Hub for the base version --I'm also making another PR for the large size shortly.

@JustinLin610

monatis avatar Sep 10 '22 10:09 monatis
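
For readers who just want the gist of such a conversion: it boils down to loading the Fairseq checkpoint, renaming the mismatched state-dict keys, and saving the tensors in the layout from_pretrained() expects. A minimal sketch, with purely illustrative key prefixes (the real mapping is in the conversion notebook shared further down in this thread):

import torch

# Fairseq checkpoints usually store the weights under the "model" key.
fs_ckpt = torch.load("caption_base_best.pt", map_location="cpu")  # illustrative file name
fs_state = fs_ckpt["model"]

# Illustrative prefix renames only -- the actual ~100-key mapping lives in the notebook.
RENAMES = {
    "encoder.embed_images.": "encoder.embed_patches.",  # hypothetical example
    "decoder.output_projection.": "decoder.lm_head.",   # hypothetical example
}

hf_state = {}
for name, tensor in fs_state.items():
    for old, new in RENAMES.items():
        if name.startswith(old):
            name = new + name[len(old):]
            break
    hf_state[name] = tensor

# Save next to a config.json of the matching model size so from_pretrained() can load it.
torch.save(hf_state, "OFA-base-caption/pytorch_model.bin")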

There's no HF model repo for OFA-large-caption, so I couldn't make a PR. Instead, I uploaded the converted model to my storage for public download.

monatis avatar Sep 10 '22 11:09 monatis

Very cool, thank you very much! I'll test it on Monday and try to reproduce the paper results.

Converting model weights always seems a bit tricky but doable; do you mind sharing some resources/insights on how you did this? It has to do with named parameters, but if you have any useful inputs on that, it would be awesome!

But again, thank you already for doing it for this model!

NohTow avatar Sep 10 '22 15:09 NohTow

I'll test it on Monday and try to reproduce the paper results.

Cool. I'd like to hear about the results of your tests as well. I'll also share the converted huge-sized captioning model --it's the model that topped the COCO Leaderboard.

if you have any useful inputs on that, it would be awesome

Sure. I'll share my code with some explanatory notes on Monday.

monatis avatar Sep 10 '22 16:09 monatis

And here comes the huge-sized image-captioning model. Download, zipped, 2.39 GB.

monatis avatar Sep 12 '22 08:09 monatis

And here's the Colab notebook with explanations that I used for conversion.

monatis avatar Sep 12 '22 09:09 monatis

And here's the Colab notebook with explanations that I used for conversion.

Thank you very much, I'll bookmark it so I know how to do this kind of translation in the future!

I managed to convert the weights from the Fairseq version to a Transformers-compatible one. Here's a PR on the HF Hub for the base version --I'm also making another PR for the large size shortly.

So I tried using these weights (each version, from base to huge) in my own loop based on the HF examples, and it seems like the model is loaded correctly:

[INFO|modeling_utils.py:1703] 2022-09-12 14:51:32,183 >> All model checkpoint weights were used when initializing OFAModel.

[INFO|modeling_utils.py:1711] 2022-09-12 14:51:32,183 >> All the weights of OFAModel were initialized from the model checkpoint at /nfs/nas4.irisa.fr/deepfakes/deepfakes/image_repurposing/multimodal-fake-news-detection/VisualNews/OFA/weights/OFA-base-caption.
If your task is similar to the task the model of the checkpoint was trained on, you can already use OFAModel for predictions without further training.

Also, the resulting model seems able to generate OK-ish captions, so I guess we can confirm that the weights are indeed imported correctly (it would produce total gibberish otherwise).

For example, it generates "<s> an airplane is parked on the runway at an airport}})\\)\\)\\)\\)\\)\\</s>" when one reference is "the airplane has landed behind a fence with barbed wire", so the model definitely works as intended. Yet, as you might have noticed, at the end of the caption it often (nearly always?) generates gibberish before the EOS token.

I remember this issue, https://github.com/OFA-Sys/OFA/issues/172, which seems very related. However, it should not happen on an "official version" of the model, so it's weird. Note that this only happens with the fine-tuned versions, not the "official HF versions" available, which are not fine-tuned for captioning.

Edit: I saw in your notebook that you use a resolution (patch image size) of 480. Yet I find 224 here: https://github.com/OFA-Sys/OFA/blob/a7b0805d36efbc61c923d635ae75a5840c165f29/data/mm_data/caption_dataset.py and 256 here: https://github.com/OFA-Sys/OFA/blob/feature/add_transformers/transformers.md. Is there any reason for that?

NohTow avatar Sep 12 '22 12:09 NohTow
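
For anyone wanting to reproduce the loading-and-captioning step described above, a minimal sketch. It assumes the OFA-Sys fork of transformers (the feature/add_transformers branch), which provides OFATokenizer and OFAModel; the checkpoint path and the dummy image tensor are illustrative, and the prompt and generate kwargs follow the fork's captioning example:

import torch
from transformers import OFATokenizer, OFAModel  # from the OFA-Sys fork, not upstream transformers

ckpt_dir = "./OFA-base-caption"  # illustrative path to the converted checkpoint
tokenizer = OFATokenizer.from_pretrained(ckpt_dir)
model = OFAModel.from_pretrained(ckpt_dir, use_cache=False)

# Captioning prompt used in the OFA examples.
inputs = tokenizer(["what does the image describe?"], return_tensors="pt").input_ids

# Dummy image tensor; see the resize/normalize sketch later in the thread for real preprocessing.
patch_img = torch.zeros(1, 3, 480, 480)

gen = model.generate(inputs, patch_images=patch_img, num_beams=5, no_repeat_ngram_size=3)
print(tokenizer.batch_decode(gen, skip_special_tokens=True)[0].strip())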

generates gibberish before the EOS token

Yes, I also noticed it. I haven't checked whether the authors did something particular for it, but I postprocessed my caption result as follows to get rid of those unwanted characters:

import re

# Strip punctuation and escape artifacts left at the end of the generated caption.
result = re.sub(r'[^\w\s]', '', result).strip()

I observed this phenomenon in various other Seq2Seq models (i.e., gibberish artifacts at the end of the sequence), and it's easy to get rid of them, so I didn't bother 😄

Note that this only happens with the fine-tuned versions, not the "official HF versions"

The issue mentions a fix for the captioning task, so it might be that the captioning checkpoint was exported before this fix. But this is only a guess --maybe we can get a confirmation from the authors.

you use a resolution (patch image size) of 480

In their HF Spaces app, the authors set that value to cfg.task.patch_image_size (here on line 53), and it's defined as 480 by default here on line 62. Using a resolution of 480 makes inference take longer, but a lower resolution leads to missing some details such as small or rare objects and colors. So I chose to stick with the value of 480 because it seems to yield more elaborate captions.

monatis avatar Sep 12 '22 16:09 monatis
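
For completeness, a sketch of the resize-and-normalize step at that resolution (the 0.5 mean/std and bicubic resize follow the OFA examples; treat them as assumptions if your checkpoint was trained differently):

from PIL import Image
from torchvision import transforms

resolution = 480  # the patch_image_size discussed above
mean = std = [0.5, 0.5, 0.5]

patch_resize_transform = transforms.Compose([
    lambda image: image.convert("RGB"),
    transforms.Resize((resolution, resolution), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=mean, std=std),
])

img = Image.open("example.jpg")  # illustrative image path
patch_img = patch_resize_transform(img).unsqueeze(0)  # shape: (1, 3, 480, 480)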

I observed this phenomenon in various other Seq2Seq models (i.e., gibberish artifacts at the end of the sequence), and it's easy to get rid of them, so I didn't bother

Yeah, sure, we can filter it out, but it still means that the model tries to output this, which is not really good behavior, and I guess it is hurting the performance in the end.

The issue mentions a fix for the captioning task, so it might be that the captioning checkpoint was exported before this fix. But this is only a guess --maybe we can get a confirmation from the authors.

Yes, but again, it would be really strange for a model that outputs gibberish at the end to be SOTA imho, especially since it seems to be directly related to the fine-tuning for this particular task. Maybe the filtering is enough to get good results and the learning is not hurt, but I find it odd.

In their HF Spaces app, the authors set that value to cfg.task.patch_image_size (here on line 53), and it's defined as 480 by default here on line 62. Using a resolution of 480 makes inference take longer, but a lower resolution leads to missing some details such as small or rare objects and colors. So I chose to stick with the value of 480 because it seems to yield more elaborate captions.

Very well, thank you! Maybe it would be nice to know the exact experimental setup if we want to reproduce the results. Such models can handle different resolutions because they will just produce more patch tokens, but the amount of information contained in a patch depends on the resizing. Assuming you are evaluating on data whose format is similar to the training data, applying the same resizing should work better (since that is the information density the model was trained on).

Edit: My bad, it is specified in the paper:

For the image processing, we first resize and crop the images into different resolutions, 256 × 256 for OFA-Tiny and OFA-Medium, 384 × 384 for OFA-Base, 480 × 480 for OFA-Large and OFA-Huge, with a fixed patch size of 16 × 16. Note that training OFA-Large and OFA-Huge are time and computation consuming, we first train them with images of the resolution of 384 × 384 and 256 × 256, and continue pretraining with images of the resolution of 480 × 480.

NohTow avatar Sep 12 '22 16:09 NohTow
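
In code terms, the training image resolutions quoted above reduce to a small lookup (a convenience summary for scripts, not an official config; the key names are illustrative):

# Training image resolutions from the paper excerpt above, with a fixed 16 x 16 patch size.
IMAGE_RESOLUTION = {
    "ofa-tiny": 256,
    "ofa-medium": 256,
    "ofa-base": 384,
    "ofa-large": 480,
    "ofa-huge": 480,
}
PATCH_SIZE = 16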

I guess it is hurting the performance in the end

Don't think so. As previously stated, most Seq2Seq models have this behavior. I also observed it in my TTS research. A high-quality TTS model I trained always produced one second of noise at the end of the synthesized speech, and I was simply trimming it out.

I also verified that the authors use postprocessing to remove them.


I observed something interesting with the following image: boxing

The HF model produced the following with use_cache=True: "a man is standing next to a red punching bag". However, it produced the following more detailed caption with use_cache=False: "a man wearing a helmet and boxing gloves standing next to a red punching bag". Interestingly, the beam search decoder in Fairseq produced exactly the latter with use_cache=True, so it's faster and more elaborate. Again, this has nothing to do with the model itself, but it's related to how the beam search decoding is implemented. For some reason, we need to set use_cache=False in HF to match the quality of the Fairseq beam search decoding, thus ending up with 3-to-4x slower inference.

monatis avatar Sep 13 '22 13:09 monatis
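
A compact way to reproduce that comparison is to reload the model with each cache setting (the directory, prompt, and dummy image tensor are illustrative, as in the earlier loading sketch; the fork's example sets use_cache at load time):

import torch
from transformers import OFATokenizer, OFAModel  # OFA-Sys fork

ckpt_dir = "./OFA-huge-caption"  # illustrative path
tokenizer = OFATokenizer.from_pretrained(ckpt_dir)
inputs = tokenizer(["what does the image describe?"], return_tensors="pt").input_ids
patch_img = torch.zeros(1, 3, 480, 480)  # replace with a real preprocessed image

for use_cache in (True, False):
    model = OFAModel.from_pretrained(ckpt_dir, use_cache=use_cache)
    gen = model.generate(inputs, patch_images=patch_img, num_beams=5, no_repeat_ngram_size=3)
    print(use_cache, tokenizer.batch_decode(gen, skip_special_tokens=True)[0].strip())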

Don't think so. As previously stated, most Seq2Seq models have this behavior. I also observed it in my TTS research. A high-quality TTS model I trained always produced one second of noise at the end of the synthesized speech, and I was simply trimming it out.

I also verified that the authors use postprocessing to remove them.

Well, I evaluated the model with your regex to trim the bad ending and got these results:

- base-caption: 'eval_bleu': 0.3787352658847889
- large-caption: 'eval_bleu': 0.4061869408828353
- huge-caption: 'eval_bleu': 0.4294927572391054

So the results seem OK-ish, so the trimming should do the trick, but they are still a bit different from the reported ones. I'll try the trimming used by the authors and double-check that I have the exact same data as the paper (as well as my pipeline).

Interestingly, the beam search decoder in Fairseq produced exactly the latter with use_cache=True, so it's faster and more elaborate. Again, this has nothing to do with the model itself, but it's related to how the beam search decoding is implemented. For some reason, we need to set use_cache=False in HF to match the quality of the Fairseq beam search decoding, thus ending up with 3-to-4x slower inference.

Very interesting. I'll make sure to use use_cache=False for my attempt to reproduce the results. Are you sure that this is due to how beam search is implemented, or might it be coming from something wrong in the cached hidden states? I mean, is it worth it for me to dig into this and debug, or is it the expected behavior?

Also, do you know which decoding method/parameters are used in the paper for image captioning? It is mentioned for NLU, but I can't find it for captioning.

Edit: When using beam search with num_beams=5, no_repeat_ngram_size=3, and use_cache=False, I get these results:

- base-caption: 'eval_bleu': 0.42015316099762207
- large-caption: 'eval_bleu': 0.4375964024334991
- huge-caption: 'eval_bleu': 0.44117037945606136

This is much closer to the reported results! Note that with use_cache=True, the base-caption model achieves only 'eval_bleu': 0.30064348597736223. By the way @JustinLin610, do you plan to release fine-tuned checkpoints for the tiny and medium models? It would be cool for people who want to further train these models and do not have access to a lot of GPUs.

NohTow avatar Sep 13 '22 13:09 NohTow
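
For anyone trying to reproduce those numbers, a rough sketch of the scoring step using sacrebleu (the exact metric and tokenization behind the eval_bleu values above are not specified in the thread, so treat this as an approximation; the trimming regex is the one shared earlier):

import re
import sacrebleu

def trim(caption: str) -> str:
    # Strip the stray punctuation/escape characters generated before the EOS token.
    return re.sub(r"[^\w\s]", "", caption).strip()

# Illustrative data; in practice predictions come from generate() and references from COCO.
predictions = ["an airplane is parked on the runway at an airport}})\\)\\)"]
references = [["the airplane has landed behind a fence with barbed wire"]]

hyps = [trim(p) for p in predictions]
# sacrebleu expects references grouped by reference index, not by example.
refs = [list(r) for r in zip(*references)]
score = sacrebleu.corpus_bleu(hyps, refs)
print(score.score / 100)  # sacrebleu reports 0-100; divide to compare with the 0-1 values above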

Maybe you can try this repository; I succeeded in using Transformers to train the OFA model and run inference: https://github.com/yangjianxin1/OFA-Chinese

yangjianxin1 avatar Feb 17 '23 06:02 yangjianxin1