
Fine-tune BLIP Image Captioning on a custom dataset

MikeMACintosh opened this issue 2 years ago · 19 comments

Hi, thanks for your amazing work. I'm enjoying using BLIP, which demonstrates impressive results :) Now I have a question: how can I fine-tune BLIP for the image captioning task on a custom dataset?

My dataset consists of categories, each with its own set of pictures; examples of categories are chimneys, pedestrian crossings, and more. I don't have text captions for the pictures, only the category names. Can I implement my plan? I studied your article on arXiv but could not find the answer to my question.

MikeMACintosh avatar Mar 22 '22 17:03 MikeMACintosh

Hi, to finetune BLIP's image captioning model on a custom dataset, you can prepare your annotation file in a similar format as the coco captioning file (coco_karpathy_train.json), and create your own dataset following coco_karpathy_dataset.py.
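
For example, here is a minimal sketch (not code from this repo) of how a category-labeled folder layout like the one described above could be turned into such an annotation file, using placeholder captions built from the category names; the exact field names ("image", "caption", "image_id") should be double-checked against coco_karpathy_train.json:

```python
import json
from pathlib import Path

# Assumed layout: my_dataset/<category_name>/<image files>
dataset_root = Path("my_dataset")

annotations = []
for idx, image_path in enumerate(sorted(dataset_root.glob("*/*.jpg"))):
    category = image_path.parent.name.replace("_", " ")  # e.g. "pedestrian crossing"
    annotations.append({
        "image": str(image_path.relative_to(dataset_root)),
        "caption": f"a picture of {category}",  # caption derived from the category name
        "image_id": str(idx),
    })

with open("my_dataset_train.json", "w") as f:
    json.dump(annotations, f)
```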

LiJunnan1992 avatar Mar 22 '22 23:03 LiJunnan1992

Thanks for your answer, I will try this approach.

MikeMACintosh avatar Mar 28 '22 14:03 MikeMACintosh

@LiJunnan1992 – appreciate all your engagement in the discussion! I'm also interested in fine-tuning BLIP on a proprietary dataset in an effort to generate "contextual captions" and have a few questions :)

  1. Would it be possible / make sense to fine-tune BLIP with variable prompt prefix? That is, instead of "a picture of " as a constant prompt prefix, I'd use a different prompt prefix for each image, incorporating contextual information I have about each image. For example, I might do something like "a picture from Vegan Delights website of ", and my hope would be that the text output would begin to reflect the content of the contextual prompt – for example by returning "a plate of vegan food" instead of "a plate of food"

  2. If this does make sense, are there any limits to the prompt prefix length I should be aware of? I've tried to track this down and it seems (from the bert-base-uncased model card on huggingface) that the limit might be 512 tokens? "The only constrain is that the result with the two "sentences" has a combined length of less than 512 tokens."

  3. If I took this approach, do you have any idea how many images I'd need to use in fine-tuning? I understand that more is better, but wondering if you have any rough guidance. OpenAI, for example, suggests 1000 examples for fine-tuning GPT3 – a basic rule of thumb like that would be super helpful.

Thanks again!

labenz avatar Jun 13 '22 15:06 labenz

Hi @labenz, thanks for your question.

  1. Yes, it is possible to use a variable prompt (see the sketch after this list).
  2. The maximum number of tokens that BLIP accepts is the same as BERT (512 tokens). However, BLIP is pretrained mostly on short sentences. To reduce memory cost, we have hard-coded the maximum text length as 40 (https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip.py#L110), but you can change it to other values.
  3. It is hard for me to say how many samples are enough. Please also note that BLIP's text decoder is much, much smaller than GPT-3, so the "prompt magic" may not work as well in BLIP.
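
As a rough illustration of point 1 (a sketch only, not code from this repo; the per-image "prompt" field and file layout are assumptions), a custom dataset modeled on coco_karpathy_dataset.py could prepend a different prefix per image instead of the constant "a picture of ":

```python
import json
import os

from PIL import Image
from torch.utils.data import Dataset


class ContextualCaptionDataset(Dataset):
    """Each annotation entry is assumed to hold 'image', 'caption', and a
    per-image 'prompt', e.g. "a picture from Vegan Delights website of "."""

    def __init__(self, ann_file, image_root, transform, max_words=40):
        self.annotations = json.load(open(ann_file, "r"))
        self.image_root = image_root
        self.transform = transform
        self.max_words = max_words

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        ann = self.annotations[index]
        image = Image.open(os.path.join(self.image_root, ann["image"])).convert("RGB")
        image = self.transform(image)
        # Variable prompt: prepend this image's own prefix rather than a fixed one.
        caption = ann["prompt"] + " ".join(ann["caption"].split()[: self.max_words])
        return image, caption
```

Note that the 40-token limit from point 2 still applies to the prompt plus the caption combined, unless the hard-coded max_length in models/blip.py is raised.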

LiJunnan1992 avatar Jun 14 '22 01:06 LiJunnan1992

thanks very much for the feedback – appreciate it!!

labenz avatar Jun 15 '22 20:06 labenz

@labenz Did you ever have any success with this approach? Looking to do a similar project.

ConorDoyle314 avatar Apr 28 '23 20:04 ConorDoyle314

We ended up using a different approach, which used BLIP image-text matching instead of captioning.

(For context, our problem was “image selection”, so we found that generating “ideal captions” and then selecting images by ITM was more effective than selecting by caption, and this seemed likely to be true even if we had fine tuned, especially because our images are extremely diverse)
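
For anyone curious what that looks like in code, here is a minimal sketch of ranking candidate images against an "ideal caption" with the repo's ITM head (the checkpoint path and preprocessing details are assumptions to adapt):

```python
import torch
from PIL import Image
from torchvision import transforms

from models.blip_itm import blip_itm  # from the BLIP repo

device = "cuda" if torch.cuda.is_available() else "cpu"
image_size = 384

preprocess = transforms.Compose([
    transforms.Resize((image_size, image_size)),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])

# Assumed local path to one of the retrieval/ITM checkpoints from the README.
model = blip_itm(pretrained="checkpoints/model_base_retrieval_coco.pth",
                 image_size=image_size, vit="base").to(device).eval()

ideal_caption = "a plate of vegan food"            # the generated "ideal caption"
candidates = ["img1.jpg", "img2.jpg", "img3.jpg"]  # images to choose between

scores = []
for path in candidates:
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        itm_logits = model(image, ideal_caption, match_head="itm")
    # Probability that the image and the caption match.
    scores.append(torch.softmax(itm_logits, dim=1)[:, 1].item())

print(max(zip(scores, candidates)))  # best-matching image and its score
```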


labenz avatar Apr 28 '23 20:04 labenz


Thanks for the replies. Regarding your answer 2: if I would like to fine-tune the BLIP model but my text file is far longer than 512 tokens, is there any solution to this without retraining BLIP (i.e., editing the text length of 40)?

Thanks.

FJGEODEV avatar Jun 17 '23 19:06 FJGEODEV

Hi,

We do have a notebook on that here in case you'd like to fine-tune the Hugging Face version of BLIP: https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb.

We also have a notebook using PEFT (LoRA): https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb. This is more memory-efficient since you only train a couple of linear projection layers, while keeping the model itself frozen.
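
For reference, the core of the PEFT approach boils down to something like the following condensed sketch (the target_modules and hyperparameters here are assumptions; check the notebook for the exact values, and for loading the model in 8-bit on a GPU):

```python
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration
from peft import LoraConfig, get_peft_model

processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Attach LoRA adapters; only these small low-rank matrices receive gradients.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "k_proj"],  # assumed attention projections of the language model
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a small fraction of the full parameter count

# A single illustrative training step on one (image, caption) pair.
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, text="a plate of vegan food", return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"],
                pixel_values=inputs["pixel_values"],
                labels=inputs["input_ids"])
outputs.loss.backward()
```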

NielsRogge avatar Aug 06 '23 19:08 NielsRogge

How big does the fine-tuning dataset need to be in order to have good performance for the image caption model?

pjerryhu avatar Aug 07 '23 03:08 pjerryhu

I would start with a couple of hundred, but as always, the more the better.

NielsRogge avatar Aug 07 '23 07:08 NielsRogge


Hi @NielsRogge @LiJunnan1992, checking to see if it is possible to do this on custom images. I have seen that BLIP works well on plain English terms, but some terms are geography-specific and are not identified when using BLIP. I would love to see a sample notebook for training on custom images, as mentioned above: "you can prepare your annotation file in a similar format as the coco captioning file (coco_karpathy_train.json), and create your own dataset following coco_karpathy_dataset.py."

andysingal avatar Aug 25 '23 11:08 andysingal

Hi,

You can create a custom image captioning dataset as follows: https://huggingface.co/docs/datasets/image_dataset#image-captioning
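
Concretely, the layout in those docs is just a folder of images plus a metadata.jsonl file with a caption per image; a minimal sketch (folder name hypothetical):

```python
from datasets import load_dataset

# Expected layout (see the linked docs):
# my_captions/
# ├── metadata.jsonl   # one JSON line per image, e.g. {"file_name": "0001.jpg", "text": "a chimney on a tiled roof"}
# ├── 0001.jpg
# └── 0002.jpg

dataset = load_dataset("imagefolder", data_dir="my_captions", split="train")
print(dataset[0]["image"], dataset[0]["text"])
```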

NielsRogge avatar Aug 25 '23 11:08 NielsRogge


Hey all, checking to see if someone has experience annotating geography-specific images to create a custom BLIP model. The current BLIP model is good for American/British English terms, but, for example, if I have a picture of a pyramid in Egypt, it has no way to identify it within the captions. Looking forward to hearing from you. Here are some images for reference: https://drive.google.com/drive/folders/1J_XGR6aKyxS0fLNgrs61zB30GEnPvgdh?usp=drive_link @LiJunnan1992 @NielsRogge

andysingal avatar Aug 25 '23 11:08 andysingal


Yes, I am trying it out, thanks @NielsRogge.

andysingal avatar Aug 26 '23 04:08 andysingal

Hello,

I am trying to generate a detailed description of an image using the BLIP model, and it has to be more than 200 words. For example: [image attached]

Is it possible to do this using the same model, and if not, what other options could I explore?

SuryaPrakash0201 avatar Sep 14 '23 05:09 SuryaPrakash0201

I'm afraid BLIP is not able to read text from an image in such a detailed way. I'd recommend taking a look at Pix2Struct, which uses a much higher image resolution and is able to read such text.
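
In case it helps, basic Pix2Struct usage in Transformers looks roughly like this (a sketch; the TextCaps checkpoint is an assumption, chosen because it is tuned toward captions that mention text visible in the image, and it still produces short captions rather than 200-word descriptions):

```python
from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

processor = Pix2StructProcessor.from_pretrained("google/pix2struct-textcaps-base")
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-textcaps-base")

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Generate a caption conditioned on the image patches.
generated_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(generated_ids[0], skip_special_tokens=True))
```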

NielsRogge avatar Sep 14 '23 06:09 NielsRogge

Hello everyone,

I tried to use BLIP to generate image captions for images from surgical procedures. However, the generated captions are repetitions of words like "a person", "the", "an", etc. My dataset consists of hundreds of photos that belong to 44 different text descriptions. The learning rate is 0.000001. How can this be solved?

Thank you in advance!

katekats avatar Jan 12 '24 11:01 katekats


Hello! How can I use my own image-text dataset to fine-tune the BLIP-2 model? The task I need to perform is image captioning. I have found that using pre-trained BLIP-2 alone to generate text descriptions for my images does not work well, so I would like to fine-tune on my dataset first before performing the captioning step. May I ask how to implement this, and which pre-trained model can be fine-tuned to achieve better results? Looking forward to your reply, thank you again! Good luck to you!

shams2023 avatar Mar 15 '24 07:03 shams2023