
BLIP-2 onnx support

Open jethrolow opened this issue 2 years ago • 6 comments

I would like to request support for converting the BLIP-2 model to ONNX.

I have tried to convert the model using the torch.onnx.export method, but there are issues because the input to the forward method is a dictionary rather than a tensor per se.

Would it be possible to provide a script to do this conversion? Alternatively, could the model itself be split into a vision_model and a text_model (as is the case in the Hugging Face implementation of BLIP-2), so that the dummy_input to torch.onnx.export can be a tensor?
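For example, a rough, untested sketch of what the split approach might look like with the Hugging Face implementation (the checkpoint name, the vision_model submodule, and the 224x224 input size are assumptions on my part):

import torch
from transformers import Blip2ForConditionalGeneration

# Assumption: load the full HF BLIP-2 model and export only its vision encoder,
# so the dummy input to torch.onnx.export can be a plain tensor.
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
vision_model = model.vision_model.eval()

# Assumption: the default BLIP-2 vision config expects 224x224 inputs.
dummy_pixel_values = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    torch.onnx.export(
        vision_model,
        (dummy_pixel_values,),
        "blip2_vision_encoder.onnx",
        input_names=["pixel_values"],
        output_names=["last_hidden_state", "pooler_output"],
        dynamic_axes={"pixel_values": {0: "batch"}},
        opset_version=13,
    )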

Thanks!

jethrolow avatar Sep 11 '23 18:09 jethrolow

+1

pieceskieran avatar Sep 25 '23 14:09 pieceskieran

Potentially relevant issue: https://github.com/pytorch/pytorch/issues/94280

Infinitay avatar Oct 01 '23 21:10 Infinitay

I have the same request for BLIP-2 in onnx

TeddyAlbina avatar Jan 16 '24 10:01 TeddyAlbina

https://docs.openvino.ai/2022.3/notebooks/233-blip-visual-language-processing-with-output.html

I found this, but it's really complicated.

Mohammad-Amin-Asadi avatar Mar 25 '24 09:03 Mohammad-Amin-Asadi

@prankshtain @jethrolow here is how you can export BLIP to ONNX:

# Code from https://huggingface.co/Salesforce/blip-image-captioning-large
import requests
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

# Export the forward pass with the three tensors from the processor as inputs.
with torch.no_grad():
    torch.onnx.export(
        model,
        (inputs["pixel_values"], inputs["input_ids"], inputs["attention_mask"]),
        f="blip_model.onnx",
        input_names=["pixel_values", "input_ids", "attention_mask"],
        output_names=["caption"],
        do_constant_folding=True,
        opset_version=13,
    )
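For anyone who wants to sanity-check the exported file, here is a rough, untested sketch of running it with onnxruntime, reusing the inputs prepared above. Note that the output named "caption" is really the decoder logits over the vocabulary, not a decoded string, so generation/decoding still has to happen outside the ONNX graph:

import onnxruntime as ort

# Load the exported graph and feed it the same three tensors as numpy arrays.
session = ort.InferenceSession("blip_model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(
    None,
    {
        "pixel_values": inputs["pixel_values"].numpy(),
        "input_ids": inputs["input_ids"].numpy(),
        "attention_mask": inputs["attention_mask"].numpy(),
    },
)
logits = outputs[0]  # assumed shape: (batch, sequence_length, vocab_size)
print(logits.shape)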

mjay2016 avatar Apr 15 '24 20:04 mjay2016

@mjay2016 Hi, I have also been exploring converting the BLIP model to ONNX format, and I was able to do the conditional-captioning conversion the way you suggested.

But I am unable to do the "unconditional captioning" conversion:

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

with torch.no_grad():
    torch.onnx.export(
        model,
        tuple((inputs["pixel_values"])),
        f="blip_model.onnx",
        input_names=['pixel_values', 'input_ids', 'attention_mask'],
        output_names=['caption'],
        do_constant_folding=True,
        opset_version=13,
    )

This is not working..

saiharish97 avatar Aug 08 '24 18:08 saiharish97
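A possible explanation for the failure above (an untested guess, not a verified fix): tuple(inputs["pixel_values"]) unpacks the tensor along its batch dimension instead of building a one-element args tuple, and the traced forward pass still expects decoder input_ids, since unconditional captioning is normally handled inside generate(). A sketch along these lines might get the export through (the bos_token_id lookup and the output naming are assumptions):

import torch

inputs = processor(raw_image, return_tensors="pt")

# Assumption: seed the decoder with the text config's start-of-sequence token,
# which is roughly what generate() does when no prompt is given.
bos_id = model.config.text_config.bos_token_id
input_ids = torch.tensor([[bos_id]], dtype=torch.long)
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    torch.onnx.export(
        model,
        (inputs["pixel_values"], input_ids, attention_mask),  # a real tuple, not tuple(tensor)
        f="blip_model_unconditional.onnx",
        input_names=["pixel_values", "input_ids", "attention_mask"],
        output_names=["logits"],
        do_constant_folding=True,
        opset_version=13,
    )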