BLIP-2 ONNX support
I would like to request support for converting the BLIP-2 model to ONNX.
I have tried to convert the model using the torch.onnx.export method, but there are issues because the input to the forward method is a dictionary and not a tensor per se.
Would it be possible to provide a script for this conversion? Alternatively, could the model be split into a vision_model and a text_model (as in the Hugging Face implementation of BLIP-2), so that the dummy_input to torch.onnx.export can be a plain tensor?
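For illustration, something like the following is what I have in mind for the vision side. This is an untested sketch: it assumes the Hugging Face Blip2ForConditionalGeneration exposes a vision_model submodule whose forward takes a pixel_values tensor, and that the default input resolution is 224x224.

import torch
from transformers import Blip2ForConditionalGeneration

# Untested sketch: export only the BLIP-2 vision tower so the dummy input is a
# plain tensor. The checkpoint name, submodule name, and image size are assumptions.
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
model.eval()

class VisionWrapper(torch.nn.Module):
    """Return a plain tensor instead of a ModelOutput so the exported graph has one clear output."""
    def __init__(self, vision_model):
        super().__init__()
        self.vision_model = vision_model

    def forward(self, pixel_values):
        return self.vision_model(pixel_values).last_hidden_state

vision = VisionWrapper(model.vision_model)
dummy_pixel_values = torch.randn(1, 3, 224, 224)  # assumed input resolution

with torch.no_grad():
    torch.onnx.export(
        vision,
        (dummy_pixel_values,),
        f="blip2_vision_model.onnx",
        input_names=["pixel_values"],
        output_names=["image_embeds"],
        do_constant_folding=True,
        opset_version=13,
    )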
Thanks!
+1
Potentially relevant issue: https://github.com/pytorch/pytorch/issues/94280
I have the same request for BLIP-2 in onnx
https://docs.openvino.ai/2022.3/notebooks/233-blip-visual-language-processing-with-output.html
I found it, but it's really complicated.
@prankshtain @jethrolow here is how you can export BLIP (the original BLIP model, not BLIP-2) to ONNX.
# Code from https://huggingface.co/Salesforce/blip-image-captioning-large
import requests
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

with torch.no_grad():
    torch.onnx.export(
        model,
        (inputs["pixel_values"], inputs["input_ids"], inputs["attention_mask"]),
        f="blip_model.onnx",
        input_names=["pixel_values", "input_ids", "attention_mask"],
        output_names=["caption"],
        do_constant_folding=True,
        opset_version=13,
    )
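To sanity-check the exported file, it can be loaded with onnxruntime (untested sketch, reusing the inputs from above; note that the output named "caption" is most likely the raw decoder logits for the given prompt, not a decoded string):

# Untested sanity check, assuming the export above produced blip_model.onnx.
import onnxruntime as ort

session = ort.InferenceSession("blip_model.onnx", providers=["CPUExecutionProvider"])
ort_inputs = {
    "pixel_values": inputs["pixel_values"].numpy(),
    "input_ids": inputs["input_ids"].numpy(),
    "attention_mask": inputs["attention_mask"].numpy(),
}
outputs = session.run(None, ort_inputs)
print([o.shape for o in outputs])  # first output is likely (batch, seq_len, vocab_size) decoder logits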
@mjay2016 Hi, I was also exploring converting the BLIP model to ONNX format, and I am able to do the conditional-captioning conversion as you suggested.
However, I am unable to convert the "unconditional captioning" variant:
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

with torch.no_grad():
    torch.onnx.export(
        model,
        tuple((inputs["pixel_values"])),
        f="blip_model.onnx",
        input_names=['pixel_values', 'input_ids', 'attention_mask'],
        output_names=['caption'],
        do_constant_folding=True,
        opset_version=13,
    )
This is not working.
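One possible workaround, as a rough and untested sketch: torch.onnx.export traces the model's forward(), which appears to require input_ids, while "unconditional" captioning in the Hugging Face implementation is handled inside generate() by seeding the prompt with a start token. So the conditional export above can be reused by feeding a start-token-only prompt at runtime; the token-id lookup below is an assumption (BLIP's tokenizer is BERT-like, so cls_token_id is used as a fallback).

import numpy as np
import onnxruntime as ort

# Rough, untested workaround: reuse the conditional-captioning graph and seed
# the prompt with the tokenizer's start token instead of exporting a separate
# "unconditional" graph. Note: the export above traced fixed shapes, so either
# re-export with dynamic_axes={"input_ids": {1: "seq"}, "attention_mask": {1: "seq"}}
# or keep the runtime prompt the same length as the traced one.
inputs = processor(raw_image, return_tensors="pt")  # image only, no text

start_id = processor.tokenizer.bos_token_id
if start_id is None:
    start_id = processor.tokenizer.cls_token_id  # assumption: BERT-like tokenizer

session = ort.InferenceSession("blip_model.onnx", providers=["CPUExecutionProvider"])
ort_inputs = {
    "pixel_values": inputs["pixel_values"].numpy(),
    "input_ids": np.array([[start_id]], dtype=np.int64),
    "attention_mask": np.array([[1]], dtype=np.int64),
}
logits = session.run(None, ort_inputs)[0]
print(logits.shape)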