Add support for exporting SigLIP models
### Feature request

Add support for exporting SigLIP models.

### Motivation

SigLIP is used by many SOTA VLMs and is gaining traction; supporting it would be a natural first step toward supporting many of those VLMs.

### Your contribution

Not at the moment
Hi @xenova, I see that you have already done it in https://huggingface.co/Xenova/siglip-large-patch16-384. May I know how you exported it, since it is not supported in Optimum yet?
Here are my custom configs: https://github.com/xenova/transformers.js/blob/main/scripts/extra/siglip.py. Hope that helps!
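Roughly, the approach there is to define custom ONNX configs for the SigLIP text and vision towers and export each tower separately. Below is a simplified sketch of that idea (not the exact contents of the script above; the class names, output names, opset, and file paths are illustrative and may need adjusting):

```python
# Rough sketch only: minimal custom ONNX configs for the two SigLIP towers,
# exported with Optimum's low-level `export` helper.
from pathlib import Path

from optimum.exporters.onnx import export
from optimum.exporters.onnx.config import TextEncoderOnnxConfig, VisionOnnxConfig
from optimum.utils import NormalizedTextConfig, NormalizedVisionConfig
from transformers import SiglipTextModel, SiglipVisionModel


class SiglipTextOnnxConfig(TextEncoderOnnxConfig):
    NORMALIZED_CONFIG_CLASS = NormalizedTextConfig

    @property
    def inputs(self):
        # SigLIP is typically used with padding="max_length", so only input_ids is exported here.
        return {"input_ids": {0: "batch_size", 1: "sequence_length"}}

    @property
    def outputs(self):
        return {
            "last_hidden_state": {0: "batch_size", 1: "sequence_length"},
            "pooler_output": {0: "batch_size"},
        }


class SiglipVisionOnnxConfig(VisionOnnxConfig):
    NORMALIZED_CONFIG_CLASS = NormalizedVisionConfig

    @property
    def inputs(self):
        return {"pixel_values": {0: "batch_size", 1: "num_channels", 2: "height", 3: "width"}}

    @property
    def outputs(self):
        return {
            "last_hidden_state": {0: "batch_size"},
            "pooler_output": {0: "batch_size"},
        }


model_id = "google/siglip-large-patch16-384"

# Export the text tower.
text_model = SiglipTextModel.from_pretrained(model_id)
export(
    text_model,
    SiglipTextOnnxConfig(text_model.config, task="feature-extraction"),
    Path("text_model.onnx"),
    opset=14,
)

# Export the vision tower.
vision_model = SiglipVisionModel.from_pretrained(model_id)
export(
    vision_model,
    SiglipVisionOnnxConfig(vision_model.config, task="feature-extraction"),
    Path("vision_model.onnx"),
    opset=14,
)
```

The resulting ONNX files can then be packaged alongside the tokenizer and processor configs for use in Transformers.js.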
Thanks. Do you know if this can be used with the HF pipeline?
The Python library? Not too sure. It does work with Transformers.js, though. See the model card:
**Example:** Zero-shot image classification w/ `Xenova/siglip-large-patch16-384`:
```js
import { pipeline } from '@xenova/transformers';

const classifier = await pipeline('zero-shot-image-classification', 'Xenova/siglip-large-patch16-384');
const url = 'http://images.cocodataset.org/val2017/000000039769.jpg';
const output = await classifier(url, ['2 cats', '2 dogs'], {
  hypothesis_template: 'a photo of {}',
});
console.log(output);
// [
//   { score: 0.4783420264720917, label: '2 cats' },
//   { score: 0.00022271279885899276, label: '2 dogs' }
// ]
```
**Example:** Compute text embeddings with `SiglipTextModel`.
```js
import { AutoTokenizer, SiglipTextModel } from '@xenova/transformers';

// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/siglip-large-patch16-384');
const text_model = await SiglipTextModel.from_pretrained('Xenova/siglip-large-patch16-384');

// Run tokenization
const texts = ['a photo of 2 cats', 'a photo of 2 dogs'];
const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });

// Compute embeddings
const { pooler_output } = await text_model(text_inputs);
// Tensor {
//   dims: [ 2, 768 ],
//   type: 'float32',
//   data: Float32Array(1536) [ ... ],
//   size: 1536
// }
```
**Example:** Compute vision embeddings with `SiglipVisionModel`.
```js
import { AutoProcessor, SiglipVisionModel, RawImage } from '@xenova/transformers';

// Load processor and vision model
const processor = await AutoProcessor.from_pretrained('Xenova/siglip-large-patch16-384');
const vision_model = await SiglipVisionModel.from_pretrained('Xenova/siglip-large-patch16-384');

// Read image and run processor
const image = await RawImage.read('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/football-match.jpg');
const image_inputs = await processor(image);

// Compute embeddings
const { pooler_output } = await vision_model(image_inputs);
// Tensor {
//   dims: [ 1, 768 ],
//   type: 'float32',
//   data: Float32Array(768) [ ... ],
//   size: 768
// }
```
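On the Python side, even without pipeline integration, the exported text model can be run directly with onnxruntime. A minimal, untested sketch, assuming a `text_model.onnx` exported with a single `input_ids` input and a `pooler_output` output (as in the export sketch earlier in this thread):

```python
# Untested sketch: run the exported SigLIP text tower with onnxruntime.
# The input/output names and the file path are assumptions from the export sketch above.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Xenova/siglip-large-patch16-384")
session = ort.InferenceSession("text_model.onnx")

texts = ["a photo of 2 cats", "a photo of 2 dogs"]
# SigLIP expects max-length padding, mirroring the Transformers.js example above.
inputs = tokenizer(texts, padding="max_length", truncation=True, return_tensors="np")

(pooler_output,) = session.run(
    ["pooler_output"],
    {"input_ids": inputs["input_ids"].astype(np.int64)},
)
print(pooler_output.shape)  # (2, hidden_size)
```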
Alright, thank you
I will still keep this issue open so that you or someone else can make a PR to add the config to the repo.
I'd like to take this if no one has picked it up yet @aliencaocao!
Sure. Actually, I do have a working SigLIP-to-TensorRT conversion and inference script using torch2trt, but I don't know how much it overlaps with what Optimum has/uses. Maybe a maintainer can chip in so we don't reinvent the wheel?
@aliencaocao Optimum only handles converting models to ONNX, AFAIK, not TensorRT. So the work we do for this PR would stop just short of the ONNX-to-TensorRT conversion. That said, I will let another maintainer chime in!
This issue has been marked as stale because it has been open for 30 days with no activity. This thread will be automatically closed in 5 days if no further activity occurs.