Add support for exporting SigLIP models
### Feature request

Add support for exporting SigLIP models.

### Motivation

SigLIP is used by many SOTA VLMs and is gaining traction; supporting it would be a natural first step toward supporting many of those VLMs.

### Your contribution

Not at the moment
Hi @xenova, I see that you have already done it in https://huggingface.co/Xenova/siglip-large-patch16-384. May I know how you exported it, since it is not supported in Optimum yet?
Here are my custom configs: https://github.com/xenova/transformers.js/blob/main/scripts/extra/siglip.py. Hope that helps!
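Roughly, the approach there is to define custom ONNX configs for the SigLIP text and vision towers and export each tower separately. Below is a simplified sketch of that idea (not the exact contents of the script above; the class names, output names, opset, and file paths are illustrative and may need adjusting):

```python
# Rough sketch only: minimal custom ONNX configs for the two SigLIP towers,
# exported with Optimum's low-level `export` helper.
from pathlib import Path

from optimum.exporters.onnx import export
from optimum.exporters.onnx.config import TextEncoderOnnxConfig, VisionOnnxConfig
from optimum.utils import NormalizedTextConfig, NormalizedVisionConfig
from transformers import SiglipTextModel, SiglipVisionModel


class SiglipTextOnnxConfig(TextEncoderOnnxConfig):
    NORMALIZED_CONFIG_CLASS = NormalizedTextConfig

    @property
    def inputs(self):
        # SigLIP is typically used with padding="max_length", so only input_ids is exported here.
        return {"input_ids": {0: "batch_size", 1: "sequence_length"}}

    @property
    def outputs(self):
        return {
            "last_hidden_state": {0: "batch_size", 1: "sequence_length"},
            "pooler_output": {0: "batch_size"},
        }


class SiglipVisionOnnxConfig(VisionOnnxConfig):
    NORMALIZED_CONFIG_CLASS = NormalizedVisionConfig

    @property
    def inputs(self):
        return {"pixel_values": {0: "batch_size", 1: "num_channels", 2: "height", 3: "width"}}

    @property
    def outputs(self):
        return {
            "last_hidden_state": {0: "batch_size"},
            "pooler_output": {0: "batch_size"},
        }


model_id = "google/siglip-large-patch16-384"

# Export the text tower.
text_model = SiglipTextModel.from_pretrained(model_id)
export(
    text_model,
    SiglipTextOnnxConfig(text_model.config, task="feature-extraction"),
    Path("text_model.onnx"),
    opset=14,
)

# Export the vision tower.
vision_model = SiglipVisionModel.from_pretrained(model_id)
export(
    vision_model,
    SiglipVisionOnnxConfig(vision_model.config, task="feature-extraction"),
    Path("vision_model.onnx"),
    opset=14,
)
```

The resulting ONNX files can then be packaged alongside the tokenizer and processor configs for use in Transformers.js.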
Thanks. Do you know if this can be used with the HF pipeline?
The Python library? Not too sure. It does work with Transformers.js, though. See the model card:
**Example:** Zero-shot image classification w/ `Xenova/siglip-large-patch16-384`:
```js
import { pipeline } from '@xenova/transformers';

const classifier = await pipeline('zero-shot-image-classification', 'Xenova/siglip-large-patch16-384');
const url = 'http://images.cocodataset.org/val2017/000000039769.jpg';
const output = await classifier(url, ['2 cats', '2 dogs'], {
  hypothesis_template: 'a photo of {}',
});
console.log(output);
// [
//   { score: 0.4783420264720917, label: '2 cats' },
//   { score: 0.00022271279885899276, label: '2 dogs' }
// ]
```
**Example:** Compute text embeddings with `SiglipTextModel`.
```js
import { AutoTokenizer, SiglipTextModel } from '@xenova/transformers';

// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/siglip-large-patch16-384');
const text_model = await SiglipTextModel.from_pretrained('Xenova/siglip-large-patch16-384');

// Run tokenization
const texts = ['a photo of 2 cats', 'a photo of 2 dogs'];
const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });

// Compute embeddings
const { pooler_output } = await text_model(text_inputs);
// Tensor {
//   dims: [ 2, 768 ],
//   type: 'float32',
//   data: Float32Array(1536) [ ... ],
//   size: 1536
// }
```
**Example:** Compute vision embeddings with `SiglipVisionModel`.
```js
import { AutoProcessor, SiglipVisionModel, RawImage } from '@xenova/transformers';

// Load processor and vision model
const processor = await AutoProcessor.from_pretrained('Xenova/siglip-large-patch16-384');
const vision_model = await SiglipVisionModel.from_pretrained('Xenova/siglip-large-patch16-384');

// Read image and run processor
const image = await RawImage.read('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/football-match.jpg');
const image_inputs = await processor(image);

// Compute embeddings
const { pooler_output } = await vision_model(image_inputs);
// Tensor {
//   dims: [ 1, 768 ],
//   type: 'float32',
//   data: Float32Array(768) [ ... ],
//   size: 768
// }
```
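On the Python side, even without pipeline integration, the exported text model can be run directly with onnxruntime. A minimal, untested sketch, assuming a `text_model.onnx` exported with a single `input_ids` input and a `pooler_output` output (as in the export sketch earlier in this thread):

```python
# Untested sketch: run the exported SigLIP text tower with onnxruntime.
# The input/output names and the file path are assumptions from the export sketch above.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Xenova/siglip-large-patch16-384")
session = ort.InferenceSession("text_model.onnx")

texts = ["a photo of 2 cats", "a photo of 2 dogs"]
# SigLIP expects max-length padding, mirroring the Transformers.js example above.
inputs = tokenizer(texts, padding="max_length", truncation=True, return_tensors="np")

(pooler_output,) = session.run(
    ["pooler_output"],
    {"input_ids": inputs["input_ids"].astype(np.int64)},
)
print(pooler_output.shape)  # (2, hidden_size)
```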
Alright, thank you
I will still keep this issue open so that you or someone else can make a PR to add the config to the repo.
I'd like to take this if no one has picked it up yet @aliencaocao!
Sure. Actually, I do have a working SigLIP-to-TensorRT conversion and inference script using torch2trt, but I don't know how much it overlaps with what Optimum has/uses. Maybe a maintainer can chip in so we don't reinvent the wheel?
@aliencaocao Optimum only handles converting models to ONNX, AFAIK, not TensorRT. So the work we do for this PR would stop just short of the ONNX-to-TensorRT conversion. That said, I will let another maintainer chime in!
This issue has been marked as stale because it has been open for 30 days with no activity. This thread will be automatically closed in 5 days if no further activity occurs.