Using a multimodal processor vs. separate tokenizer and image_processor
The model currently returns a tokenizer and an image_processor as separate objects, as seen e.g. here. The Hugging Face "preferred way" seems to be a single multimodal processor that handles text and images at once (see here, or the sample code for Idefics3 here). This has been causing me trouble because I am trying to use llava-more for structured text generation with the outlines package, which assumes a single multimodal processor rather than separate tokenizer and image_processor objects (see e.g. this line of code). (Idefics3 currently doesn't seem to work either, because of incompatible inputs to the processor.)
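For concreteness, here is roughly what the two calling conventions look like. The model ids are placeholders rather than actual checkpoints, and llava-more actually returns its tokenizer/image_processor from its own loading code, so this is just an illustration of the interface difference:

```python
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer, AutoImageProcessor

image = Image.open("example.jpg")
prompt = "Describe this image."

# 1) Single multimodal processor (what outlines expects); the model id is a
#    placeholder for any model that ships a combined processor, e.g. Idefics3.
processor = AutoProcessor.from_pretrained("<multimodal-model-id>")
inputs = processor(text=prompt, images=image, return_tensors="pt")

# 2) Separate tokenizer + image_processor (what llava-more currently returns);
#    again, the checkpoint id is a placeholder.
tokenizer = AutoTokenizer.from_pretrained("<llava-more-checkpoint>")
image_processor = AutoImageProcessor.from_pretrained("<llava-more-checkpoint>")
text_inputs = tokenizer(prompt, return_tensors="pt")
image_inputs = image_processor(image, return_tensors="pt")
inputs = {**text_inputs, **image_inputs}
```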
Hi @joris-sense, thank you for your interest in our project!

To solve the problem, you should modify the TransformersVision class within the outlines repo so that it can handle two separate processors.
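One possible shape for that change, as a rough sketch that has not been tested against the current outlines codebase (the wrapper class and its methods below are my own naming, not part of either library), is to combine the tokenizer and image_processor behind the single-processor interface that TransformersVision expects, either inside the class itself or via a small wrapper object passed to it:

```python
class CombinedProcessor:
    """Exposes a separate tokenizer and image_processor behind the
    processor(text=..., images=...) interface used by multimodal models."""

    def __init__(self, tokenizer, image_processor):
        self.tokenizer = tokenizer
        self.image_processor = image_processor

    def __call__(self, text=None, images=None, return_tensors="pt", **kwargs):
        # Tokenize the text and preprocess the images separately, then merge
        # the resulting feature dicts into a single inputs dict.
        inputs = {}
        if text is not None:
            inputs.update(self.tokenizer(text, return_tensors=return_tensors, **kwargs))
        if images is not None:
            inputs.update(self.image_processor(images, return_tensors=return_tensors))
        return inputs

    # Some code paths may call tokenizer methods through the processor, so
    # forwarding decode/batch_decode can be useful as well.
    def batch_decode(self, *args, **kwargs):
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        return self.tokenizer.decode(*args, **kwargs)
```

With something like this, the llava-more tokenizer and image_processor could be combined as `CombinedProcessor(tokenizer, image_processor)` and handed to the code path that expects a single processor, though you would still need to check exactly which processor attributes and methods TransformersVision relies on.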