Using a multimodal processor vs. separate tokenizer and image_processor
The model currently returns a tokenizer and an image_processor as separate objects, as seen e.g. here. The Hugging Face "preferred way" seems to be a single multimodal processor that handles text and images at once (see here, or the sample code for Idefics3 here). This has been causing me trouble because I am trying to use llava-more for structured text generation with the outlines package, which assumes a single multimodal processor rather than separate tokenizer and image_processor objects (see e.g. this line of code). (Idefics3 currently doesn't seem to work either, because of incompatible inputs to the processor.)
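For concreteness, here is roughly what the two calling conventions look like. The model ids are placeholders rather than actual checkpoints, and llava-more actually returns its tokenizer/image_processor from its own loading code, so this is just an illustration of the interface difference:

```python
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer, AutoImageProcessor

image = Image.open("example.jpg")
prompt = "Describe this image."

# 1) Single multimodal processor (what outlines expects); the model id is a
#    placeholder for any model that ships a combined processor, e.g. Idefics3.
processor = AutoProcessor.from_pretrained("<multimodal-model-id>")
inputs = processor(text=prompt, images=image, return_tensors="pt")

# 2) Separate tokenizer + image_processor (what llava-more currently returns);
#    again, the checkpoint id is a placeholder.
tokenizer = AutoTokenizer.from_pretrained("<llava-more-checkpoint>")
image_processor = AutoImageProcessor.from_pretrained("<llava-more-checkpoint>")
text_inputs = tokenizer(prompt, return_tensors="pt")
image_inputs = image_processor(image, return_tensors="pt")
inputs = {**text_inputs, **image_inputs}
```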
Hi @joris-sense, thank you for your interest in our project!

To solve the problem, you should modify the TransformersVision class within the outlines repo so that it can handle two separate processors.
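One possible shape for that change, as a rough sketch that has not been tested against the current outlines codebase (the wrapper class and its methods below are my own naming, not part of either library), is to combine the tokenizer and image_processor behind the single-processor interface that TransformersVision expects, either inside the class itself or via a small wrapper object passed to it:

```python
class CombinedProcessor:
    """Exposes a separate tokenizer and image_processor behind the
    processor(text=..., images=...) interface used by multimodal models."""

    def __init__(self, tokenizer, image_processor):
        self.tokenizer = tokenizer
        self.image_processor = image_processor

    def __call__(self, text=None, images=None, return_tensors="pt", **kwargs):
        # Tokenize the text and preprocess the images separately, then merge
        # the resulting feature dicts into a single inputs dict.
        inputs = {}
        if text is not None:
            inputs.update(self.tokenizer(text, return_tensors=return_tensors, **kwargs))
        if images is not None:
            inputs.update(self.image_processor(images, return_tensors=return_tensors))
        return inputs

    # Some code paths may call tokenizer methods through the processor, so
    # forwarding decode/batch_decode can be useful as well.
    def batch_decode(self, *args, **kwargs):
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        return self.tokenizer.decode(*args, **kwargs)
```

With something like this, the llava-more tokenizer and image_processor could be combined as `CombinedProcessor(tokenizer, image_processor)` and handed to the code path that expects a single processor, though you would still need to check exactly which processor attributes and methods TransformersVision relies on.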