
Feature Request: Support multimodal extraction (text + image)

Open · ojipadeson opened this issue 1 month ago · 1 comment

It would be great if langextract could support multimodal input, allowing users to pass both textual and visual data directly to the model without pre-processing images via a VL model. This would leverage modern multimodal models capable of handling image understanding tasks together with text.

Currently, when working with image-rich documents or visual datasets, we need to run a VL model to convert images to text before extraction (a sketch of this workaround follows the questions below). This loses potential visual context information and adds extra processing steps. Many recent LLMs can directly accept images alongside text prompts. Supporting this natively in langextract would:

  • Save processing time
  • Preserve visual context
  • Enable richer extraction capabilities from images

1. Is there any plan to support multimodal extraction (text + image) in langextract?
2. Do you have any recommended best practices or existing approaches for this scenario?
3. If there’s no current plan, would you welcome community contributions for such a feature?
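
For reference, the current two-step workaround looks roughly like the sketch below: a VL model captions the image, then langextract extracts from the resulting text. This is a minimal illustration, assuming the google-genai SDK for the vision step and langextract's `lx.extract` API for the extraction step; the model IDs, file name, prompts, and extraction classes are placeholders, not a definitive recipe.

```python
import langextract as lx
from google import genai
from PIL import Image

# Step 1: use a vision-language model to turn the image into text.
# genai.Client() picks up the API key from the environment.
client = genai.Client()
image = Image.open("invoice.png")  # placeholder input image
caption = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Describe this document in detail, including all visible text.",
        image,
    ],
).text

# Step 2: run langextract over the caption. Visual layout cues
# (position, tables, emphasis) are already lost at this point.
result = lx.extract(
    text_or_documents=caption,
    prompt_description="Extract vendor names and invoice totals.",
    examples=[
        lx.data.ExampleData(
            text="Invoice from Acme Corp, total due $1,200.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="vendor", extraction_text="Acme Corp"
                ),
                lx.data.Extraction(
                    extraction_class="total", extraction_text="$1,200"
                ),
            ],
        )
    ],
    model_id="gemini-2.5-flash",
)

for extraction in result.extractions:
    print(extraction.extraction_class, extraction.extraction_text)
```

Native multimodal support would collapse these two calls into one, letting the extraction model see the image directly instead of a lossy caption.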

ojipadeson · Oct 28 '25

This would be interesting, and I think there are various approaches to combining images and text. If you come up with a proof of concept, please do share it here! 😄

aksg87 · Nov 02 '25