[Feature Request]: Can we use a VLM to do document parser?

Open sinopec opened this issue 10 months ago • 0 comments

Is there an existing issue for the same feature request?

[x] I have checked the existing issues.

Is your feature request related to a problem?

Unable to accurately parse charts or other documents.

Describe the feature you'd like

Can we use a multimodal large model, such as Qwen2.5-VL, to extract content from images, scanned PDFs, or charts embedded in DOC files? If there is an interface that can be configured, it would be very flexible.

Describe implementation you've considered

No response

Documentation, adoption, use case

Additional information

No response

Feb 28 '25 13:02 sinopec