[Feature Request]: OCR capability with DocAgent

Open marklysze opened this issue 8 months ago • 2 comments

Is your feature request related to a problem? Please describe.

DocAgent has no OCR capability. This is definitely needed for PDFs, and it would also be useful for images, so that someone can ask questions about an image.

MistralOCR has a low-cost PDF-to-markdown endpoint that is quite effective. The best results I've found, though, come from Gemini 2.5 Pro: use pdf2image to convert the PDF to images (200 dpi), then MultimodalConversableAgent to convert each page to markdown, and finally combine the pages into one document.
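A minimal sketch of the page-by-page approach described above, not a proposed DocAgent implementation. `render_pdf_pages` and `pages_to_markdown` are hypothetical helper names, and `transcribe_page` is an assumed callable standing in for whatever does the image-to-markdown step (for example a multimodal agent backed by Gemini 2.5 Pro). The only third-party call used is `convert_from_path` from pdf2image, which needs Poppler installed.

```python
from typing import Any, Callable, Iterable, List


def render_pdf_pages(pdf_path: str, dpi: int = 200) -> List[Any]:
    """Render each PDF page to a PIL image at the given resolution."""
    # Imported lazily so the combining logic below works without pdf2image.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, dpi=dpi)


def pages_to_markdown(
    pages: Iterable[Any],
    transcribe_page: Callable[[Any], str],
    separator: str = "\n\n",
) -> str:
    """Transcribe each page to markdown and join the results in page order.

    `transcribe_page` is a placeholder (assumption): any function that takes
    one page image and returns its markdown text.
    """
    parts = []
    for number, page in enumerate(pages, start=1):
        text = transcribe_page(page).strip()
        if text:  # skip pages that produced no text
            # The page marker is an illustrative choice, not part of the proposal.
            parts.append(f"<!-- page {number} -->\n{text}")
    return separator.join(parts)
```

Usage would look like `pages_to_markdown(render_pdf_pages("report.pdf"), my_ocr_fn)`, where `my_ocr_fn` wraps the multimodal model call. Keeping the transcription step as a plain callable means the same loop could sit in front of either option mentioned above.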

Describe the solution you'd like

No response

Additional context

No response

May 13 '25 19:05 marklysze

@qingyun-wu @marklysze @sonichi is this still relevant?

Aug 06 '25 20:08 Lancetnik

@marklysze Could you give us an update on this? Has OCR been added already?

Aug 09 '25 22:08 qingyun-wu