Is PDF / DOCX support on the roadmap?

Open wdoppenberg opened this issue 1 year ago • 5 comments
I know this is not trivial since I've been unsuccessful in finding any PDF->image Rust library, but is this something you plan on supporting in the future?

If help is needed please let me know.

wdoppenberg avatar Mar 07 '24 15:03 wdoppenberg

Ocrs could potentially integrate with existing libraries or CLI tools for rendering PDFs somehow. It could also serve as a backend for a project like OCRmyPDF. What use case did you have in mind?

robertknight avatar Mar 07 '24 16:03 robertknight

I would love PDF support, as I need to batch-load invoice PDFs and extract the text data, which can then be saved as JSON to a DB.

tomtom215 avatar Apr 30 '24 11:04 tomtom215

@tomtom215 fwiw, you can try preprocessing your PDFs with pdf2image, which works pretty well.

woidda avatar May 02 '24 15:05 woidda

If we are looking for sentiment here, I'm also looking for PDF support. I have a mixture of text PDFs and scanned documents in PDF form (aka bitmap PDFs).

I'm ingesting them into embeddings for an AI. I'm planning on trying the pdf-extract crate first and, if it fails, falling back to pdf2image piped into this crate.

Hopefully in the future, this could just be done with a single crate.

physics515 avatar Jan 26 '25 02:01 physics515
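
The "try text extraction first, fall back to OCR" idea from the comment above could be sketched roughly like this. This is only an illustration, not anything ocrs provides: `text_with_fallback`, `extract_text`, `ocr_pages` and the `min_chars` threshold are hypothetical names, and the two callables stand in for whatever text extractor and render-plus-OCR step you actually use.

```python
from typing import Callable


def text_with_fallback(
    pdf_path: str,
    extract_text: Callable[[str], str],  # hypothetical: reads the PDF's text layer
    ocr_pages: Callable[[str], str],     # hypothetical: renders pages and runs OCR
    min_chars: int = 20,                 # assumed cutoff for "has a usable text layer"
) -> tuple[str, str]:
    """Return (text, source), where source is "text-layer" or "ocr"."""
    try:
        text = extract_text(pdf_path)
    except Exception:
        # Treat any extractor failure the same as a missing text layer.
        text = ""
    if len(text.strip()) >= min_chars:
        return text, "text-layer"
    # Scanned (bitmap) PDFs usually yield little or no text, so OCR them instead.
    return ocr_pages(pdf_path), "ocr"
```

Passing the two steps in as callables keeps the decision logic independent of any particular PDF library, so either side can be swapped out later.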

Rendering a PDF into an image is a complex task that will involve either using platform-specific libraries or compiling large dependencies. Rather than add that to the ocrs crate, I think it would make more sense to build libraries on top that can orchestrate the pipeline. In the Tesseract ecosystem, there is OCRmyPDF, which does this.

robertknight avatar Jan 26 '25 06:01 robertknight
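
For anyone landing here, a minimal sketch of such an orchestration layer built on CLI tools might look like the following. It assumes that Poppler's `pdftoppm` (the tool pdf2image wraps) and the `ocrs` CLI are both installed and on `PATH`; the function names and the 300 DPI default are illustrative choices, not part of ocrs.

```python
import subprocess
import tempfile
from pathlib import Path


def render_command(pdf: Path, out_prefix: Path, dpi: int = 300) -> list[str]:
    """Build a pdftoppm call that writes one PNG per page as <out_prefix>-<n>.png."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(out_prefix)]


def ocr_command(image: Path) -> list[str]:
    """Build an ocrs call; the CLI prints the recognized text to stdout."""
    return ["ocrs", str(image)]


def page_number(image: Path) -> int:
    # pdftoppm may zero-pad page numbers ("page-07.png"), so compare as integers.
    return int(image.stem.rsplit("-", 1)[1])


def ocr_pdf(pdf: Path, dpi: int = 300) -> list[str]:
    """Return the recognized text of each page, in page order."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(render_command(pdf, prefix, dpi), check=True)
        pages = sorted(Path(tmp).glob("page-*.png"), key=page_number)
        return [
            subprocess.run(
                ocr_command(page), check=True, capture_output=True, text=True
            ).stdout
            for page in pages
        ]
```

Calling `ocr_pdf(Path("invoice.pdf"))` would then give one string per page, which can be serialized to JSON or passed on to an embedding step. The file name here is just a placeholder.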