ocrs Is PDF / DOCX support on the roadmap?

Open wdoppenberg opened this issue 1 year ago • 5 comments

I know this is not trivial since I've been unsuccessful in finding any PDF->image Rust library, but is this something you plan on supporting in the future?

If help is needed please let me know.

Mar 07 '24 15:03 wdoppenberg

Ocrs could potentially integrate with existing libraries or CLI tools for rendering PDFs somehow. It could also serve as a backend for a project like OCRmyPDF. What use case did you have in mind?

Mar 07 '24 16:03 robertknight

I would love PDF support, as I need to batch load invoice PDFs and extract the text data, which can then be saved as JSON to a DB.

Apr 30 '24 11:04 tomtom215

@tomtom215 fwiw, you can try to preprocess your PDFs with pdf2image, which works pretty well.
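A minimal sketch of that preprocessing step in Rust, under stated assumptions: Poppler's `pdftoppm` tool (the renderer the Python pdf2image package wraps) and the `ocrs` CLI are both installed and on `PATH`. The helper names `rasterize_command`, `ocr_command` and `run` are hypothetical and not part of ocrs.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Build a `pdftoppm` invocation that renders every page of `pdf` to PNG
/// files whose names start with `prefix` followed by the page number.
/// Assumption: Poppler is installed; `dpi` is the render resolution.
fn rasterize_command(pdf: &Path, prefix: &Path, dpi: u32) -> Command {
    let mut cmd = Command::new("pdftoppm");
    cmd.arg("-png")
        .arg("-r")
        .arg(dpi.to_string())
        .arg(pdf)
        .arg(prefix);
    cmd
}

/// Build an `ocrs` CLI invocation that writes the text recognized in one
/// page image to `out`.
fn ocr_command(image: &Path, out: &Path) -> Command {
    let mut cmd = Command::new("ocrs");
    cmd.arg(image).arg("-o").arg(out);
    cmd
}

/// Run a prepared command and report whether it exited successfully.
fn run(mut cmd: Command) -> io::Result<bool> {
    Ok(cmd.status()?.success())
}
```

Calling `run(rasterize_command(...))` once per PDF and then `run(ocr_command(...))` once per generated page image would cover a batch job. Building the commands separately from running them keeps the argument lists easy to inspect.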

May 02 '24 15:05 woidda

If we are looking for sentiment here, I'm also looking for PDF support. I have a mixture of text PDFs and scanned documents in PDF form (aka bitmap PDFs).

I'm ingesting them into embeddings for an AI. I'm planning to try the pdf-extract crate first and, if it fails, fall back to pdf2image piped into this crate.

Hopefully in the future, this could just be done with a single crate.
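The fallback logic could look roughly like the sketch below. Everything here is an assumption for illustration: `text_or_ocr` is a hypothetical helper, the two closures stand in for the pdf-extract call and for the render-then-ocrs step, and `min_chars` is an arbitrary threshold for deciding that a PDF has no usable text layer.

```rust
/// Return the embedded text when it looks usable; otherwise fall back to OCR.
///
/// `extract_text` and `run_ocr` are hypothetical hooks: the first would wrap
/// a text-layer extractor, the second a "render pages, then OCR" step.
/// `min_chars` is the minimum number of non-whitespace characters required
/// to accept the extracted text (an assumed heuristic for scanned PDFs).
fn text_or_ocr<E, O>(extract_text: E, run_ocr: O, min_chars: usize) -> Result<String, String>
where
    E: FnOnce() -> Result<String, String>,
    O: FnOnce() -> Result<String, String>,
{
    match extract_text() {
        // Accept the text layer only if it contains enough real characters.
        Ok(text) if text.chars().filter(|c| !c.is_whitespace()).count() >= min_chars => Ok(text),
        // Extraction failed or produced too little text: try OCR instead.
        _ => run_ocr(),
    }
}
```

Checking the amount of extracted text as well as the error case matters because a scanned PDF may parse without error yet contain no text layer at all.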

Jan 26 '25 02:01 physics515

Rendering a PDF into an image is a complex task that will involve either using platform-specific libraries or compiling large dependencies. Rather than adding that to the ocrs crate, I think it would make more sense to build libraries on top that can orchestrate the pipeline. In the Tesseract ecosystem, OCRmyPDF does this.

Jan 26 '25 06:01 robertknight
