ocrs
Is PDF / DOCX support on the roadmap?
I know this is not trivial since I've been unsuccessful in finding any PDF->image Rust library, but is this something you plan on supporting in the future?
If help is needed please let me know.
Ocrs could potentially integrate with existing libraries or CLI tools for rendering PDFs somehow. It could also serve as a backend for a project like OCRmyPDF. What use case did you have in mind?
I would love PDF support, as I need to batch load invoice PDFs and extract the text data, which can then be saved as JSON to a DB.
@tomtom215 fwiw, you can try to preprocess your pdfs with pdf2image which works pretty well.
If we are looking for sentiment here, I'm also looking for PDF support. I have a mixture of text PDFs and scanned documents in PDF form (aka bitmap PDFs).
I'm ingesting them into embeddings for an AI. I'm planning on trying the pdf-extract crate first, and if it fails, falling back to pdf2image piped into this crate.
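For anyone wanting to try the same approach, here is a rough sketch of that fallback logic, using only the standard library. The names `extract_with_fallback` and `Source` are made up for illustration, and the two closures are placeholders: one would wrap whatever text-layer extractor you use, the other a render-then-OCR step. It also assumes that empty or whitespace-only output means "scanned PDF", which is only a heuristic.

```rust
use std::path::Path;

/// Where the returned text came from (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum Source {
    TextLayer,
    Ocr,
}

/// Try the embedded text layer first; fall back to OCR if that step
/// fails or yields only whitespace (assumed to indicate a scanned PDF).
pub fn extract_with_fallback<E1, E2, T, O>(
    pdf: &Path,
    text_layer: T,
    ocr: O,
) -> Result<(Source, String), E2>
where
    T: FnOnce(&Path) -> Result<String, E1>,
    O: FnOnce(&Path) -> Result<String, E2>,
{
    match text_layer(pdf) {
        // A non-empty text layer is used as-is.
        Ok(text) if !text.trim().is_empty() => Ok((Source::TextLayer, text)),
        // Error or empty text: discard it and run the OCR path instead.
        _ => ocr(pdf).map(|text| (Source::Ocr, text)),
    }
}
```

The extractors are passed in as closures so the decision logic stays independent of which PDF or OCR library ends up doing the work.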
Hopefully in the future, this could just be done with a single crate.
Rendering a PDF into an image is a complex task which will either involve using platform-specific libraries or compiling large dependencies. Rather than add that into the ocrs crate I think it would make more sense to build libraries on top that can orchestrate the pipeline. In the Tesseract ecosystem, there is OCRmyPDF which does this.
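As a rough illustration of what such an orchestration layer might look like, the sketch below builds the two external commands with `std::process::Command`. It assumes Poppler's `pdftoppm` and the `ocrs` CLI are both installed and on `PATH`; the function names are hypothetical, and nothing here is part of the ocrs crate itself.

```rust
use std::path::Path;
use std::process::Command;

/// Build a `pdftoppm` invocation that renders every page of `pdf` to PNG
/// files named after `out_prefix` (assumes Poppler is installed).
pub fn render_pages_command(pdf: &Path, out_prefix: &Path, dpi: u32) -> Command {
    let mut cmd = Command::new("pdftoppm");
    cmd.arg("-png") // write PNG images
        .arg("-r")
        .arg(dpi.to_string()) // render resolution in DPI
        .arg(pdf)
        .arg(out_prefix);
    cmd
}

/// Build an `ocrs` CLI invocation for one rendered page image.
/// The recognized text is expected on the command's stdout.
pub fn ocr_page_command(image: &Path) -> Command {
    let mut cmd = Command::new("ocrs");
    cmd.arg(image);
    cmd
}
```

A caller would run the first command with `.status()`, list the PNG files produced for the prefix, then run the second command with `.output()` on each page and concatenate the stdout. Listing the output files is safer than guessing their names, since the page-number suffix varies with the page count.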