Extract text from imported (PDF / Word / Office) files

Open • joepio opened this issue 3 years ago • 1 comment

Being able to search inside the PDF files uploaded to Atomic Server would be a really nice addition.

Goals:

  • Make it easier to find PDF documents by searching for terms that occur inside them
  • Lightweight
  • Fast
  • Runs in the background and is allowed to fail. Should not slow down the upload process.
  • OCR, for PDFs that lack a text layer, would be a decent addition. But only if the other goals are met.
  • Bonus points if it also turns other doc types (e.g. docx) to plaintext
  • Output should be plaintext or (preferably) markdown

Non-goals:

  • Extract data from tables in PDFs
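
A minimal sketch of the "runs in background, may fail, should not slow down upload" goal, using only the Rust standard library. Everything here is an assumption made for illustration: `Job`, `extract_text`, `spawn_extractor` and `handle_upload` are hypothetical names, not part of Atomic Server, and the extractor is a stand-in that only checks for the `%PDF` magic bytes instead of parsing a real PDF.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread::{self, JoinHandle};

/// An uploaded file waiting for text extraction (hypothetical type).
pub struct Job {
    pub file_id: u64,
    pub bytes: Vec<u8>,
}

/// Placeholder extractor (assumption): a real one would call a PDF or docx
/// library. Here we only accept input starting with "%PDF" and treat the
/// remaining bytes as UTF-8 text.
pub fn extract_text(bytes: &[u8]) -> Result<String, String> {
    if !bytes.starts_with(b"%PDF") {
        return Err("not a PDF".to_string());
    }
    String::from_utf8(bytes[4..].to_vec()).map_err(|e| e.to_string())
}

/// Starts the background worker. Returns the job queue, the stream of
/// (file_id, extracted text) pairs, and the worker's join handle.
pub fn spawn_extractor() -> (Sender<Job>, Receiver<(u64, String)>, JoinHandle<()>) {
    let (job_tx, job_rx) = mpsc::channel::<Job>();
    let (text_tx, text_rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        // The loop ends once every job sender has been dropped.
        for job in job_rx {
            match extract_text(&job.bytes) {
                Ok(text) => {
                    // Stop if nobody is listening for results any more.
                    if text_tx.send((job.file_id, text)).is_err() {
                        break;
                    }
                }
                // Extraction is allowed to fail: log it and keep going.
                Err(e) => eprintln!("extraction failed for file {}: {}", job.file_id, e),
            }
        }
    });
    (job_tx, text_rx, handle)
}

/// Upload path: queue the file for extraction and return right away.
pub fn handle_upload(queue: &Sender<Job>, file_id: u64, bytes: Vec<u8>) {
    // Sending on an unbounded channel does not block. If the worker is gone,
    // the error is ignored so the upload itself still succeeds.
    let _ = queue.send(Job { file_id, bytes });
}
```

The receiving end of the text channel is where a search index would be fed. A failed extraction only produces a log line, so the uploaded file simply stays unsearchable by content.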

There are some tools that could help with this:

joepio, Aug 15 '22 11:08

https://crates.io/crates/pdfium-render - new contender, recently recommended on the Rust subreddit.

AlexMikhalev, Feb 27 '24 13:02