obsidian-ocr [FR] Indexing Microsoft PowerPoint/Word files

Open • khesed opened this issue 1 year ago • 2 comments

Is it possible to extend the algorithm to index other filetypes from Microsoft Office? For example, pptx and docx.

I think there are at least two possible approaches. The first is to convert each slide/page of a pptx or docx file to an image and then run OCR on it. This could be done with the unoconv tool.
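
As a rough illustration of the first approach, the sketch below only plans a unoconv call that converts an Office file to PDF as an intermediate step. The names `planConversion` and `Command` are hypothetical and not part of obsidian-ocr; only the unoconv flags `-f` (output format) and `-o` (output location) are real. It assumes unoconv and LibreOffice are installed on a desktop system, and it leaves rendering pages to images and the OCR step itself out of scope.

```typescript
// Hypothetical helper, not part of obsidian-ocr: describes an external command.
interface Command {
  program: string;
  args: string[];
}

// Only these two Office formats are considered, matching the request above.
const OFFICE_FILE = /\.(pptx|docx)$/i;

// Returns the unoconv invocation that would convert the file to PDF,
// or null when the file is not a pptx/docx file.
function planConversion(inputPath: string, outputDir: string): Command | null {
  if (!OFFICE_FILE.test(inputPath)) {
    return null;
  }
  return {
    program: "unoconv",
    // -f selects the output format, -o the output directory.
    args: ["-f", "pdf", "-o", outputDir, inputPath],
  };
}
```

On desktop, the returned `program` and `args` could be handed to something like Node's `child_process.execFile`. Keeping the planning separate from the execution makes it easy to skip the step where spawning processes is not possible.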

The second would be using an interface that exposes the internal data of those filetypes, like the python-pptx library. This would be more akin to just extending the search function of Obsidian in general, which may be out of the scope of the project. So, I think the first approach might be more reasonable for this project.

Aug 01 '23 18:08 khesed

Interesting idea. This will definitely involve a lot of work. The main problem I see at the moment is parsing the PPTX and DOCX files while only using JS.
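
To give a sense of what JS-only parsing would involve: PPTX and DOCX files are ZIP archives of XML parts, with slide text stored in `<a:t>` elements and Word text in `<w:t>` elements. The sketch below assumes the archive has already been unzipped by some other means (not shown) and the XML of one part is available as a string. The function names are hypothetical, and a regex is a simplification that a real XML parser would replace.

```typescript
// Decode the five predefined XML entities that can appear in text runs.
function decodeXmlEntities(text: string): string {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, "&"); // last, so "&amp;lt;" becomes "&lt;" and not "<"
}

// Collect the text of every <a:t> (PowerPoint) or <w:t> (Word) element.
// The pattern requires whitespace or ">" right after the tag name,
// so elements such as <w:tab/> are not matched.
function extractRuns(xml: string, prefix: "a" | "w"): string[] {
  const pattern = new RegExp(
    `<${prefix}:t(?:\\s[^>]*)?>([^<]*)</${prefix}:t>`,
    "g"
  );
  const runs: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    runs.push(decodeXmlEntities(match[1] || ""));
  }
  return runs;
}
```

This covers only plain text runs. Tables, notes, embedded images (which would still need OCR), and the unzip step are left out.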

Aug 02 '23 14:08 MohrJonas

Yeah, I can see how this can be challenging.

There are some individual libraries in pure JS for each file extension, like js-pptx and js-ppt.

And there are ones that try to do it all, like any-text, but then you'd need to dig through the dependencies to see if it's really pure JS.

Aug 02 '23 17:08 khesed
