obsidian-ocr
[FR] Indexing Microsoft PowerPoint/Word files
Is it possible to extend the algorithm to index other filetypes from Microsoft Office? For example, pptx and docx.
I see at least two possible approaches. The first is to convert pptx and docx files into one image per slide or page and then run OCR on those images. The conversion could be done with the unoconv tool.
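As a rough illustration of the conversion step in this first approach, here is a minimal Python sketch. It is an assumption on my part, not something the plugin does: instead of unoconv it calls LibreOffice's `soffice` binary directly (unoconv is a wrapper around LibreOffice), and it only produces a PDF. Rendering the PDF pages to images for OCR would be a separate step. The function names `build_convert_command` and `convert_to_pdf` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(source: Path, out_dir: Path) -> List[str]:
    """Build a headless LibreOffice command that converts `source` to PDF."""
    return [
        "soffice",
        "--headless",            # run without opening a GUI window
        "--convert-to", "pdf",   # target format
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_to_pdf(source: Path, out_dir: Path) -> Path:
    """Convert a pptx/docx file to PDF and return the expected output path.

    Assumes LibreOffice is installed and `soffice` is on PATH.
    """
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice (soffice) was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(source, out_dir), check=True)
    # LibreOffice names the output after the input file's stem.
    return out_dir / (source.stem + ".pdf")
```

Calling `convert_to_pdf(Path("deck.pptx"), Path("converted"))` would, under these assumptions, leave `converted/deck.pdf` ready for a PDF-to-image step and then OCR.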
The second is to use a library that exposes the internal data of those file types, such as python-pptx. This would be closer to extending Obsidian's search function in general, which may be outside the scope of the project. So I think the first approach is the more reasonable one for this project.
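To make the second approach more concrete: pptx and docx files are ZIP archives containing XML, so the text can be read without OCR. The sketch below is only an illustration using Python's standard library rather than python-pptx, and the function names `pptx_slide_texts` and `docx_text` are hypothetical. It ignores notes, headers, footers and embedded images.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces for text runs in slides (<a:t>) and Word documents (<w:t>).
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def pptx_slide_texts(path_or_file) -> dict:
    """Map the number in each slide's file name to that slide's text.

    Note: file numbering is not guaranteed to match presentation order.
    """
    texts = {}
    with zipfile.ZipFile(path_or_file) as archive:
        for name in archive.namelist():
            match = re.fullmatch(r"ppt/slides/slide(\d+)\.xml", name)
            if match is None:
                continue
            root = ET.fromstring(archive.read(name))
            runs = [node.text for node in root.iter(A_NS + "t") if node.text]
            texts[int(match.group(1))] = " ".join(runs)
    return dict(sorted(texts.items()))


def docx_text(path_or_file) -> str:
    """Return the main document body text, one line per non-empty paragraph."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        runs = [node.text for node in para.iter(W_NS + "t") if node.text]
        if runs:
            paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Because this relies only on unzipping and XML parsing, the same idea should in principle carry over to other languages that have a ZIP reader and an XML parser.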
Interesting idea. This will definitely involve a lot of work. The main problem I see at the moment is parsing the PPTX and DOCX files while only using JS.