Support Microsoft Office file formats

Open ellipticview opened this issue 2 years ago • 2 comments

Most of the documents I would like to search are in ppt or pptx format (PowerPoint). It would be nice if PowerPoint and Word documents could be indexed, even without a preview option.

ellipticview, Apr 30 '23 08:04

This would be an excellent feature to add.

caojinbo, Apr 30 '23 19:04

Looking into Apache Tika for this via tika-python. It does require Java to be installed, but it seems robust and is permissively licensed. I'm open to another solution with fewer dependencies, but I haven't found a good one yet.

freedmand, Apr 30 '23 19:04
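
For readers wondering what the tika-python route might look like, here is a minimal sketch. It is not code from semantra. It assumes Java is available and that `pip install tika` has been run. `parser.from_file` is the tika-python call, while `OFFICE_EXTENSIONS`, `tika_extract`, and `extract_office_text` are hypothetical names used only for illustration.

```python
from pathlib import Path
from typing import Callable, Iterable, Iterator, Optional, Tuple

# Extensions picked for illustration only; Tika itself handles many more formats.
OFFICE_EXTENSIONS = {".ppt", ".pptx", ".doc", ".docx"}


def tika_extract(path: str) -> Optional[str]:
    """Extract plain text with tika-python (assumes Java and the tika package)."""
    from tika import parser  # imported lazily so this module loads without Tika

    parsed = parser.from_file(path)  # dict with "content" and "metadata" keys
    return parsed.get("content")  # may be None when no text was found


def extract_office_text(
    paths: Iterable[str],
    extract: Callable[[str], Optional[str]] = tika_extract,
) -> Iterator[Tuple[str, str]]:
    """Yield (path, text) for each Office file that has extractable text."""
    for path in paths:
        # Skip anything that is not a PowerPoint or Word file.
        if Path(path).suffix.lower() not in OFFICE_EXTENSIONS:
            continue
        text = extract(path)
        # Skip files where the extractor returned nothing usable.
        if text and text.strip():
            yield path, text.strip()
```

The extractor is passed in as a parameter so the file filtering can be tried out without Java, and so a lighter-weight extractor could be swapped in later if one turns up.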