docs-scraper
docs-scraper integration with Apache Tika
Hello, I recently finished building a local search system that can index Word, Markdown, and PDF files.
I built it with the nginx autoindex module (slightly customized) and meilisearch (scraper + engine + search bar). Because docs-scraper does not index Word / Markdown / PDF files by default, I made the following changes:
- for markdown files I used markdown2 to convert .md to .html
- for word / pdf files I used a remote server running Apache Tika to convert them to .html.
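The two conversion paths above could look roughly like the sketch below. This is not the code from the original repository (which is no longer available), only an illustration under stated assumptions: the Tika server address `http://localhost:9998/tika`, the helper names (`pick_converter`, `markdown_to_html`, `build_tika_request`, `tika_to_html`), and the list of file extensions are all hypothetical. It relies on `markdown2.markdown()` for Markdown and on the Tika server's `PUT /tika` endpoint with an `Accept: text/html` header for Word and PDF files.

```python
import urllib.request
from pathlib import Path

# Assumption: a Tika server reachable at this address (9998 is Tika's default port).
TIKA_URL = "http://localhost:9998/tika"


def pick_converter(filename: str) -> str:
    """Choose a conversion path from the file extension (hypothetical helper)."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".md":
        return "markdown2"
    if suffix in {".doc", ".docx", ".pdf"}:
        return "tika"
    return "none"  # leave everything else (e.g. .html) untouched


def markdown_to_html(text: str) -> str:
    """Convert Markdown text to HTML with markdown2."""
    import markdown2  # third-party package, imported lazily

    return markdown2.markdown(text)


def build_tika_request(data: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build the PUT request that asks Tika for an HTML rendering of the file."""
    return urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={"Accept": "text/html"},
    )


def tika_to_html(data: bytes, tika_url: str = TIKA_URL, timeout: float = 30.0) -> str:
    """Send raw Word/PDF bytes to the Tika server and return the HTML it produces."""
    request = build_tika_request(data, tika_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

In a scraper, a dispatcher like `pick_converter` would typically run on each downloaded response so that the rest of the pipeline only ever sees HTML.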
So I would like to know whether you, as the developers, are interested in these changes. If so, I will open a PR. See my code here. Specifically, see the files custom_downloader_middleware.py & documentation_spider.py
P.S. I don't think my code is well optimized, so I'm open to criticism :)
Hello @SorryGames thanks for this issue :) I am re-opening it so that @bidoubiwa or @brunoocasali can make a decision on it!
Also, the link to your code is broken (404)
Hey, the link points to a deleted repository (I did some cleanup) :)
I will get back to you a bit later with the updated code.
This issue was opened a long time ago and has remained unchanged. So I'm closing it.