docs-scraper integration with Apache Tika

Open mdraevich opened this issue 3 years ago • 2 comments

Hello, I recently finished building a local search system that can index Word, Markdown, and PDF files.

I built it with the nginx autoindex module (slightly customized) and Meilisearch (scraper + engine + search bar). Because docs-scraper does not index Word, Markdown, or PDF files by default, I made a few changes:

  1. For Markdown files, I used markdown2 to convert .md to .html.
  2. For Word and PDF files, I used a remote server running Apache Tika to convert them to .html.
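
The two conversion steps above could be sketched roughly as follows. This is a minimal illustration, not the code from the (now deleted) repository: the names `converter_for`, `build_tika_request`, `to_html`, the extension sets, and `TIKA_URL` are all hypothetical, and it assumes a Tika server reachable on its default port 9998. Wiring it into the scraper's downloader middleware is left out.

```python
import urllib.request
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumption: a Tika server is running and reachable at this address.
TIKA_URL = "http://localhost:9998/tika"

# Hypothetical extension lists; adjust to the files you actually serve.
MARKDOWN_EXT = {".md", ".markdown"}
TIKA_EXT = {".pdf", ".doc", ".docx"}


def converter_for(url: str) -> str:
    """Return 'markdown', 'tika', or 'none' based on the URL's file extension."""
    ext = PurePosixPath(urlparse(url).path).suffix.lower()
    if ext in MARKDOWN_EXT:
        return "markdown"
    if ext in TIKA_EXT:
        return "tika"
    return "none"


def build_tika_request(body: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that asks the Tika server for an HTML rendering."""
    # The /tika endpoint takes the raw document as the request body and
    # returns XHTML when the Accept header is text/html.
    return urllib.request.Request(
        tika_url, data=body, method="PUT", headers={"Accept": "text/html"}
    )


def to_html(url: str, body: bytes) -> bytes:
    """Convert a downloaded document to HTML; pass other content through."""
    kind = converter_for(url)
    if kind == "markdown":
        import markdown2  # third-party: pip install markdown2
        return markdown2.markdown(body.decode("utf-8")).encode("utf-8")
    if kind == "tika":
        with urllib.request.urlopen(build_tika_request(body), timeout=30) as resp:
            return resp.read()
    return body  # already HTML (or something we do not convert)
```

In a Scrapy-based scraper, a function like `to_html` would typically be called from a downloader middleware so that the spider only ever sees HTML responses.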

So I'd like to know whether you, as the developers, are interested in these changes. If so, I will open a PR. See my code here. To be precise, see the files custom_downloader_middleware.py and documentation_spider.py.

P.S. I don't think my code is well optimized, so I'm open to criticism :)

mdraevich, Mar 16 '22 10:03

Hello @SorryGames, thanks for this issue :) I'm reopening it so that @bidoubiwa or @brunoocasali can make a decision about it!

Also, the link to your code is broken (404).

curquiza, Mar 28 '22 13:03

Hey, the link points to a deleted repository (I did some cleanup) :)

I will get back to you a bit later with updated code.

mdraevich, Apr 05 '22 08:04

This issue was opened a long time ago and has seen no activity since, so I'm closing it.

alallema, Aug 03 '23 11:08