open-semantic-search
More blacklist examples
Hello,
I am indexing a lot of mails and PDFs. Some PDFs take hours to OCR, so I'd like to blacklist them.
I saw the instructions at https://www.opensemanticsearch.org/doc/admin/config/blacklist/ but can't get it working.
Example File is: {'filename': '/home/info/Ratings/Newsletter/Focus Money/2014/FOCUSMONEY_2014-15.pdf', 'additional_plugins': ['enhance_pdf_ocr'], 'config': {'ocr': True}}
I put the following into the file /blacklist/enhance_pdf_ocr/blacklist-url-prefix:
Blacklist of URL Prefixes like domains or paths
/home/info/Ratings/Newsletter/Focus Money/2014
The plan was to skip OCR for this path. But every time I start the indexing again, it starts with the files in that path. More examples of the blacklisting variants would help me a lot, and perhaps other people too.
Best regards, Daniel
I found blacklisting by filetype did not work if there was a space in the path.
Try removing the spaces in the path, e.g. change /Focus Money/ to /Focus_Money/ or /FocusMoney/.
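To illustrate why a space could matter, here is a minimal sketch of plain prefix matching. I have not checked how Open Semantic Search actually compares paths, so this is only an assumption: `is_blacklisted` is a hypothetical helper, and the percent-encoding step is a guess at one way a path with a space could stop matching an unencoded prefix.

```python
from urllib.parse import quote


def is_blacklisted(uri, prefixes):
    """Hypothetical helper: True if uri starts with any listed prefix.

    Empty lines and lines starting with '#' are skipped (assumed format).
    """
    for line in prefixes:
        prefix = line.strip()
        if not prefix or prefix.startswith("#"):
            continue
        if uri.startswith(prefix):
            return True
    return False


# Prefix and file path taken from the question above.
prefixes = ["/home/info/Ratings/Newsletter/Focus Money/2014"]
path = "/home/info/Ratings/Newsletter/Focus Money/2014/FOCUSMONEY_2014-15.pdf"

# Comparing the raw path against the raw prefix matches.
raw_match = is_blacklisted(path, prefixes)

# If the path were percent-encoded before the comparison (an assumption),
# the space becomes %20 and the unencoded prefix no longer matches.
encoded_match = is_blacklisted(quote(path), prefixes)

print("raw path matches:", raw_match)
print("encoded path matches:", encoded_match)
```

If something like this is the cause, renaming the directory to remove the space, as suggested above, sidesteps the problem because the encoded and unencoded forms of the path are then identical.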