
More Blacklist examples

Open · hngfngzl opened this issue 2 years ago · 1 comment

Hello,

I am indexing a lot of mails and PDFs. Some PDFs take hours to OCR, so I'd like to blacklist them.

I saw the instructions at https://www.opensemanticsearch.org/doc/admin/config/blacklist/ but can't get it working.

An example file is: {'filename': '/home/info/Ratings/Newsletter/Focus Money/2014/FOCUSMONEY_2014-15.pdf', 'additional_plugins': ['enhance_pdf_ocr'], 'config': {'ocr': True}}

I put the following in the file /blacklist/enhance_pdf_ocr/blacklist-url-prefix:

Blacklist of URL Prefixes like domains or paths

/home/info/Ratings/Newsletter/Focus Money/2014

The plan was to skip OCR for this path. But every time I start indexing again, it starts with the files in that path. More examples of the blacklisting variants would help me a lot, and perhaps other people too.

Best regards, Daniel

hngfngzl · Mar 13 '22 11:03

I found blacklisting by filetype did not work if there was a space in the path.

Try removing the spaces from the path, e.g., change /Focus Money/ to /Focus_Money/ or /FocusMoney/.
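To narrow down whether the space or the form of the path is the problem, it can help to test the prefix comparison outside the indexer. The following is only a sketch: it assumes the blacklist is applied as a plain "starts with" check and that lines beginning with # are comments, neither of which is confirmed in this thread. The helper names load_prefixes and is_blacklisted are hypothetical and are not part of Open Semantic Search.

```python
from urllib.parse import quote


def load_prefixes(text):
    """Return the non-empty, non-comment lines of a blacklist file's text.

    Assumption: lines starting with '#' are comments.
    """
    prefixes = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            prefixes.append(line)
    return prefixes


def is_blacklisted(uri, prefixes):
    """Assumed matching rule: the URI starts with one of the prefixes."""
    return any(uri.startswith(prefix) for prefix in prefixes)


# Blacklist content and example file taken from the question above.
blacklist_text = """# Blacklist of URL Prefixes like domains or paths
/home/info/Ratings/Newsletter/Focus Money/2014
"""
prefixes = load_prefixes(blacklist_text)
path = "/home/info/Ratings/Newsletter/Focus Money/2014/FOCUSMONEY_2014-15.pdf"

# Three forms the indexer might conceivably compare against the prefix.
raw_match = is_blacklisted(path, prefixes)
uri_match = is_blacklisted("file://" + path, prefixes)
encoded_match = is_blacklisted("file://" + quote(path), prefixes)

print(raw_match, uri_match, encoded_match)
```

Under these assumptions only the bare path matches the prefix as written. If the indexer actually compares against a file:// URI or a percent-encoded path (spaces become %20), the blacklist entry would need to be written in that same form, which would also be consistent with the space-related behaviour described above.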

feathered-arch · Mar 22 '22 18:03