open-semantic-search

How to crawl binary files (PDF, Word, etc.) stored on a remote website?

Open vaibhav-s opened this issue 4 years ago • 1 comment

Could OSS crawl files stored on a remote website, i.e. a Dropbox-like custom document system? Is it possible to use the Elasticsearch or Solr REST API in the backend to index binary files?

vaibhav-s, Aug 03 '20 16:08

Hi, I think I have the same use case here: crawl a whole website but index only certain document types, while still following links on HTML pages that are not themselves meant to be indexed. As far as I understand, that is not possible at the moment, because crawling (parsing HTML and extracting links to follow) and indexing can only be blacklisted together.

mmoossen, Oct 31 '21 15:10
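
To illustrate the use case described above, here is a minimal sketch of keeping the "follow" decision separate from the "index" decision. It is not Open Semantic Search code and does not use its configuration; the names `INDEX_EXTENSIONS`, `LinkExtractor`, `should_index` and `plan` are hypothetical, and the extension list is only an assumed example policy. It uses the Python standard library only and does no network access: it takes the HTML of one page and splits its links into pages to keep crawling and documents to hand to an indexer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Hypothetical policy: only these document types are sent to the indexer.
INDEX_EXTENSIONS = {".pdf", ".doc", ".docx"}


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extension(url):
    """Lower-cased extension of the URL path, ignoring query and fragment."""
    return posixpath.splitext(urlparse(url).path)[1].lower()


def should_index(url):
    """True if the URL points to a document type we want in the index."""
    return extension(url) in INDEX_EXTENSIONS


def plan(page_url, html):
    """Split the links of one page into (to_follow, to_index).

    Simplification: every http(s) link that is not an indexable document
    is treated as a page to follow; a real crawler would also filter by
    host, depth and content type.
    """
    parser = LinkExtractor()
    parser.feed(html)
    to_follow, to_index = [], []
    for href in parser.links:
        url = urljoin(page_url, href)  # resolve relative links
        if urlparse(url).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        (to_index if should_index(url) else to_follow).append(url)
    return to_follow, to_index
```

With this split, an HTML page is parsed for links but never indexed itself, while a linked PDF or Word file goes to the indexer and is not parsed for further links.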