open-semantic-search
How to crawl binary files (PDF, Word, etc.) stored on a remote website?
Could Open Semantic Search crawl files stored on a remote website, e.g. a Dropbox-like custom document system? Is it possible to use the Elasticsearch or Solr REST API in the backend to index binary files?
Hi, I think I have the same use case here: crawl a whole website, but index only certain document types, while still following links on HTML pages that are not themselves meant to be indexed. As far as I understand it, that is not possible at the moment, because crawling (parsing HTML and extracting links to follow) and indexing can only be blacklisted together.
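To make the request concrete, here is a rough sketch of what I mean by separating the two decisions. This is not Open Semantic Search code or its configuration. It is a standalone Python illustration using only the standard library, and every name in it (`INDEX_EXTENSIONS`, `should_index`, `should_follow`, `crawl`, the `fetch` callback) is hypothetical. The idea is that "follow this link" and "index this URL" are two independent filters, so HTML pages can be parsed for links without ever being handed to the indexer.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Assumption: the document types we want in the index.
INDEX_EXTENSIONS = {".pdf", ".doc", ".docx"}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extension(url):
    """Lower-cased file extension of the URL path, e.g. '.pdf'."""
    return posixpath.splitext(urlparse(url).path)[1].lower()


def should_index(url):
    """Index filter: only the wanted document types."""
    return extension(url) in INDEX_EXTENSIONS


def should_follow(url, start_host):
    """Crawl filter: stay on the start host and parse non-document pages."""
    return urlparse(url).netloc == start_host and not should_index(url)


def crawl(start_url, fetch):
    """Walk the site breadth-first and return the URLs that should be indexed.

    `fetch` is a caller-supplied function mapping a URL to its HTML text
    (hypothetical; a real one would issue an HTTP request).
    """
    start_host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    to_index = []
    while queue:
        page = queue.popleft()
        parser = LinkExtractor()
        parser.feed(fetch(page))
        for href in parser.links:
            url = urljoin(page, href)
            if url in seen:
                continue
            seen.add(url)
            if should_index(url):
                to_index.append(url)   # hand over to the indexer
            elif should_follow(url, start_host):
                queue.append(url)      # parse for links, but do not index
    return to_index
```

With filters split like this, the HTML pages are only used to discover links, and only the URLs returned by `crawl` would be passed on for text extraction and indexing.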