crawl-anywhere
Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration.
Some pages have to be rewritten. - http://www.crawl-anywhere.com/configure-a-web-site-to-be-crawled - Done: http://www.crawl-anywhere.com/solr-3-x-or-solr-4-x/
If an HTML page is returned with character encoding ISO-8859-1 and the pipeline runs with system encoding UTF-8, DocTextExtractor produces invalid characters in doExtract. In line 346 if (input==null && rawData!=null)...
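The mismatch described in this report can be illustrated with a minimal sketch. It assumes the raw bytes and the HTTP Content-Type value are both available at the point of extraction. The class DeclaredCharsetDecoder and its methods are hypothetical names for illustration and are not part of Crawl-Anywhere; the idea is only to decode with the charset the server declared instead of the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not Crawl-Anywhere code: decodes raw page bytes with
// the charset named in the Content-Type value rather than the system default.
final class DeclaredCharsetDecoder {
    // Matches a charset parameter such as: text/html; charset=ISO-8859-1
    private static final Pattern CHARSET_PARAM =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or the fallback if none is declared
    // or the declared name is unknown to this JVM.
    static Charset resolve(String contentType, Charset fallback) {
        if (contentType != null) {
            Matcher m = CHARSET_PARAM.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                }
            }
        }
        return fallback;
    }

    // Assumption: UTF-8 is a reasonable fallback when nothing is declared.
    static String decode(byte[] rawData, String contentType) {
        return new String(rawData, resolve(contentType, StandardCharsets.UTF_8));
    }
}
```

With this approach, ISO-8859-1 bytes accompanied by "text/html; charset=ISO-8859-1" decode to the intended characters even when the JVM default encoding is UTF-8. A charset declared only in an HTML meta tag is not handled by this sketch.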
https://groups.google.com/forum/#!topic/crawl-anywhere/s6Bdz2ZW-28
Hello, I would like to set a crawl delay of x seconds per site to avoid hammering the crawled site, so that the next page on a domain is accessed only every x seconds....
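One way such a per-domain delay could work is sketched below. This is not how Crawl-Anywhere schedules requests; HostDelayScheduler and its methods are hypothetical names, and the sketch assumes the crawler can call a hook before each fetch. It records, per host, the earliest time the next request may start.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical politeness scheduler, not Crawl-Anywhere code: spaces out
// requests to the same host by a fixed delay.
final class HostDelayScheduler {
    private final long delayMillis;
    // Earliest time (epoch millis) at which each host may be fetched again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    HostDelayScheduler(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    // Reserves the next fetch slot for the host and returns how many
    // milliseconds the caller should wait before using it.
    synchronized long reserve(String host, long nowMillis) {
        String key = host.toLowerCase(Locale.ROOT);
        long earliest = Math.max(nowMillis, nextAllowed.getOrDefault(key, 0L));
        nextAllowed.put(key, earliest + delayMillis);
        return earliest - nowMillis;
    }

    // Blocks the calling thread until the reserved slot for the host arrives.
    void await(String host) throws InterruptedException {
        long wait = reserve(host, System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }
}
```

A crawler thread would call await(host) right before each request. Because the slot is reserved inside reserve, several threads fetching from the same host queue up one delay apart, while requests to other hosts are not held back.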
Just checked the Web search PHP application and it seems like an ideal base for a post-processing API that could have calls like Get URL contents, Get ALL domains...
https://groups.google.com/forum/#!topic/crawl-anywhere/B6CNSiWYCzw
Terminated web site crawls remain in crawling list for a very long time.