
Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration.

Results 38 crawl-anywhere issues

Some pages have to be rewritten. - http://www.crawl-anywhere.com/configure-a-web-site-to-be-crawled - Done: http://www.crawl-anywhere.com/solr-3-x-or-solr-4-x/

enhancement

If an HTML page is returned with character encoding ISO-8859-1 and the pipeline runs with system encoding UTF-8, DocTextExtractor produces invalid characters in doExtract. In line 346: if (input==null && rawData!=null)...

bug
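
The encoding report above describes bytes being decoded with the platform default charset instead of the charset the page declares. The actual DocTextExtractor code is not shown here, so the following is only a minimal sketch of the general remedy: decode the raw bytes with the declared charset and fall back to UTF-8 when it is missing or unknown. The class name CharsetDecodeSketch and the method decode are hypothetical and not part of Crawl-Anywhere.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Crawl-Anywhere code: illustrates explicit charset handling.
final class CharsetDecodeSketch {

    /**
     * Decodes raw page bytes with the charset declared for the page
     * (for example taken from the Content-Type header). Falls back to
     * UTF-8 when no charset is declared or the name is not supported.
     */
    static String decode(byte[] rawData, String declaredCharset) {
        Charset charset = StandardCharsets.UTF_8; // assumed fallback
        if (declaredCharset != null) {
            try {
                charset = Charset.forName(declaredCharset.trim());
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: keep the fallback.
            }
        }
        // Never rely on the platform default charset here.
        return new String(rawData, charset);
    }
}
```

With this approach, ISO-8859-1 bytes are decoded correctly regardless of the system encoding the pipeline runs with, because the charset is always passed explicitly.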

https://groups.google.com/forum/#!topic/crawl-anywhere/s6Bdz2ZW-28

Task

https://groups.google.com/forum/#!topic/crawl-anywhere/pyGVxCwsMOw

Task

Hello, I would like to set an x-second crawl delay for sites to avoid hammering the crawled site. So, access the next page on a domain only every x seconds....

question
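
The crawl-delay question above can be illustrated with a small per-host scheduler. This is a sketch under assumptions, not how Crawl-Anywhere implements or configures delays: the class PerHostDelaySketch and its methods reserve and await are hypothetical names. Each host gets a "next allowed" timestamp, and a fetch for that host waits until the timestamp is reached.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not Crawl-Anywhere code: enforces a fixed delay per host.
final class PerHostDelaySketch {
    private final long delayMillis;
    // Earliest time (epoch millis) at which the next fetch for a host may start.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PerHostDelaySketch(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /**
     * Reserves the next fetch slot for the host and returns how many
     * milliseconds the caller should wait before fetching.
     */
    synchronized long reserve(String host, long nowMillis) {
        long earliest = nextAllowed.getOrDefault(host, nowMillis);
        long start = Math.max(earliest, nowMillis);
        nextAllowed.put(host, start + delayMillis);
        return start - nowMillis;
    }

    /** Blocks the calling thread until a fetch for the host is allowed. */
    void await(String host) throws InterruptedException {
        long wait = reserve(host, System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }
}
```

A crawler thread would call await(host) immediately before each request. Because the delay is tracked per host, pages on other domains can still be fetched without waiting.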

Just checked the Web search PHP application and it seems like an ideal base for a post-processing API that could have calls like Get URL contents, Get ALL domains...

question

https://groups.google.com/forum/#!topic/crawl-anywhere/B6CNSiWYCzw

Task

Terminated web site crawls remain in the crawling list for a very long time.

Task