crawl-anywhere
Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration.
As free IP geolocation web services are often unavailable or deprecated, allow easy implementation of a custom class. http://www.geoiptool.com/ no longer provides information as XML.
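One way to picture the pluggable lookup requested above is a small interface with interchangeable implementations. The sketch below is in Python for brevity and is not taken from the Crawl-Anywhere code base; `GeoLocator`, `StaticTableLocator`, and `FallbackLocator` are hypothetical names, and the in-memory table stands in for whatever web service or database a real implementation would call.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class GeoLocator(ABC):
    """Hypothetical plug-in interface: map an IP address to a country code."""

    @abstractmethod
    def locate(self, ip: str) -> Optional[str]:
        """Return a country code, or None when the lookup fails."""


class StaticTableLocator(GeoLocator):
    """Toy implementation backed by an in-memory table (no network access)."""

    def __init__(self, table: Dict[str, str]) -> None:
        self._table = table

    def locate(self, ip: str) -> Optional[str]:
        return self._table.get(ip)


class FallbackLocator(GeoLocator):
    """Try several locators in order and return the first answer found."""

    def __init__(self, locators: List[GeoLocator]) -> None:
        self._locators = locators

    def locate(self, ip: str) -> Optional[str]:
        for locator in self._locators:
            try:
                result = locator.locate(ip)
            except Exception:
                # An unavailable or broken service should not stop the chain.
                continue
            if result is not None:
                return result
        return None
```

With this shape, replacing a deprecated service means writing one new `GeoLocator` subclass and adding it to the chain, without touching the calling code.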
Add a maximum number of pages option. Should this be the maximum number of pages fetched on the server or the maximum number of pages sent to the pipeline? This...
Create a fast recrawl option. This option could allow recrawling a web site often and quickly by crawling only to a maximum depth of 1 or 2 levels for...
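A depth-limited crawl with an optional page cap can be sketched as a breadth-first traversal. This is a minimal illustration in Python, not the project's implementation: the `crawl` function, its parameters, and the in-memory link table are all hypothetical, and `max_pages` here counts pages fetched, which is only one of the two interpretations raised in the page-limit item.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional


def crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int,
    max_pages: Optional[int] = None,
) -> List[str]:
    """Breadth-first crawl that stops following links beyond max_depth.

    get_links is a stand-in for fetching a page and extracting its links.
    max_pages (assumption: counts fetched pages) caps the total visited.
    """
    visited = {start_url}
    fetched: List[str] = []
    queue = deque([(start_url, 0)])

    while queue:
        if max_pages is not None and len(fetched) >= max_pages:
            break
        url, depth = queue.popleft()
        fetched.append(url)
        if depth >= max_depth:
            # Page is fetched, but its links are not followed.
            continue
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return fetched


# Toy link graph standing in for a real web site.
site = {"/": ["/a", "/b"], "/a": ["/a1"], "/b": ["/"], "/a1": []}
shallow = crawl("/", lambda u: site.get(u, []), max_depth=1)
```

In this toy graph, `max_depth=1` fetches the start page and the pages it links to directly, which is the kind of shallow pass a frequent recrawl would rely on.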
https://groups.google.com/forum/#!topic/crawl-anywhere/tdkJNIjuB5E
see https://groups.google.com/forum/#!topic/crawl-anywhere/3WPCZuwtZCc
According to this message in the forum, implement support for the NTLM authentication scheme: https://groups.google.com/forum/#!topic/crawl-anywhere/TiAz0rGiIfw
Implement a multi-term suggester (see http://wiki.apache.org/solr/Suggester and http://blog.trifork.com/2012/02/15/different-ways-to-make-auto-suggestions-with-solr/). At the same time, check the "did you mean" feature.
- Check logging consistency (verbose / no verbose)
- Change the action option in testScript class "To test the meta extraction with the script tool, you need to use the...