bejean
In the search interface, add an option to boost recent documents based on either the real publish date or the first crawl date.
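A minimal sketch of how such a recency boost could be computed, assuming a reciprocal decay of the form a / (m * age + b), the same shape as Solr's `recip` function applied to a document's age in milliseconds. The function names, the `prefer_publish_date` option, and the one-year half-boost constant are hypothetical and only illustrate the idea; they are not taken from the Crawl Anywhere code base.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Approximate number of milliseconds in one year (assumed scale for the decay).
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000


def pick_reference_date(publish_date: Optional[datetime],
                        first_crawl_date: datetime,
                        prefer_publish_date: bool = True) -> datetime:
    """Choose the date the boost is based on (hypothetical option).

    Falls back to the first crawl date when no publish date is known.
    """
    if prefer_publish_date and publish_date is not None:
        return publish_date
    return first_crawl_date


def recency_boost(reference_date: datetime, now: datetime,
                  m: float = 1.0 / MS_PER_YEAR,
                  a: float = 1.0, b: float = 1.0) -> float:
    """Reciprocal decay a / (m * age_ms + b).

    With the defaults, a brand-new document gets 1.0 and a one-year-old
    document gets about 0.5. Dates in the future are treated as age 0.
    """
    age_ms = max((now - reference_date).total_seconds() * 1000.0, 0.0)
    return a / (m * age_ms + b)


if __name__ == "__main__":
    now = datetime(2020, 1, 1, tzinfo=timezone.utc)
    crawled = now - timedelta(days=30)
    ref = pick_reference_date(None, crawled)
    print(recency_boost(ref, now))
```

The resulting value would typically be multiplied into the relevance score, so older documents are demoted smoothly rather than filtered out.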
Use Elasticsearch as an alternative to Solr.
Implies:
- pipeline mapping stage creation
- indexer update
- search interface update
Advantages:
- dynamic mapping for better multi-lingual indexing...
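To illustrate what dynamic mapping could bring to multi-lingual indexing, here is a hedged sketch that builds an index-creation body with Elasticsearch `dynamic_templates`. It assumes the typeless mapping layout used by recent Elasticsearch versions and a hypothetical field-naming convention in which text fields end with a language suffix (for example `title_fr`). The `english`, `french`, and `german` analyzers are built-in Elasticsearch language analyzers; the helper name and the suffix table are invented for this example.

```python
import json

# Hypothetical mapping from field-name suffix to a built-in language analyzer.
LANGUAGE_ANALYZERS = {"en": "english", "fr": "french", "de": "german"}


def build_index_body(analyzers=None):
    """Return an index-creation body with one dynamic template per language.

    Any new string field whose name matches "*_<suffix>" is mapped as a
    text field using the analyzer registered for that suffix.
    """
    if analyzers is None:
        analyzers = LANGUAGE_ANALYZERS
    templates = []
    for suffix, analyzer in analyzers.items():
        templates.append({
            f"text_{suffix}": {
                "match_mapping_type": "string",
                "match": f"*_{suffix}",
                "mapping": {"type": "text", "analyzer": analyzer},
            }
        })
    return {"mappings": {"dynamic_templates": templates}}


if __name__ == "__main__":
    # The JSON printed here is what would be sent when creating the index.
    print(json.dumps(build_index_body(), indent=2))
```

With such templates in place, the pipeline mapping stage would only need to emit language-suffixed field names; no per-field schema change would be required when a new field appears.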
Create pipeline stages to add NLP features such as:
- named entity extraction
- summarization
Look at:
- Weka: http://www.cs.waikato.ac.nz/~ml/index.html
- OpenNLP
- GATE
- UIMA
Add a "check for deletion period" parameter so that the check for deletion does not run at each crawl. Setting this parameter to 0 disables the check for deletion.
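The intended behavior could be sketched as follows. This is only an illustration under stated assumptions: the period is expressed in days (the unit is not specified above), a source that has never been checked is checked on its next crawl, and negative values are treated like 0. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_check_for_deletion(last_check: Optional[datetime],
                              now: datetime,
                              period_days: int) -> bool:
    """Decide whether this crawl should also run the check for deletion.

    period_days == 0 disables the check entirely (negative values are
    treated the same way, which is an assumption of this sketch).
    """
    if period_days <= 0:
        return False
    if last_check is None:
        # Never checked before: run the check on this crawl (assumption).
        return True
    return now - last_check >= timedelta(days=period_days)


if __name__ == "__main__":
    now = datetime(2020, 1, 10)
    print(should_check_for_deletion(datetime(2020, 1, 1), now, 7))
```

The crawler would record the time of the last completed check per source and consult this predicate at the start of each crawl.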
Some pages have to be rewritten:
- http://www.crawl-anywhere.com/configure-a-web-site-to-be-crawled
- Done: http://www.crawl-anywhere.com/solr-3-x-or-solr-4-x/
https://groups.google.com/forum/#!topic/crawl-anywhere/s6Bdz2ZW-28
https://groups.google.com/forum/#!topic/crawl-anywhere/B6CNSiWYCzw
Terminated web site crawls remain in the crawling list for a very long time.