Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration.

Results 38 crawl-anywhere issues

- Allow removing a value in a target element (https://groups.google.com/forum/#!topic/crawl-anywhere/KmsyjPsw_vA)
- Check documentation
- Add unit test

enhancement

Use CloudSolrServer with SolrJ

enhancement

There are several direct dependencies on HTML parser libraries:
- jsoup
- jericho-html
- htmlcleaner

Try to keep only jsoup (already used by snacktory).

Task

Redirect to the login page in all cases when a session timeout occurs.

enhancement

To determine the real publish date of a document, use the date provided by sitemap files when available.

enhancement
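
As a rough illustration of the sitemap idea above, the sketch below reads the `<lastmod>` value for each URL in a sitemap, using the namespace defined by the sitemaps.org protocol. It is not Crawl-Anywhere code: the function names are hypothetical, and falling back to the first crawl date when a URL has no sitemap date is an assumption. Note that `<lastmod>` records the last modification, so it only approximates a publish date.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace used by sitemap files that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_dates(sitemap_xml: str) -> dict:
    """Map each <loc> URL to its <lastmod> date; entries without a usable date are skipped."""
    dates = {}
    root = ET.fromstring(sitemap_xml)
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not loc or not lastmod:
            continue
        try:
            # Older Python versions do not accept a trailing "Z", so normalise it.
            dates[loc] = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
        except ValueError:
            continue  # unsupported date format: ignore rather than guess
    return dates


def publish_date(url: str, dates: dict, first_crawl_date: datetime) -> datetime:
    """Prefer the sitemap date; otherwise fall back to the first crawl date (assumption)."""
    return dates.get(url, first_crawl_date)
```

A crawler could call `sitemap_dates` once per sitemap file and then look up each crawled URL with `publish_date`.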

In the search interface, add an option to boost recent documents based on the real publish date or the first crawl date.

enhancement
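
One possible way to implement such an option is Solr's documented recency function, `recip(ms(NOW,<date field>),3.16e-11,1,1)`, passed as the multiplicative `boost` parameter of the eDisMax query parser. The sketch below only builds the request parameters. The helper name and the field names `publish_date` and `first_crawl_date` are hypothetical and would need to match the actual index schema.

```python
def recency_boost_params(query: str, date_field: str, enabled: bool = True) -> dict:
    """Build Solr request parameters, optionally boosting recent documents.

    date_field is the date field the boost is based on, for example a
    publish-date field or a first-crawl-date field (names are assumptions).
    """
    params = {"q": query, "defType": "edismax"}
    if enabled:
        # recip(x, m, a, b) = a / (m * x + b). With m = 3.16e-11 (about one
        # divided by the number of milliseconds in a year), a document one
        # year old gets roughly half the boost of a brand-new one.
        # NOW/HOUR rounds the current time so the function caches better.
        params["boost"] = f"recip(ms(NOW/HOUR,{date_field}),3.16e-11,1,1)"
    return params
```

Calling `recency_boost_params("solr", "publish_date")` or `recency_boost_params("solr", "first_crawl_date")` would correspond to the two choices named in the issue, and `enabled=False` leaves the ranking untouched.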

Can you please add DjVu indexing support? There is a tool like pdftotext available for DjVu files: http://djvu.sourceforge.net/doc/man/djvutxt.html. I like Crawl-Anywhere because it is super fast. Sadly I'm not...

enhancement

Use Elasticsearch as an alternative to Solr.

Implies:
- pipeline mapping stage creation
- indexer update
- search interface update

Advantages:
- dynamic mapping for better multi-lingual indexing...

enhancement

Create pipeline stages to add NLP features such as:
- named entity extraction
- summarization

Look at:
- Weka - http://www.cs.waikato.ac.nz/~ml/index.html
- OpenNLP
- GATE
- UIMA

enhancement

Add a "check for deletion period" parameter to avoid running the check for deletion at each crawl. A value of 0 disables the check for deletion.

enhancement
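
The intended behaviour of this parameter could look roughly like the sketch below. It is an illustration only: the function name is hypothetical, and expressing the period in days, treating negative values like 0, and running the check when no previous check is recorded are all assumptions that the issue does not state.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_check_for_deletion(
    period_days: int, last_check: Optional[datetime], now: datetime
) -> bool:
    """Decide whether the current crawl should run the check for deletion."""
    if period_days <= 0:
        # 0 disables the check (negative values are treated the same way here).
        return False
    if last_check is None:
        # No previous check recorded: run it once to establish a baseline.
        return True
    # Run the check only when at least one full period has elapsed.
    return now - last_check >= timedelta(days=period_days)
```

With a period of 7, for example, a crawl three days after the last check would skip the deletion check, while a crawl seven or more days later would run it.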