crawl-anywhere issues

Fieldmapping stage enhancement

- Allow removing a value in the target element (https://groups.google.com/forum/#!topic/crawl-anywhere/KmsyjPsw_vA)
- Check documentation
- Add unit test

bejean

enhancement

Optimize indexer for SolrCloud

Use CloudSolrServer with SolrJ

bejean

enhancement

Keep a direct dependency on only one HTML parser library

There are several direct dependencies on HTML parser libraries:
- jsoup
- jericho-html
- htmlcleaner

Try to keep only jsoup (already used by snacktory).

bejean

Task

Better handle session time-out in admin interface

Redirect to the login page whenever a session time-out occurs.

bejean

enhancement

Use document date provided by sitemap files

To know the real publish date of a document, use the date provided by sitemap files when it is available.
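A minimal sketch of reading dates from a sitemap, assuming the date in question is the standard `<lastmod>` element of the sitemaps.org 0.9 schema. The function name `sitemap_dates` and the sample URLs are hypothetical and not part of crawl-anywhere. Note that `<lastmod>` is a last-modification date, so treating it as the publish date is itself an assumption.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_dates(xml_text):
    """Map each <loc> URL to its <lastmod> date, when one is provided."""
    dates = {}
    root = ET.fromstring(xml_text)
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not loc or not lastmod:
            continue  # no date available: the caller keeps its own crawl date
        try:
            # Older Python versions do not accept a trailing "Z" in fromisoformat.
            dates[loc] = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
        except ValueError:
            pass  # unparseable date: ignore it rather than guess
    return dates


# Hypothetical sample: one entry with a date, one without.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/a</loc><lastmod>2013-05-01</lastmod></url>
  <url><loc>http://example.com/b</loc></url>
</urlset>"""
```

URLs without a usable `<lastmod>` are simply absent from the result, so the crawler would fall back to whatever date it already records.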

bejean

enhancement

Boost recently indexed documents

In the search interface, add an option to boost recent documents based on the real publish date or the first crawl date.
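The issue does not say how the boost would be computed. One common approach with Solr is a reciprocal function of document age, such as `recip(ms(NOW,datefield),3.16e-11,1,1)`, where `recip(x,m,a,b)` evaluates to `a / (m*x + b)`. The sketch below only reproduces that arithmetic to show the shape of the curve; `recency_boost` and its defaults are illustrative assumptions, not crawl-anywhere code.

```python
# Milliseconds in an average year; 3.16e-11 is roughly 1 / MS_PER_YEAR.
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000


def recency_boost(age_ms, m=3.16e-11, a=1.0, b=1.0):
    """Reciprocal decay a / (m * age + b): 1.0 for a new document, lower as it ages."""
    return a / (m * age_ms + b)


# Boost for documents of increasing age, in years.
boosts = {years: recency_boost(years * MS_PER_YEAR) for years in (0, 1, 5)}
```

With these constants a brand-new document gets a multiplier of 1.0 and a one-year-old document about half of that. The age would be measured from either the publish date or the first crawl date, depending on the option chosen.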

bejean

enhancement

Add DjVu support

6

Can you please add DjVu indexing support? There is a tool like pdftotext available for djvu files: http://djvu.sourceforge.net/doc/man/djvutxt.html I like crawl anywhere, because it is super fast. Sadly I'm not...

ghost

enhancement

Elasticsearch integration

Use Elasticsearch as an alternative to Solr.

Implies:
- pipeline mapping stage creation
- indexer update
- search interface update

Advantages:
- dynamic mapping for better multi-lingual indexing...

bejean

enhancement

NLP tools integration

Create pipeline stages in order to add NLP features like:
- named entity extraction
- summarization

Look at:
- Weka - http://www.cs.waikato.ac.nz/~ml/index.html
- OpenNLP
- GATE
- UIMA

bejean

enhancement

Check for deletion period

Add a "check for deletion period" parameter in order to avoid running the check for deletion at each crawl. Setting this parameter to 0 disables the check for deletion.
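The intended behaviour could be sketched as below. The issue does not give the unit of the period, so days are assumed here. The function name and the `last_check` argument (the time of the previous deletion check) are hypothetical.

```python
from datetime import datetime, timedelta


def should_check_for_deletion(period_days, last_check, now):
    """Decide whether this crawl should also run the check for deletion.

    period_days: assumed unit is days; 0 disables the check entirely.
    last_check:  datetime of the previous check, or None if it never ran.
    """
    if period_days <= 0:
        return False  # 0 disables the check for deletion
    if last_check is None:
        return True   # never checked before: run it now
    return now - last_check >= timedelta(days=period_days)


# Example: with a 7-day period, a check made 3 days ago is not repeated yet.
now = datetime(2014, 1, 10)
recent = should_check_for_deletion(7, datetime(2014, 1, 7), now)
overdue = should_check_for_deletion(7, datetime(2014, 1, 1), now)
```

Passing `now` explicitly keeps the decision easy to unit-test.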

bejean

enhancement
