Add analyzer for Chinese or mixed Chinese-English text [was: crate can add lucence's analysis plugin]

Open loredp opened this issue 2 years ago • 6 comments

Problem Statement

Can CrateDB add Lucene's analysis plugin?

edit: Analyzing Chinese or mixed Chinese-English texts with CrateDB

Possible Solutions

edit: Add Lucene analyzers for Chinese and mixed Chinese-English

Considered Alternatives

edit: only external solutions might be applicable

loredp avatar Aug 18 '22 11:08 loredp

Hi @loredp

Welcome to crate/crate!

Could you be more explicit about what you mean by "Lucene analysis plugin"? Do you maybe have a link?

CrateDB does support all kinds of Lucene analyzers out of the box. https://crate.io/docs/crate/reference/en/5.0/general/ddl/analyzers.html

proddata avatar Aug 18 '22 11:08 proddata

Thank you! Like this: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-smartcn.html#analysis-smartcn-install

loredp avatar Aug 19 '22 01:08 loredp

Looks like this is what they want added: https://github.com/elastic/elasticsearch/tree/main/plugins/analysis-smartcn

robd003 avatar Aug 19 '22 18:08 robd003

Looks like this is what they want added: https://github.com/elastic/elasticsearch/tree/main/plugins/analysis-smartcn

Does that mean CrateDB supports this plugin?

loredp avatar Aug 22 '22 01:08 loredp

@loredp Elasticsearch plugins are not officially supported, but they may work with small modifications. Our current TokenizerFactory API is based on Elasticsearch v6.8, so only analysis plugins for that version work. However, CrateDB's Lucene version has advanced a lot since then: CrateDB 5.0.x uses Lucene 9.2.0, while analysis-smartcn 6.8.23 expects Lucene 7.7.3, so it is not guaranteed to work. A possible solution is to build the plugin on your own using the related Lucene analyzer dependency.

For the smartcn analyzer, the following steps may work; at least creating a table using the analyzer succeeded ;)

  1. Download the plugin:
wget https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-smartcn/analysis-smartcn-6.8.23.zip
  2. Extract it to a plugin folder inside plugins/:
unzip analysis-smartcn-6.8.23.zip -d crate-5.0.0/plugins/es-analysis-smartcn
  3. Adjust the plugin's descriptor, replacing the Elasticsearch version with CrateDB's:
sed -i.bak 's/elasticsearch.version=6.8.23/cratedb.version=5.0.0/;' plugins/es-analysis-smartcn/plugin-descriptor.properties
  4. Restart CrateDB
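
The descriptor change in the list above is the step most likely to fail silently, since sed exits successfully even when nothing matched. Below is a minimal sketch of that one step as a shell function that verifies the result. The function name patch_descriptor is hypothetical (it is not part of CrateDB or the plugin), and it assumes the descriptor contains an elasticsearch.version= line, as in the command above.

```shell
#!/bin/sh
# Sketch only: patch_descriptor is a hypothetical helper, not a CrateDB tool.
# It rewrites the Elasticsearch version key in a plugin descriptor to the
# cratedb.version key and checks that the rewrite actually took effect.
patch_descriptor() {
    descriptor="$1"       # path to plugin-descriptor.properties
    cratedb_version="$2"  # target CrateDB version, e.g. 5.0.0

    if [ ! -f "$descriptor" ]; then
        echo "descriptor not found: $descriptor" >&2
        return 1
    fi

    # Keep a .bak copy and replace whatever Elasticsearch version is declared.
    sed -i.bak "s/^elasticsearch\.version=.*/cratedb.version=${cratedb_version}/" "$descriptor"

    # sed succeeds even without a match, so confirm the new line is present.
    grep -qxF "cratedb.version=${cratedb_version}" "$descriptor"
}
```

Run from the CrateDB installation directory, a call would look like: patch_descriptor plugins/es-analysis-smartcn/plugin-descriptor.properties 5.0.0. A non-zero exit status means the descriptor was missing or did not contain the expected key.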

seut avatar Aug 31 '22 16:08 seut

@loredp We also welcome any contribution, or maybe you want to create and host a CrateDB variant of that plugin on your own.

seut avatar Aug 31 '22 16:08 seut