eliasdb

Support CJK on Full Text Search

Open 1l0 opened this issue 5 years ago • 4 comments

Words in CJK sentences are not separated by spaces. Currently eliasdb cannot search for a specific word inside a CJK sentence. It would be great to be able to do that.

1l0 avatar Dec 11 '19 12:12 1l0

Hey, I don't have any experience with CJK sentences. Do you have any suggestions on how eliasdb could support this? Maybe a config option for eliasdb.config.json which lets you define a list of "separator" characters?

krotik avatar Dec 11 '19 19:12 krotik

If we look at the introduction of Ruby in Japanese here: https://www.ruby-lang.org/ja/, we see this:

オープンソースの動的なプログラミング言語で、 シンプルさと高い生産性を備えています。 エレガントな文法を持ち、自然に読み書きができます。

Neither spaces nor anything else is used to separate the words. We only have the comma 、 and the end-of-sentence mark 。. In CJK languages the reader has to find the word boundaries based on grammar or dictionaries, so defining a list of separator characters will not solve this. Rather, EliasDB should be extended to make it possible to search for non-delimited substrings, which is generally useful anyway.
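
To make the substring idea concrete, here is a minimal sketch of one common way to do it: index every pair of adjacent characters (bigrams), then confirm candidates with a plain substring check. This is an illustration only and is not how EliasDB works today; the names `Index`, `NewIndex`, `Add` and `Search` are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// bigrams returns every pair of adjacent runes in s.
func bigrams(s string) []string {
	r := []rune(s)
	var out []string
	for i := 0; i+1 < len(r); i++ {
		out = append(out, string(r[i:i+2]))
	}
	return out
}

// Index is a hypothetical in-memory bigram index (not an EliasDB type).
type Index struct {
	docs     map[int]string
	postings map[string]map[int]bool // bigram -> IDs of documents containing it
}

func NewIndex() *Index {
	return &Index{docs: map[int]string{}, postings: map[string]map[int]bool{}}
}

// Add stores the text and records each of its bigrams.
func (ix *Index) Add(id int, text string) {
	ix.docs[id] = text
	for _, g := range bigrams(text) {
		if ix.postings[g] == nil {
			ix.postings[g] = map[int]bool{}
		}
		ix.postings[g][id] = true
	}
}

// Search returns the sorted IDs of documents that contain query as a substring.
func (ix *Index) Search(query string) []int {
	if query == "" {
		return nil
	}
	var hits []int
	grams := bigrams(query)
	for id, text := range ix.docs {
		// Cheap filter: every bigram of the query must occur in the document.
		ok := true
		for _, g := range grams {
			if !ix.postings[g][id] {
				ok = false
				break
			}
		}
		// The bigrams may occur in different places, so verify the real text.
		if ok && strings.Contains(text, query) {
			hits = append(hits, id)
		}
	}
	sort.Ints(hits)
	return hits
}

func main() {
	ix := NewIndex()
	ix.Add(1, "オープンソースの動的なプログラミング言語で、シンプルさと高い生産性を備えています。")
	ix.Add(2, "エレガントな文法を持ち、自然に読み書きができます。")
	fmt.Println(ix.Search("プログラミング"))
	fmt.Println(ix.Search("文法"))
}
```

No dictionary or grammar knowledge is needed for this, which is why it also works for non-CJK text. The cost is a larger index and matches that ignore word boundaries.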

beoran avatar Mar 09 '20 07:03 beoran

Another solution is to use a CJK text segmentation library. I just found one for Go:

https://github.com/go-ego/gse
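
To show roughly what such a library does, here is a minimal sketch of dictionary-based segmentation using forward maximum matching. It does not use gse and is not gse's algorithm; the `segment` function and the tiny dictionary are assumptions made up for the example. Real libraries ship large dictionaries and usually add statistical models on top.

```go
package main

import "fmt"

// segment splits text into words by forward maximum matching: at each
// position it takes the longest dictionary entry (up to maxLen runes) that
// matches, and falls back to a single rune when nothing matches.
func segment(text string, dict map[string]bool, maxLen int) []string {
	if maxLen < 1 {
		maxLen = 1
	}
	r := []rune(text)
	var words []string
	for i := 0; i < len(r); {
		n := maxLen
		if i+n > len(r) {
			n = len(r) - i
		}
		// Shrink the window until it matches a dictionary word or is one rune.
		for ; n > 1; n-- {
			if dict[string(r[i:i+n])] {
				break
			}
		}
		words = append(words, string(r[i:i+n]))
		i += n
	}
	return words
}

func main() {
	// Hypothetical toy dictionary, for illustration only.
	dict := map[string]bool{"自然": true, "読み書き": true, "できます": true}
	fmt.Println(segment("自然に読み書きができます", dict, 4))
}
```

The resulting words could then be fed to the existing full text index as if they had been separated by spaces.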

beoran avatar Mar 12 '20 07:03 beoran

Supporting CJK requires stemming.

bleve has some of this. gse also looks good.

gedw99 avatar May 18 '21 15:05 gedw99