eliasdb
Support CJK on Full Text Search
Words in CJK sentences are not separated by spaces, so eliasdb currently cannot search for a specific word inside a CJK sentence. It would be great to be able to do that.
Hey, I don't have any experience with CJK sentences. Do you have any suggestions on how eliasdb could support this? Maybe a config option in eliasdb.config.json which lets you define a list of "separator" characters?
If we look at the introduction of Ruby in Japanese here: https://www.ruby-lang.org/ja/, we see this:
オープンソースの動的なプログラミング言語で、 シンプルさと高い生産性を備えています。 エレガントな文法を持ち、自然に読み書きができます。
Neither spaces nor anything else is used to separate the words. We only have the comma 、 and the end-of-sentence mark 。. In CJK languages the reader has to find the word boundaries based on grammar or dictionaries, so defining a list of separator characters will not solve this. Rather, EliasDB should be extended to make it possible to search for non-delimited substrings, which would be generally useful.
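One common way to search non-delimited substrings without a dictionary is a character bigram index. The sketch below is only an illustration of that idea, not how EliasDB works today: `BigramIndex`, `Add` and `Search` are hypothetical names, and the index lives in memory. Each document is split into overlapping two-rune windows, a query is split the same way, and candidates that contain every query bigram are then verified with a plain substring check.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// bigrams splits s into overlapping two-rune windows.
// It works on runes, so multi-byte CJK characters stay intact.
func bigrams(s string) []string {
	r := []rune(s)
	if len(r) < 2 {
		return nil
	}
	out := make([]string, 0, len(r)-1)
	for i := 0; i+1 < len(r); i++ {
		out = append(out, string(r[i:i+2]))
	}
	return out
}

// BigramIndex (hypothetical) maps each bigram to the set of
// document IDs whose text contains it.
type BigramIndex struct {
	postings map[string]map[int]struct{}
	docs     map[int]string
}

func NewBigramIndex() *BigramIndex {
	return &BigramIndex{
		postings: make(map[string]map[int]struct{}),
		docs:     make(map[int]string),
	}
}

// Add stores the text and registers all of its bigrams under id.
func (ix *BigramIndex) Add(id int, text string) {
	ix.docs[id] = text
	for _, g := range bigrams(text) {
		if ix.postings[g] == nil {
			ix.postings[g] = make(map[int]struct{})
		}
		ix.postings[g][id] = struct{}{}
	}
}

// Search returns the sorted IDs of documents containing query as a substring.
func (ix *BigramIndex) Search(query string) []int {
	var hits []int
	if query == "" {
		return hits
	}
	grams := bigrams(query)
	if len(grams) == 0 {
		// A single-rune query has no bigrams: fall back to a full scan.
		for id, text := range ix.docs {
			if strings.Contains(text, query) {
				hits = append(hits, id)
			}
		}
		sort.Ints(hits)
		return hits
	}
	// Candidates must contain every query bigram; the final
	// strings.Contains check removes false positives where the
	// bigrams occur but not contiguously.
	for id := range ix.postings[grams[0]] {
		ok := true
		for _, g := range grams[1:] {
			if _, found := ix.postings[g][id]; !found {
				ok = false
				break
			}
		}
		if ok && strings.Contains(ix.docs[id], query) {
			hits = append(hits, id)
		}
	}
	sort.Ints(hits)
	return hits
}

func main() {
	ix := NewBigramIndex()
	ix.Add(1, "オープンソースの動的なプログラミング言語で、")
	ix.Add(2, "シンプルさと高い生産性を備えています。")
	ix.Add(3, "エレガントな文法を持ち、自然に読み書きができます。")

	fmt.Println(ix.Search("生産性"))
	fmt.Println(ix.Search("文法"))
}
```

The trade-off is index size: every pair of adjacent characters becomes a term, so the index grows faster than a word-based one, but no language-specific dictionary is needed.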
Another solution is to use a CJK text segmentation library. I just found one for Go:
https://github.com/go-ego/gse
Supporting CJK would require stemming.
bleve has some of these. gse also looks good.