Memex Better support for non-latin characters

Open blackforestboi opened this issue 7 years ago • 3 comments

As described in this community post, non-Latin characters (e.g. Chinese, Japanese) are not parsed correctly and are therefore not searchable.

Potential solution: detect the language of the website and, if it is non-Latin, split the text into individual characters before indexing?
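A minimal sketch of that idea, to make it concrete. The function name `tokenizeForIndex`, the code point ranges, and the rule for what counts as a Latin word character are all assumptions made for illustration, not Memex's actual indexing code. Rather than detecting the page language, it checks each character: CJK characters become single-character tokens, and Latin text is still split into whole words.

```typescript
// Code point ranges treated as CJK here. Illustrative and incomplete.
const CJK_RANGES: Array<[number, number]> = [
  [0x3040, 0x309f], // Hiragana
  [0x30a0, 0x30ff], // Katakana
  [0x3400, 0x4dbf], // CJK Unified Ideographs Extension A
  [0x4e00, 0x9fff], // CJK Unified Ideographs
  [0xac00, 0xd7af], // Hangul Syllables
  [0x20000, 0x2a6df], // CJK Unified Ideographs Extension B
];

// Rough Latin word characters: ASCII letters and digits plus accented Latin letters.
const WORD_CHAR = /[0-9A-Za-z\u00C0-\u024F]/;

function isCjk(ch: string): boolean {
  const cp = ch.codePointAt(0);
  if (cp === undefined) return false;
  return CJK_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi);
}

// Hypothetical tokenizer: one token per CJK character, one token per Latin word.
function tokenizeForIndex(text: string): string[] {
  const tokens: string[] = [];
  let word = "";
  const flush = (): void => {
    if (word) tokens.push(word.toLowerCase());
    word = "";
  };
  // Array.from iterates by code point, so characters outside the BMP stay whole.
  for (const ch of Array.from(text)) {
    if (isCjk(ch)) {
      flush();
      tokens.push(ch);
    } else if (WORD_CHAR.test(ch)) {
      word += ch;
    } else {
      flush(); // whitespace, punctuation, and anything else separates tokens
    }
  }
  flush();
  return tokens;
}

console.log(tokenizeForIndex("Memex 支持中文 search"));
// prints [ 'memex', '支', '持', '中', '文', 'search' ]
```

With single-character tokens, a query for a multi-character word only matches reliably if the index also checks that the characters are adjacent, so this is probably a stopgap at best.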

Feb 01 '18 00:02 blackforestboi

Hi @oliversauter, this is not enough, as words in Chinese-like languages often consist of several UTF-8 characters. We should look to the field of natural language processing for proven approaches that cover the main non-Latin-based languages.
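One existing approach that might be worth evaluating is the built-in `Intl.Segmenter` API, which does dictionary-based word segmentation for languages such as Chinese and Japanese. The sketch below assumes a runtime that ships it (recent browsers and Node.js) and a TypeScript configuration that includes the ES2022 Intl typings. The helper name `segmentWords` is hypothetical, and the exact word boundaries depend on the runtime's ICU data.

```typescript
// Hypothetical helper: returns only the word-like segments of `text`.
function segmentWords(text: string, locale: string): string[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words: string[] = [];
  for (const part of Array.from(segmenter.segment(text))) {
    // isWordLike is false for whitespace and punctuation segments.
    if (part.isWordLike) words.push(part.segment);
  }
  return words;
}

console.log(segmentWords("hello world", "en"));
console.log(segmentWords("我喜欢自然语言处理", "zh"));
```

The second call should return multi-character words rather than single characters, which is the property the indexer would need.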

Feb 02 '18 13:02 bluesun

In case you need it, I'd suggest adopting this tool for Chinese word segmentation:
https://github.com/yanyiwu/nodejieba

Jan 19 '19 00:01 kehao95

Is this in progress? It seems Chinese is still not searchable.

Jul 29 '20 01:07 mmqmzk
