Better support for non-latin characters

Open blackforestboi opened this issue 7 years ago • 3 comments

As described in this community post, non-Latin characters (e.g. Chinese, Japanese) are not parsed correctly and are therefore not searchable.

Potential solution: detect the language of the website and, if it is non-Latin, split the text into individual characters before indexing?
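A minimal sketch of the character-splitting idea, under stated assumptions: instead of detecting the page language, it checks each character and emits CJK characters as single-character tokens while keeping other words whole. The names `isCjk`, `isSeparator`, and `tokenize` are hypothetical and not part of the Memex codebase, and the code point ranges cover only the most common blocks.

```typescript
// Hypothetical helpers for illustration; not part of the Memex codebase.
// The ranges below are a simplified assumption covering common BMP blocks only.
function isCjk(code: number): boolean {
  return (
    (code >= 0x4e00 && code <= 0x9fff) || // CJK Unified Ideographs
    (code >= 0x3040 && code <= 0x30ff) || // Hiragana and Katakana
    (code >= 0xac00 && code <= 0xd7af) // Hangul syllables
  );
}

function isSeparator(ch: string, code: number): boolean {
  // Whitespace, ASCII punctuation, and the CJK Symbols and Punctuation block.
  return /[\s!-\/:-@\[-`{-~]/.test(ch) || (code >= 0x3000 && code <= 0x303f);
}

function tokenize(text: string): string[] {
  const tokens: string[] = [];
  let word = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    const code = text.charCodeAt(i);
    if (isCjk(code) || isSeparator(ch, code)) {
      // A CJK character or a separator ends the current non-CJK word.
      if (word) {
        tokens.push(word.toLowerCase());
        word = "";
      }
      // Each CJK character becomes its own token; separators are dropped.
      if (isCjk(code)) tokens.push(ch);
    } else {
      word += ch;
    }
  }
  if (word) tokens.push(word.toLowerCase());
  return tokens;
}
```

For example, `tokenize("Memex 支持中文, ok")` yields `["memex", "支", "持", "中", "文", "ok"]`. Single-character tokens make every character findable, but a multi-character query only matches well if the search side also checks that the characters are adjacent, so this is a fallback rather than real word segmentation.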

blackforestboi avatar Feb 01 '18 00:02 blackforestboi

Hi @oliversauter, this is not enough, as words in Chinese-like languages often consist of several UTF-8 characters. We should look to the field of natural language processing for existing, proven approaches that cover the main languages that are not Latin-based.

bluesun avatar Feb 02 '18 13:02 bluesun

In case you need it, I'd suggest adopting this tool for Chinese word segmentation:
https://github.com/yanyiwu/nodejieba

kehao95 avatar Jan 19 '19 00:01 kehao95

Is this in progress? It seems Chinese is still not searchable.

mmqmzk avatar Jul 29 '20 01:07 mmqmzk