Memex
Better support for non-Latin characters
As described in this community post, non-Latin characters (e.g. Chinese, Japanese) are not parsed correctly and are therefore not searchable.
Potential solution: detect the language of the website and, if it is non-Latin, split the text into individual characters before indexing?
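A minimal sketch of the character-splitting idea, for illustration only. It assumes a per-character script check rather than page-level language detection, and `tokenizeForIndex`, `CJK`, and `TOKEN` are hypothetical names, not part of Memex's actual indexing code. Han, Hiragana, Katakana, and Hangul characters each become their own token, while other letters and digits are still grouped into words.

```typescript
// Hypothetical helper, not Memex's real indexing pipeline.
// Character class covering the main CJK scripts (an assumption about which scripts to split).
const CJK = "[\\p{Script=Han}\\p{Script=Hiragana}\\p{Script=Katakana}\\p{Script=Hangul}]";

// A token is either one CJK character, or a run of letters/digits that are not CJK.
const TOKEN = new RegExp(`${CJK}|(?:(?!${CJK})[\\p{L}\\p{N}])+`, "gu");

function tokenizeForIndex(text: string): string[] {
  const matches = text.match(TOKEN);
  // Lowercasing only affects scripts that have case; CJK characters pass through unchanged.
  return matches ? matches.map((t) => t.toLowerCase()) : [];
}

console.log(tokenizeForIndex("Memex 搜索 test"));
// → [ 'memex', '搜', '索', 'test' ]
```

Splitting per character makes each character findable, but it discards word boundaries, so a multi-character query would have to be matched as a sequence of single-character tokens.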
Hi @oliversauter, this is not enough, as words in Chinese-like languages often consist of several UTF-8 characters. We should look to the field of natural language processing for existing, proven approaches that cover the main languages that are not Latin-based.
In case you need it, I'd suggest adopting this tool for Chinese word segmentation:
https://github.com/yanyiwu/nodejieba
Is this in progress? It seems Chinese text is still not searchable.