Some Chinese characters cannot be searched

Open 0x9394 opened this issue 3 years ago • 4 comments

To reproduce, copy some Chinese text into a note, for example from this page: https://m.bjnews.com.cn/detail/154105774414080.html. Searching for 感冒 returns no results, but searching for 订阅 works. Version: docker:latest (0c7b3e7aaec0)

[Screenshot: search for 感冒] [Screenshot: search for 订阅]

0x9394, Oct 19 '22 04:10

I think the problem here is that the search index is currently very targeted to the English language. When text is added to the index, it goes through a number of steps:

- Tokenisation: This breaks the text up into "tokens" (the words to be indexed). Currently flatnotes uses a regex tokeniser (\w+(\.?\w+)*), which is fairly generic but may or may not work well with Chinese text.
- Lower Case Filtering: I can imagine this would be OK with any language.
- Accent Folding: This ensures that "café" is indexed as "cafe". Again, this should work OK with any language.
- Stop Word Removal: This currently filters out common English words that are not useful for searches, e.g. 'for', 'from' and 'have'. This should work OK with non-English languages, but they wouldn't get the benefit this step is designed for.
- Stemming: This process tries to "normalise" related words. For example, a note containing one of "render", "rendered", "renders" or "rendering" could be found by searching for any of those words. flatnotes currently uses the "Porter" stemming algorithm, which is designed to remove suffixes from English words.
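As a rough illustration of the first four steps, here is a minimal sketch using only the Python standard library. It is not flatnotes' actual indexing code: the function names `fold_accents` and `analyse` and the three-word stop list are assumptions made for the example, and stemming is left out because the standard library has no Porter stemmer. The only thing taken from this thread is the tokeniser pattern.

```python
import re
import unicodedata

# The tokeniser pattern quoted above; everything else here is illustrative.
TOKEN_RE = re.compile(r"\w+(\.?\w+)*")
STOP_WORDS = {"for", "from", "have"}  # tiny hypothetical subset


def fold_accents(token: str) -> str:
    """Strip combining marks so that 'café' becomes 'cafe'."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyse(text: str) -> list[str]:
    """Tokenise, lower-case, fold accents, then drop stop words."""
    tokens = (m.group(0) for m in TOKEN_RE.finditer(text))
    tokens = (t.lower() for t in tokens)
    tokens = (fold_accents(t) for t in tokens)
    return [t for t in tokens if t not in STOP_WORDS]


english = analyse("Notes from the Café")
print(english)  # ['notes', 'the', 'cafe']

# Chinese is written without spaces, so \w+ swallows the whole run
# of characters as one token and 感冒 is never indexed on its own.
chinese = analyse("最近感冒的人很多")
print(chinese)  # ['最近感冒的人很多']

# A word that is set off by whitespace does become its own token.
mixed = analyse("订阅 最近感冒的人很多")
print(mixed)  # ['订阅', '最近感冒的人很多']
```

If the index behaves like this sketch, a query for 感冒 would only match when those two characters happen to stand alone between spaces or punctuation. That is one plausible explanation for why 订阅 can be found while 感冒 cannot, but it has not been confirmed against the note in question.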

I'd love to be able to target other languages, but I'd likely need to make some changes to the index and possibly add options to it.

dullage, Oct 31 '22 11:10

I, too, would like other languages to be added. Searching in Korean does not bring up any results.

Luxosity, Jun 01 '23 01:06

Same issue here.

kangfenmao, Nov 11 '23 09:11