cannot search japanese text longer than 2 chars

Open fabiodl opened this issue 6 years ago • 12 comments

When export XAPIAN_CJK_NGRAM=1 is set before indexing and searching (I also tried yes in place of 1, with no observable change), mu find is unable to search for Japanese text longer than 2 characters (mu: no matches for search expression (4)).

For instance, if "このデータを検索できるのかな" is present, "デー" can be found, but "データ" cannot.

Behavior confirmed for the following setup: Ubuntu 18.04.1, mu version 1.3.2, Xapian version 1.4.11.

A quick search online shows that the same happens on a completely different architecture, and probably with different versions: http://gcg00467.xii.jp/wp/archives/1749

On exactly the same maildir, in the same shell (with the same XAPIAN_CJK_NGRAM=1), notmuch correctly indexes and retrieves mails when the query is longer than 2 chars.

fabiodl avatar May 21 '19 23:05 fabiodl

It is a known issue in Xapian. I made a workaround that breaks the CJK strings in queries into bigrams, in my mu4e add-on project mu4e-goodies.

panjie avatar Jun 16 '19 14:06 panjie

thank you panjie, I will have a look at it

fabiodl avatar Jun 17 '19 04:06 fabiodl

Can you provide an email message where this happens (and specifically what to search for, since unfortunately I do not read Japanese)? We could add it as a unit test.

djcb avatar Nov 08 '21 07:11 djcb

Hi djcb! Here is my case. I set XAPIAN_CJK_NGRAM=1; let us assume that I have four mails whose subjects are as follows.

  1. サーバがダウンしました
  2. スポンサーシップ募集
  3. サービス開始について
  4. ショルダーバック

When I want to find 'サーバ', which means 'server' in Japanese, the correct answer should be 1 only.

Now I try:

  a) mu find subject:サーバ -> no matches
  b) mu find subject:サー -> matches 1. 2. and 3.
  c) mu find subject:サ -> no matches
  d) mu find subject:ーバ -> matches 1. and 4.
  e) mu find subject:サー and subject:ーバ -> matches 1. <- BINGO!
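The five results above are what one would expect from an index that stores only 2-grams. The following is a toy Python model of that behaviour, not mu's or Xapian's actual code; the function names and the in-memory index are made up for illustration.

```python
# Toy model only: an index that stores nothing but character 2-grams.
# The names bigrams, build_index and search are hypothetical.

SUBJECTS = {
    1: "サーバがダウンしました",
    2: "スポンサーシップ募集",
    3: "サービス開始について",
    4: "ショルダーバック",
}


def bigrams(text):
    """Return every pair of adjacent characters in text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_index(subjects):
    """Map each 2-gram to the set of mail numbers whose subject contains it."""
    index = {}
    for number, subject in subjects.items():
        for gram in bigrams(subject):
            index.setdefault(gram, set()).add(number)
    return index


def search(index, *terms):
    """AND the terms together; a term that is not an indexed 2-gram matches nothing."""
    hits = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*hits)) if hits else []


INDEX = build_index(SUBJECTS)

if __name__ == "__main__":
    for terms in (["サーバ"], ["サー"], ["サ"], ["ーバ"], ["サー", "ーバ"]):
        print(" and ".join(terms), "->", search(INDEX, *terms) or "no matches")
```

In this model the three-character term 'サーバ' and the single character 'サ' are simply never stored as terms, which would explain why only the 2-gram queries return anything.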

So, if I want to find a Japanese word of 3 or more characters, I must divide the word into 2-grams and then join them with 'and' operators, as follows.

mu find subject:あいうえお -> NG
mu find subject:あい and subject:いう and subject:うえ and subject:えお -> OK... but...
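The manual rewrite above could be automated with a small helper. This is only a sketch in Python, assuming the whole word is CJK text; to_bigram_query is a hypothetical name and is not taken from mu or from mu4e-goodies.

```python
def to_bigram_query(field, word):
    """Rewrite field:word as an 'and' chain of field:2-gram terms.

    Assumption: every character of word is CJK. Words of one or two
    characters are returned unchanged as a single term.
    """
    if len(word) <= 2:
        return f"{field}:{word}"
    grams = [word[i:i + 2] for i in range(len(word) - 1)]
    return " and ".join(f"{field}:{gram}" for gram in grams)


if __name__ == "__main__":
    # Prints the expression to pass to mu find.
    print("mu find", to_bigram_query("subject", "あいうえお"))
```

A real helper would also need to leave non-CJK words alone, which this sketch does not attempt.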

The fundamental solution might be for Xapian to adopt a Japanese morphological analysis tool such as:

MeCab: Yet Another Part-of-Speech and Morphological Analyzer. https://taku910.github.io/mecab/

Welcome to janome’s documentation! (English) — Janome v0.4 documentation (en) https://mocobeta.github.io/janome/en/

But if we had some support for dividing a Japanese word into 2-grams and connecting them with 'and' operators, we would be much happier!

ychubachi avatar Nov 12 '21 14:11 ychubachi

What does notmuch use? Notmuch is able to deal with CJK.

fabiodl avatar Nov 12 '21 14:11 fabiodl

Hi fabiodl! Unfortunately, I have not used notmuch yet. I will check it later.

ychubachi avatar Nov 12 '21 15:11 ychubachi

I sent myself an email and at least for me it works:

$ mu find サーバがダウンしました 
2021-11-12T16:42:37 EET Yoshihide Chubachi <[email protected]> Re: [djcb/mu] cannot search japanese text longer than 2 chars (#1428)
2021-11-12T19:29:45 EET "Dirk-Jan C. Binnema" <[email protected]> サーバがダウンしました

This is with Xapian 1.4.18 (and I'm not even setting XAPIAN_CJK_NGRAM)

djcb avatar Nov 12 '21 17:11 djcb

I'm using:

$ echo $LANG
en_DK.utf8

@ychubachi : are you using a UTF-8 encoding?

djcb avatar Nov 12 '21 17:11 djcb

Hi djcb!

I use UTF-8. But when you do not set XAPIAN_CJK_NGRAM, the situation becomes different.

Because Xapian does not know how to tokenize a Japanese sentence, it indexes the whole sentence, or pieces split at obvious delimiters like '、' and '。'.

The correctly tokenized result would be サーバ/が/ダウン/しまし/た. In that case, please try searching for only the word 'サーバ' or 'ダウン' ('ダウン' means 'down').

ychubachi avatar Nov 13 '21 06:11 ychubachi

I used notmuch and found that Japanese search worked fine when XAPIAN_CJK_NGRAM=1.

I tested the effect of XAPIAN_CJK_NGRAM variable.

$ XAPIAN_CJK_NGRAM= notmuch search subject:サーバ | wc -l
0
$ XAPIAN_CJK_NGRAM=1 notmuch search subject:サーバ | wc -l
537

On the other hand, mu does not seem to be affected by the variable.

$ XAPIAN_CJK_NGRAM= mu find subject:サーバ
error: no matches for search expression
$ XAPIAN_CJK_NGRAM=1 mu find subject:サーバ
error: no matches for search expression

I also tried the simplesearch.rb script from https://xapian.org/docs/bindings/ruby

$ XAPIAN_CJK_NGRAM= ruby simplesearch.rb ~/Maildir/.notmuch/xapian/ サーバ
Parsed query is: Query(サーバ@1)
0 results found.
Matches 1-0:
0
$ XAPIAN_CJK_NGRAM=1 ruby simplesearch.rb ~/Maildir/.notmuch/xapian/ サーバ
Parsed query is: Query((サ@1 AND サー@1 AND ー@1 AND ーバ@1 AND バ@1))
200 results found.
Matches 1-10:
10
1: 100% docid=122280 []
2: 99% docid=135887 []
3: 99% docid=73195 []
4: 99% docid=61144 []
5: 99% docid=61146 []
6: 99% docid=86053 []
7: 99% docid=8456 []
8: 99% docid=44840 []
9: 99% docid=155301 []
10: 99% docid=115241 []

It seems that the Xapian library slices the Japanese word into n-grams and combines them when XAPIAN_CJK_NGRAM=1.
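The term list in the parsed query above can be reproduced with a few lines of Python. This sketch is inferred only from that one output and is not Xapian's implementation; ngram_terms and parsed_query are hypothetical names.

```python
def ngram_terms(word):
    """List each character, followed by the 2-gram it starts, in order."""
    terms = []
    for i, char in enumerate(word):
        terms.append(char)
        if i + 1 < len(word):
            terms.append(word[i:i + 2])
    return terms


def parsed_query(word):
    """Join the terms with AND, imitating the layout of the output shown above."""
    return "(" + " AND ".join(f"{term}@1" for term in ngram_terms(word)) + ")"


if __name__ == "__main__":
    print(parsed_query("サーバ"))
```

For 'サーバ' this yields the same five terms, in the same order, as the query simplesearch.rb printed with XAPIAN_CJK_NGRAM=1.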

ychubachi avatar Nov 14 '21 15:11 ychubachi

Thanks, that clarifies things. I've added some test cases for this; they do not pass yet, but at least they give an automated way to test.

djcb avatar Nov 23 '21 08:11 djcb

Thanks a lot!

ychubachi avatar Nov 24 '21 15:11 ychubachi

Good news: mu 1.11.20 (and a little before) can now use Xapian's NGRAM support for this; see the new --support-ngrams option for mu init, and the test_ngrams unit test.

djcb avatar Sep 16 '23 08:09 djcb