
Chinese search cannot find words in the middle of a sentence.

Open jasoncheng7115 opened this issue 6 years ago • 20 comments

For Bug Reports

  • BookStack Version: v0.20.0

When the word I'm looking for is the first word in the sentence, or there is a space in front of it, the search works. (screenshot: i01)

But if the word is in the middle of a sentence, it cannot be found. (screenshot: i02)

Is this related to how full-text retrieval is performed?

Thanks!

jasoncheng7115 avatar Mar 31 '18 03:03 jasoncheng7115

The same problem in version v0.25.1; I have just tried BookStack...

alexwyl avatar Mar 04 '19 06:03 alexwyl

The same problem in version v0.24.3; I use Docker.

lotustalk avatar Mar 13 '19 09:03 lotustalk

Still the same problem in v0.26.4. Hope it can be solved, thanks.

derky1202 avatar Sep 02 '19 04:09 derky1202

You can wrap the term in quotes for the search, e.g. "成功" (success). Maybe the word segmentation has a bug; hope it gets fixed.

sosize avatar Sep 25 '19 07:09 sosize

Confirmed this issue still exists in v0.27.5. One of my team members is hesitant to adopt BookStack because of this. Would like to see it fixed.

LeonLiuY avatar Nov 07 '19 08:11 LeonLiuY

Hope this issue gets fixed soon.

hlj avatar Dec 11 '19 06:12 hlj

Sorry about this issue. It essentially stems from my unfamiliarity with non-English text.

At the moment BookStack splits page content into terms on certain characters, such as spaces and some punctuation. These terms are stored in the database for indexing, and a normal search then performs a "Starts With" match against them.

As @sosize has mentioned, you can wrap a search in quotes, at which point BookStack will perform a "contains" against the content directly instead of the above "Starts With". This is not the default simply due to performance. ("Starts With" searches can use indexes much more effectively than "Contains").

I'm not really sure how we could utilise the "Starts With" system for such characters. Perhaps the search should default to a "Contains" search if such characters are found in a term?
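A minimal sketch of that fallback idea (hypothetical, not BookStack code; the Unicode ranges and function names are invented for illustration): detect whether a term contains CJK characters and, if so, switch that term to a "contains" pattern.

```python
# Hypothetical sketch: fall back to "contains" matching for terms that
# include CJK characters, since those are not space-delimited into words.
# The ranges below cover common CJK ideographs, Hiragana/Katakana and
# Hangul syllables; they are illustrative, not exhaustive.
CJK_RANGES = [
    (0x4E00, 0x9FFF),   # CJK Unified Ideographs
    (0x3040, 0x30FF),   # Hiragana and Katakana
    (0xAC00, 0xD7AF),   # Hangul syllables
]

def contains_cjk(term: str) -> bool:
    return any(lo <= ord(ch) <= hi for ch in term for lo, hi in CJK_RANGES)

def like_pattern(term: str) -> str:
    # "Contains" for CJK terms, "starts with" otherwise.
    return f"%{term}%" if contains_cjk(term) else f"{term}%"

print(like_pattern("orange"))  # orange%
print(like_pattern("橘子"))     # %橘子%
```

This keeps index-friendly prefix matching for Latin-script terms while only paying the "contains" cost on terms that cannot benefit from prefix matching anyway.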

ssddanbrown avatar Dec 11 '19 22:12 ssddanbrown

@ssddanbrown Can this be made a configuration option, to select "Starts With" or "Contains" as the search type?

Even better would be full-text search.

Or how can the code be quickly modified?

sosize avatar Dec 28 '19 03:12 sosize

Can I replace all the "Starts With" matching with "Contains"? How should I modify the source code? Sorry, I'm a noob.

lishuai199502 avatar Apr 02 '20 02:04 lishuai199502

Hi all, I fixed this problem in v0.28.3 by adding a '%' in SearchService.php. In detail: in \app\Entities\SearchService.php, around line 196, change

    $query->orWhere('term', 'like', $inputTerm . '%');

to

    $query->orWhere('term', 'like', '%' . $inputTerm . '%');

Just try it.

lishuai199502 avatar Apr 02 '20 16:04 lishuai199502

@ssddanbrown Hi, can the above fix be merged into the source? After modifying SearchService.php I can now search both Chinese and English in the text body.

0x9394 avatar Aug 18 '20 03:08 0x9394

(I'm Korean and have the same problem.) I know this issue is closed, but I'll post some info in the hope it helps others in the future. My BookStack version: v22.07.03

In \app\Entities\Tools\SearchRunner.php, around lines 222 and 281:

※ To find terms mid-sentence, change

    $query->orWhere('term', 'like', $inputTerm . '%');

to

    $query->orWhere('term', 'like', '%' . $inputTerm . '%');

※ To sort correctly, change

    $termQuery->orWhere('term', 'like', $term . '%');

to

    $termQuery->orWhere('term', 'like', '%' . $term . '%');

chimin-roh avatar Aug 19 '22 14:08 chimin-roh

Nice job, thanks!

derky1202 avatar Sep 17 '22 02:09 derky1202

I've made a PR to make this configurable via .env:

ENHANCE_SEARCH_BAR_COMPATIBILITY=false

Hope I'm doing it the right way.

#4393

charlietag avatar Jul 23 '23 04:07 charlietag

For me to properly look at addressing this, it would be useful if people could help me a little in understanding how the languages in question work. Apologies for my naivety on the subject.

  • In the Chinese language, does a single Chinese character generally map to what would be a single word in Latin-based languages?
  • Is a single Chinese character generally the common unit for what would be searched?
  • How would multiple terms be joined in a single query? For example, if I made the search query orange cat in English, would the equivalent Chinese search query contain a space?
  • How does the above apply to other languages in Asia, such as Korean and Japanese?

ssddanbrown avatar Jul 23 '23 10:07 ssddanbrown

Hi @ssddanbrown, thanks for helping to solve this for non-English languages.

I hope the following will help you understand what I'm trying to solve.

Assume a scenario like this:

Pages

My cat likes to eat orange.
But I want him to drink juice

In chinese, it would be

我的貓喜歡吃橘子
但是我要他喝果汁

Database table (search_terms)

In normal search mode, the query is designed as "starts with", because each value in the term column stores only a single word, so this works fine in English:

My          | page
cat         | page
likes       | page
to          | page
eat         | page
orange      | page
But         | page
I           | page
want        | page
him         | page
to          | page
drink       | page
juice       | page

In Chinese, it would be stored in search_terms like this; as you can see, the term column stores multiple words in one value:

我的貓喜歡吃橘子 | page
但是我要他喝果汁 | page
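
The splitting behaviour described above can be sketched roughly like this (illustrative only, not BookStack's exact implementation; the function name is invented): splitting on non-word characters breaks space-delimited English into words, while an unspaced Chinese sentence stays as one long term.

```python
import re

# Rough sketch of space/punctuation-based term splitting: split on any
# run of non-word characters. Space-delimited languages split into
# words; an unspaced Chinese sentence survives as a single term.
def split_terms(text: str) -> list[str]:
    return [t for t in re.split(r"[^\w]+", text) if t]

print(split_terms("My cat likes to eat orange."))
# ['My', 'cat', 'likes', 'to', 'eat', 'orange']
print(split_terms("我的貓喜歡吃橘子"))
# ['我的貓喜歡吃橘子']
```

This is exactly why a "starts with" lookup on the stored terms can never find 橘子 in the middle of the Chinese sentence: the indexed term begins with 我, not 橘.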

English vs Chinese

My       <---> 我的
cat      <---> 貓
likes to <---> 喜歡
eat      <---> 吃
orange   <---> 橘子
But      <---> 但是
I        <---> 我
want     <---> 要
him      <---> 他
to drink <---> 喝
juice    <---> 果汁

What we actually prefer

But I'm not sure whether this is a good design at the indexing level.

我  | page
的  | page
貓  | page
喜  | page
歡  | page
吃  | page
橘  | page
子  | page
但  | page
是  | page
我  | page
要  | page
他  | page
喝  | page
果  | page
汁  | page

Re-design

I'm not well-versed in indexing, and I have a question: why not just search the pages table using like '%term%' and let the database deal with the indexing?

charlietag avatar Jul 23 '23 13:07 charlietag

Normal search

So if we search orange cat, in Chinese, it would be 橘子 貓.

And since Table "search_terms" contains nothing like 橘子 貓, I will get nothing.

And the following searches will fail:

  • English (fails) - my users like to copy-paste text to search...

    • range
    • at
  • Chinese (fails)

What I hope it would be

I hope searches like the failed ones above can work.

Exact search

I can use an exact search to achieve the above.

  • English (success)

    • "range"
    • "at"
  • Chinese (success)

    • "橘"
    • "貓"

But general users will not remember to add quotes (") when searching.

charlietag avatar Jul 23 '23 13:07 charlietag

Thanks for the info @charlietag.

I have a question that why not just search from pages table using like '%term%'. And let database deal with index thing?

The database won't use indexes for queries like that. The search index is specifically built so that prefix-based matching can be performed while making use of database indexes. Additionally, "contains" matching, in the context of how things are currently built, would significantly increase accidental matches of partially included terms, and therefore impact the scoring. Databases do often have fulltext indexes for "contains" search (which BookStack used to use), but those have their own complications, and there's a reason we moved away from them.
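
The accidental-match point can be demonstrated with a minimal sketch (using an in-memory SQLite table invented for illustration, not BookStack's actual MySQL schema):

```python
import sqlite3

# Miniature stand-in for a search_terms table, to show how "contains"
# matching picks up unintended partial hits that "starts with" avoids.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_terms (term TEXT, page_id INTEGER)")
conn.executemany(
    "INSERT INTO search_terms VALUES (?, ?)",
    [("orange", 1), ("cat", 1), ("strange", 2), ("category", 3)],
)

def matches(pattern):
    rows = conn.execute(
        "SELECT DISTINCT page_id FROM search_terms WHERE term LIKE ?",
        (pattern,),
    ).fetchall()
    return sorted(r[0] for r in rows)

# Prefix ("starts with") match: 'cat%' hits "cat" and "category".
print(matches("cat%"))     # [1, 3]
# Contains match: '%range%' accidentally hits "orange" and "strange".
print(matches("%range%"))  # [1, 2]
```

Beyond the extra matches, a leading-wildcard pattern also defeats ordinary B-tree index lookups, which is the performance concern described above.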

My intention has been to alter how we split the terms for indexing and search, for different character ranges, much like you've suggested, but I just want to better understand how searches and words translate in different languages, hence my last comment.


I would still like to invite others, particularly those using other Asian languages, to answer my previous comment.

ssddanbrown avatar Jul 23 '23 15:07 ssddanbrown

For me to properly look at addressing this, it would be useful if people could help me a little in understanding how the languages in question work. Apologies for my naivety on the subject.

I'm not a language expert. So this answer may not be entirely accurate.

  • In the Chinese language, does a single Chinese character generally map to what is a single word in latin based languages?
In modern Chinese, most words are written with two or more characters.
https://en.wikipedia.org/wiki/Chinese_characters

But there are also some cases where a single character maps to a single Latin word.

i <--> 我
my <--> 我的
myself <--> 我自己 or 我本人 or 本人 or 独自
dog <--> 狗
cloud <-->  云
car <-->  车
  • Is a single Chinese character generally the common unit for what would be searched?

A search for a single Chinese character usually does not return useful results, but sometimes people still search for one, such as 猫 ("cat").

Here are some searches recorded by google analytics on my website:

美好的每一天  <--> wonderful everyday(a video game title)
官网  <--> official website
宣传片  <--> promo video
巨构  <--> megastructure
指令  <--> command
文化  <--> culture
新用户  <--> new user
服务器  <--> server
添加  <--> add
猫  <--> cat
个人利益  <--> personal benefit
公共事件  <--> public event
雨  <--> rain
  • How would multiple terms be joined in a single query? For example, If I made the search query for orange cat in English, would the equivalent Chinese search query contain a space?

Words are not separated by spaces in Chinese, Japanese or Korean; unlike most languages, Chinese does not use spaces to split characters into words.

When searching in Chinese, you would not use spaces to separate terms in a query. Instead, you enter the characters for each term next to each other without spaces.

So usually search engines use a tokenizer to break a sentence into words:

"人人生而自由,在尊严和权利上一律平等"
"人人", "生而", "自由", ",", "在", "尊严", "和", "权利", "上", "一律", "平等"
("all human beings", "born", "free", ",", "in", "dignity", "and", "rights", "on", "all", "equal")

"All human beings are born free and equal in dignity and rights"
"All human beings", "are born", "free", "and", "equal", "in", "dignity", "and", "rights"

In the example of orange cat, it could be 橘猫 or 橘色猫 or 橘色的猫 (an orange-colored cat).

orange  <--> 橘子(mandarin orange) or 橙子 or 橙色(orange color)
cat  <-->  猫
methoxymethane   <-->  二甲醚 or 甲氧基甲烷

two <--> 二
methyl ether <--> 甲醚

methoxy <--> 甲氧基
methane <--> 甲烷

oxy <--> 氧基
alkyl <--> 烷
`甲` can mean a shell or armor, the external protective layer of an animal or a person. In this case, it can be translated as shell or armor.

`甲` can mean the first of the ten heavenly stems, which is the first symbol in the cycle of ten celestial stems. In this case, it can be translated as the first of the ten heavenly stems or simply A.

`甲` can mean the first party in a list or a contract, i.e. the one that comes first. In this case, it can be translated as first (in a list, as a party in a contract, etc.).

So there seems to be no easy way to segment words.

To be honest, it is very difficult to search Sino-Tibetan languages well, so many applications I have seen choose Elasticsearch as their search engine.

Even in Elasticsearch, many people are not satisfied with the official tokenizer, and many alternative tokenizers have been created:

  • https://github.com/medcl/elasticsearch-analysis-ik
  • https://github.com/medcl/elasticsearch-analysis-stconvert
  • https://github.com/medcl/elasticsearch-analysis-pinyin
  • https://github.com/KennFalcon/elasticsearch-analysis-hanlp
  • https://github.com/elastic/elasticsearch-analysis-smartcn

Update: This may be the solution you want. Jieba is a popular (32.7K-star) Chinese word segmentation component, and these are its PHP ports:

  • https://github.com/fukuball/jieba-php
  • https://github.com/cyd622/nlp-jieba

But jieba seems to consume quite a bit of memory; this module is more lightweight:

  • https://github.com/hightman/scws
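
As a lighter-weight alternative to a dictionary-based segmenter like jieba, many search engines index CJK text as overlapping character bigrams. A rough stdlib-only sketch (my own illustration, not code from any of the libraries above):

```python
import re

# Rough sketch of CJK bigram tokenization, a dictionary-free fallback:
# index every overlapping pair of ideographs, so any two-character query
# term can be matched exactly against an indexed term.
CJK = re.compile(r"[\u4e00-\u9fff]+")

def bigrams(text: str) -> list[str]:
    terms = []
    for run in CJK.findall(text):
        if len(run) == 1:
            terms.append(run)
        else:
            terms.extend(run[i:i + 2] for i in range(len(run) - 1))
    return terms

print(bigrams("我的貓喜歡吃橘子"))
# ['我的', '的貓', '貓喜', '喜歡', '歡吃', '吃橘', '橘子']
```

A query term like 橘子 then matches an indexed bigram exactly, so a prefix-style lookup keeps working; the trade-offs are a larger index and some false positives across word boundaries.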

10935336 avatar Jul 24 '23 07:07 10935336

I also couldn't search Chinese words successfully (English keywords are OK). I have no experience in this area; my guess is that it could be improved with something like an Asian language parser:

https://docs-develop.pleroma.social/backend/configuration/howto_search_cjk/

https://pgroonga.github.io/

matteotw avatar Apr 23 '24 10:04 matteotw

Version: v24.02.2. I think I solved the problem by modifying the code on line 213 of /var/www/BookStack/app/Search/SearchRunner.php. Before modification:

   210	        $subQuery->where(function (Builder $query) use ($terms) {
   211	            foreach ($terms as $inputTerm) {
   212	                $inputTerm = str_replace('\\', '\\\\', $inputTerm);
   213	                $query->orWhere('term', 'like', $inputTerm . '%');
   214	            }
   215	        });

Only one result... (screenshot)

After modification:

   210	        $subQuery->where(function (Builder $query) use ($terms) {
   211	            foreach ($terms as $inputTerm) {
   212	                $inputTerm = str_replace('\\', '\\\\', $inputTerm);
   213	                $query->orWhere('term', 'like', '%' . $inputTerm . '%');
   214	            }
   215	        });

Seven results! (screenshot)

kernelry avatar Jul 01 '24 02:07 kernelry