manticoresearch
The highlight result in Chinese contains a lot of unnecessary empty spaces
The highlight result returned when searching a Chinese document contains many extra spaces that are not in the original text.
Reproduce:
- create the table and insert a document:
create table t(f text) charset_table='chinese' morphology='icu_chinese'; insert into t(f) values('最后,也许你是一名DevOps,其他部门一直尽可能快的往 Elasticsearch 里面灌数据,而你是那个负责防止 Elasticsearch 服务器起火的消防员。只要用户在规则内行事,Elasticsearch集群扩容相当轻松。不过你需要知道如何在进入生产环境前搭建一个稳定的集群,还能 要在凌晨三点钟能识别出警告信号,以防止灾难发生。前面几章你可能不太感兴趣,但这本书的最后一部分是非常重要的,包含所有你需要知道的用以避免系统崩溃的知识。');
- http request:
{
  "index": "t",
  "query": {
    "query_string": "警告"
  },
  "highlight": {
    "limit": 50,
    "limit_snippets": 5
  }
}
- http response:
{
  "took": 0,
  "timed_out": false,
  "hits": {
    "total": 1,
    "total_relation": "eq",
    "hits": [
      {
        "_id": "3460511486852464646",
        "_score": 1500,
        "_source": {
          "f": "最后,也许你是一名DevOps,其他部门一直尽可能快的往 Elasticsearch 里面灌数据,而你是那个负责防止 Elasticsearch 服务器起火的消防员。只要用户在规则内行事,Elasticsearch集群扩容相当轻松。不过你需要知道如何在进入生产环境前搭建一个稳定的集群,还能要在凌晨三点钟能识别出警告信号,以防止灾难发生。前面几章你可能不太感兴趣,但这本书的最后一部分是非常重要的,包含所有你需要知道的用以避免系统崩溃的知识。"
        },
        "highlight": {
          "f": [
            " 凌晨 三点钟 能 识别 出 <b>警告</b> 信号 , 以 防止 灾难 发生"
          ]
        }
      }
    ]
  }
}
The original text is 凌晨三点钟能识别出警告信号,以防止灾难发生, but the highlight result adds white space between the words: 凌晨 三点钟 能 识别 出 <b>警告</b> 信号 , 以 防止 灾难 发生
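To make the discrepancy concrete, here is a small Python check (a sketch, not part of Manticore; the helper name `strip_markup_and_spaces` is made up for illustration). It assumes the default `<b>`/`</b>` markers and shows that the returned snippet matches the original passage once the markers and all white space are removed, so the inserted spaces are the only difference.

```python
import re


def strip_markup_and_spaces(snippet: str) -> str:
    """Remove the <b>/</b> highlight markers and every whitespace character."""
    no_tags = re.sub(r"</?b>", "", snippet)
    return re.sub(r"\s+", "", no_tags)


# Passage from the stored document and the snippet from the HTTP response above.
original = "凌晨三点钟能识别出警告信号,以防止灾难发生"
snippet = " 凌晨 三点钟 能 识别 出 <b>警告</b> 信号 , 以 防止 灾难 发生"

# True when the snippet differs from the original only by markers and spaces.
differs_only_by_spaces = strip_markup_and_spaces(snippet) == original
```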
Simpler MRE
mysql> drop table if exists t; create table t(f text) charset_table='chinese' morphology='icu_chinese'; insert into t(f) values('最后,也许'); select highlight() from t where match('最后');
--------------
drop table if exists t
--------------
Query OK, 0 rows affected (0.48 sec)
--------------
create table t(f text) charset_table='chinese' morphology='icu_chinese'
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
insert into t(f) values('最后,也许')
--------------
Query OK, 1 row affected (0.00 sec)
--------------
select highlight() from t where match('最后')
--------------
+--------------------------+
| highlight()              |
+--------------------------+
| <b>最后</b> , 也许 |
+--------------------------+
1 row in set (0.01 sec)
The expected result is:
| <b>最后</b>,也许 |
which would match the behavior for space-separated text, e.g. with:
select f, highlight() from t where match('ab')
--------------
+--------+---------------+
| f      | highlight()   |
+--------+---------------+
| ab, cd | <b>ab</b>, cd |
+--------+---------------+
1 row in set (0.00 sec)
It seems that for CJK text highlight() returns the segmented words joined with spaces, not the original text.
Hi, this bug still exists in the latest version. As a blind guess, I suspect the PassageExtractor_c::AddSpaces function in the snippetfunctor.cpp file (this might be wrong). Or could the spaces be added by some snippet/highlight function that fetches all the tokens from the store and then combines them? For non-space-separated languages, this bug essentially makes the highlight feature impossible to use (adding the space back would introduce unnecessary ones). Thank you for your effort in making such a great FTS database. I hope this bug gets fixed for CJK speakers.
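Until the server side is fixed, one possible client-side stopgap is to map the snippet back onto the stored text and reuse the original spacing. The Python sketch below is only an illustration under several assumptions: the original field value is available (for example from `_source`), the default `<b>`/`</b>` markers are used, the snippet is a single passage without separators, and it differs from the stored text only in white space. The function name `realign` is hypothetical and not part of any Manticore API.

```python
import re

TAG = re.compile(r"</?b>")


def realign(original: str, snippet: str) -> str:
    """Rebuild a highlighted snippet using the spacing of the original text."""
    compact = []  # non-whitespace characters of the snippet, markers removed
    tags = []     # (position in compact, marker) for every <b> or </b>
    for part in re.split(r"(</?b>)", snippet):
        if TAG.fullmatch(part):
            tags.append((len(compact), part))
        else:
            compact.extend(ch for ch in part if not ch.isspace())

    # Indices of the non-whitespace characters of the original text.
    orig_idx = [i for i, ch in enumerate(original) if not ch.isspace()]
    orig_compact = "".join(original[i] for i in orig_idx)
    start = orig_compact.find("".join(compact))
    if not compact or start < 0:
        return snippet  # cannot align, so return the snippet unchanged

    first = orig_idx[start]
    last = orig_idx[start + len(compact) - 1]

    # Opening markers go right before the character they precede,
    # closing markers right after the character they follow.
    inserts = {}
    for pos, tag in tags:
        if tag == "<b>":
            at = orig_idx[start + pos] if pos < len(compact) else last + 1
        else:
            at = orig_idx[start + pos - 1] + 1 if pos > 0 else first
        inserts.setdefault(at, []).append(tag)

    out = []
    for i in range(first, last + 1):
        out.extend(inserts.get(i, []))
        out.append(original[i])
    out.extend(inserts.get(last + 1, []))
    return "".join(out)


fixed_cjk = realign("最后,也许", "<b>最后</b> , 也许")
fixed_latin = realign("ab, cd", "<b>ab</b>, cd")
```

With the two examples from this issue, `fixed_cjk` is `<b>最后</b>,也许` and `fixed_latin` stays `<b>ab</b>, cd`. If the same passage occurs more than once in the document, the sketch aligns against the first occurrence, which yields the same characters but may not be the passage the server chose.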