The highlight result in Chinese contains a lot of unnecessary empty spaces

Open washanhanzi opened this issue 2 years ago • 1 comments

The highlight result returned when searching a Chinese document contains many extra spaces.

Reproduce:

  1. create table t(f text) charset_table='chinese' morphology='icu_chinese'; insert into t(f) values('最后,也许你是一名DevOps,其他部门一直尽可能快的往 Elasticsearch 里面灌数据,而你是那个负责防止 Elasticsearch 服务器起火的消防员。只要用户在规则内行事,Elasticsearch集群扩容相当轻松。不过你需要知道如何在进入生产环境前搭建一个稳定的集群,还能 要在凌晨三点钟能识别出警告信号,以防止灾难发生。前面几章你可能不太感兴趣,但这本书的最后一部分是非常重要的,包含所有你需要知道的用以避免系统崩溃的知识。');
  2. HTTP request:
{
	"index": "t",
	"query": {
		"query_string": "警告"
	},
	"highlight": {
		"limit": 50,
		"limit_snippets":5
	}
}
  3. HTTP response:
{
	"took": 0,
	"timed_out": false,
	"hits": {
		"total": 1,
		"total_relation": "eq",
		"hits": [
			{
				"_id": "3460511486852464646",
				"_score": 1500,
				"_source": {
					"f": "最后,也许你是一名DevOps,其他部门一直尽可能快的往 Elasticsearch 里面灌数据,而你是那个负责防止 Elasticsearch 服务器起火的消防员。只要用户在规则内行事,Elasticsearch集群扩容相当轻松。不过你需要知道如何在进入生产环境前搭建一个稳定的集群,还能要在凌晨三点钟能识别出警告信号,以防止灾难发生。前面几章你可能不太感兴趣,但这本书的最后一部分是非常重要的,包含所有你需要知道的用以避免系统崩溃的知识。"
				},
				"highlight": {
					"f": [
						" 凌晨 三点钟 能 识别 出 <b>警告</b> 信号 , 以 防止 灾难 发生"
					]
				}
			}
		]
	}
}

The original text is 凌晨三点钟能识别出警告信号,以防止灾难发生, but the highlight result adds white space between the words: 凌晨 三点钟 能 识别 出 <b>警告</b> 信号 , 以 防止 灾难 发生

washanhanzi avatar Aug 18 '22 05:08 washanhanzi

Simpler MRE

mysql> drop table if exists t; create table t(f text) charset_table='chinese' morphology='icu_chinese'; insert into t(f) values('最后,也许'); select highlight() from t where match('最后');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.48 sec)

--------------
create table t(f text) charset_table='chinese' morphology='icu_chinese'
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t(f) values('最后,也许')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
select highlight() from t where match('最后')
--------------

+--------------------------+
| highlight()              |
+--------------------------+
| <b>最后</b> , 也许      |
+--------------------------+
1 row in set (0.01 sec)

The expected result is:

| <b>最后</b>,也许 |

which is how it already behaves with space-separated text, e.g.:

select f, highlight() from t where match('ab')
--------------

+--------+---------------+
| f      | highlight()   |
+--------+---------------+
| ab, cd | <b>ab</b>, cd |
+--------+---------------+
1 row in set (0.00 sec)
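
The expectation can also be phrased as a check: once the highlight tags are removed, the snippet should appear verbatim in the stored text. Below is a minimal Python sketch of that check. The helper name `snippet_matches_source` is hypothetical, and it assumes a single passage wrapped in the default `<b>`/`</b>` tags with no passage separator.

```python
def snippet_matches_source(snippet: str, source: str,
                           tags: tuple = ("<b>", "</b>")) -> bool:
    """Return True if the snippet, minus highlight tags, occurs verbatim in source."""
    plain = snippet
    for tag in tags:
        # Drop the highlight markup, keep everything else untouched
        plain = plain.replace(tag, "")
    # Leading/trailing whitespace is ignored; inner whitespace must match exactly
    return plain.strip() in source


# The space-separated case passes, the Chinese case from the MRE does not
print(snippet_matches_source("<b>ab</b>, cd", "ab, cd"))
print(snippet_matches_source("<b>最后</b> , 也许", "最后,也许"))
```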

sanikolaev avatar Aug 18 '22 06:08 sanikolaev

It seems that for CJK languages highlight() returns the segmented words joined by spaces rather than the original text.

axhiao avatar Nov 21 '22 06:11 axhiao

Hi, this bug still exists in the latest version. As a blind guess, I suspect the PassageExtractor_c::AddSpaces function in snippetfunctor.cpp (this might be wrong). Or could the spaces be added by some snippet/highlight function that fetches all the tokens from the store and then joins them? For languages that are not space-separated, this bug essentially makes the highlight feature impossible to use, because every added space is an unnecessary one. Thank you for your effort in making such a great FTS database; I hope this bug gets fixed for CJK speakers.
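
Until this is fixed, one possible client-side workaround is to map the snippet back onto the stored text while ignoring whitespace. The Python sketch below is not Manticore code. The function `realign_highlight` is hypothetical, and it assumes a single passage, the default `<b>`/`</b>` tags, and that the client has the original field value at hand.

```python
def realign_highlight(snippet: str, source: str,
                      open_tag: str = "<b>", close_tag: str = "</b>") -> str:
    """Re-apply the snippet's highlight tags to the original spacing of source."""
    # Collect the snippet's non-whitespace characters and whether each is highlighted
    chars, flags = [], []
    inside, i = False, 0
    while i < len(snippet):
        if snippet.startswith(open_tag, i):
            inside, i = True, i + len(open_tag)
        elif snippet.startswith(close_tag, i):
            inside, i = False, i + len(close_tag)
        else:
            if not snippet[i].isspace():
                chars.append(snippet[i])
                flags.append(inside)
            i += 1

    # Whitespace-free view of the source, remembering each character's position
    positions = [p for p, c in enumerate(source) if not c.isspace()]
    compact = "".join(source[p] for p in positions)
    start = compact.find("".join(chars))
    if not chars or start < 0:
        return snippet  # cannot align (e.g. several passages); leave unchanged

    flag_at = {positions[start + k]: flags[k] for k in range(len(chars))}
    first, last = positions[start], positions[start + len(chars) - 1]

    out, inside = [], False
    for p in range(first, last + 1):
        want = flag_at.get(p, False)  # original whitespace is never highlighted
        if want and not inside:
            out.append(open_tag)
        elif inside and not want:
            out.append(close_tag)
        inside = want
        out.append(source[p])
    if inside:
        out.append(close_tag)
    return "".join(out)


print(realign_highlight("<b>最后</b> , 也许", "最后,也许"))
```

Snippets that contain several passages or a passage separator will not align as a whole, and the sketch then returns them unchanged.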

jackyu1996 avatar May 25 '23 09:05 jackyu1996