
Highlighting proximity terms in html fails when html tags are between words to highlight

Open regstuff opened this issue 1 year ago • 1 comments

Describe the bug: A proximity search with the NEAR operator returns the correct results, but highlighting with html_strip_mode = retain can fail. The highlighter seems to count HTML tags as words, so the highlighter's own proximity test fails even though the document matched.

[MRE] The Python code below creates two docs with the same text, one with HTML tags between the two words and one without. Two searches are run, one with NEAR/2 and the other with NEAR/5. NEAR/2 fails to highlight the second doc because of the 2 extra HTML tags between the words. NEAR/5 works.

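import manticoresearch

# Assumption: Manticore's HTTP interface listens on the default local port 9308
# and the 'products' table already exists.
config = manticoresearch.Configuration(host="http://127.0.0.1:9308")
client = manticoresearch.ApiClient(config)
indexApi = manticoresearch.IndexApi(client)
utilsApi = manticoresearch.UtilsApi(client)

ixname = 'products'
# Doc 1: no tags between the two search terms.
row = {'title': '<p>sentence and word.</p>', 'contentid': '1'}
resp = indexApi.insert({"index" : ixname, "doc" : row})

# Doc 2: the closing </i></b> tags sit between "sentence" and "word".
row = {'title': '<p><b><i>sentence</i></b> and word.</p>', 'contentid': '2'}
resp = indexApi.insert({"index" : ixname, "doc" : row})
print(resp)

# Same HIGHLIGHT() options in both queries; only the NEAR distance differs.
resp = utilsApi.sql("SELECT *, HIGHLIGHT({before_match='[match]', after_match='[/match]', limit=0, html_strip_mode='retain'}, '') FROM products WHERE MATCH('sentence NEAR/2 word')")
print(resp)
resp = utilsApi.sql("SELECT *, HIGHLIGHT({before_match='[match]', after_match='[/match]', limit=0, html_strip_mode='retain'}, '') FROM products WHERE MATCH('sentence NEAR/5 word')")
print(resp)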
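
To make the suspected cause concrete, here is a toy model in plain Python. It is only a sketch of the hypothesis above, not Manticore's actual highlighter code: the tokenizer, the `positions` and `near` helpers, and the rule "NEAR/N matches when the position difference is at most N" are all simplifying assumptions made for illustration.

```python
import re

# Hypothetical tokenizer: a token is either an HTML tag or a run of word characters.
TOKEN_RE = re.compile(r"<[^>]+>|\w+")

def positions(text, count_tags):
    """Return {word: 1-based position of its first occurrence}.

    Tags consume a position only when count_tags is True, which models
    the suspected highlighter behaviour.
    """
    pos = 0
    result = {}
    for tok in TOKEN_RE.findall(text):
        is_tag = tok.startswith("<")
        if is_tag and not count_tags:
            continue  # tag is skipped and does not shift later words
        pos += 1
        if not is_tag:
            result.setdefault(tok.lower(), pos)
    return result

def near(text, a, b, n, count_tags):
    """Assumed NEAR/N rule: both words present and at most n positions apart."""
    p = positions(text, count_tags)
    return a in p and b in p and abs(p[a] - p[b]) <= n

DOC_PLAIN = "<p>sentence and word.</p>"
DOC_TAGGED = "<p><b><i>sentence</i></b> and word.</p>"

for n in (2, 5):
    for count_tags in (False, True):
        hit = near(DOC_TAGGED, "sentence", "word", n, count_tags)
        print(f"NEAR/{n} count_tags={count_tags}: {hit}")
```

Under these assumptions the tagged doc satisfies NEAR/2 only when tags are skipped. Once the two closing tags are counted, the distance grows from 2 to 4, which would explain why NEAR/5 highlights and NEAR/2 does not.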

regstuff avatar Sep 08 '22 12:09 regstuff

MRE in SQL format

mysql> drop table if exists t; create table t(f text) html_strip='1'; insert into t values(1, '<p>sentence and word.</p>'),(2,'<p><b>sentence</b> and word.</p>'); select * from t; select highlight({html_strip_mode=retain,limit=0,before_match='|',after_match='|'}) from t where match('sentence NEAR/2 word'); select highlight({html_strip_mode=retain,limit=0,before_match='|',after_match='|'}) from t where match('sentence NEAR/3 word');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
create table t(f text) html_strip='1'
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t values(1, '<p>sentence and word.</p>'),(2,'<p><b>sentence</b> and word.</p>')
--------------

Query OK, 2 rows affected (0.00 sec)

--------------
select * from t
--------------

+------+----------------------------------+
| id   | f                                |
+------+----------------------------------+
|    1 | <p>sentence and word.</p>        |
|    2 | <p><b>sentence</b> and word.</p> |
+------+----------------------------------+
2 rows in set (0.00 sec)

--------------
select highlight({html_strip_mode=retain,limit=0,before_match='|',after_match='|'}) from t where match('sentence NEAR/2 word')
--------------

+------------------------------------------------------------------------------+
| highlight({html_strip_mode=retain,limit=0,before_match='|',after_match='|'}) |
+------------------------------------------------------------------------------+
| <p>|sentence| and |word|.</p>                                                |
| <p><b>sentence</b> and word.</p>                                             |
+------------------------------------------------------------------------------+
2 rows in set (0.01 sec)

--------------
select highlight({html_strip_mode=retain,limit=0,before_match='|',after_match='|'}) from t where match('sentence NEAR/3 word')
--------------

+------------------------------------------------------------------------------+
| highlight({html_strip_mode=retain,limit=0,before_match='|',after_match='|'}) |
+------------------------------------------------------------------------------+
| <p>|sentence| and |word|.</p>                                                |
| <p><b>|sentence|</b> and |word|.</p>                                         |
+------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

Expected:

select highlight({html_strip_mode=retain,limit=0,before_match='|',after_match='|'}) from t where match('sentence NEAR/2 word')

+------------------------------------------------------------------------------+
| highlight({html_strip_mode=retain,limit=0,before_match='|',after_match='|'}) |
+------------------------------------------------------------------------------+
| <p>|sentence| and |word|.</p>                                                |
| <p><b>|sentence|</b> and |word|.</p>                                         |
+------------------------------------------------------------------------------+
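
The expected rows above could be turned into a simple check. The sketch below is a hypothetical helper (`highlighted_terms` is not part of any Manticore client) that pulls the marked terms out of a HIGHLIGHT() result string, assuming the `|` markers used in this MRE and that the marker character does not otherwise occur in the text.

```python
import re

def highlighted_terms(snippet, before="|", after="|"):
    """Return the substrings wrapped in before/after markers, in order."""
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    return re.findall(pattern, snippet)

# Row 2 as currently returned for NEAR/2, and as expected.
actual = "<p><b>sentence</b> and word.</p>"
expected = "<p><b>|sentence|</b> and |word|.</p>"

print(highlighted_terms(actual))
print(highlighted_terms(expected))
```

Comparing the two lists for each returned row gives a quick pass/fail signal for the NEAR/2 case without diffing the full HTML.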

sanikolaev avatar Sep 14 '22 07:09 sanikolaev