With fulltext search query across multiple fields but highlight on only one field, highlighting fails

Open regstuff opened this issue 1 year ago • 5 comments

Describe the bug

I have an RT index in plain mode with 3 full-text fields. When my search query includes terms that match in more than one of those fields, but highlighting is set on only one of them, highlighting fails. I am using the Python API client. The same request sent with curl also fails.

MRE

The attached test.csv contains a sample document.

The Python code below runs a search twice after inserting this document. The search query includes terms that match in the content and contentid fields. The first search requests highlights across both fields and works as expected. The second search requests highlights in the content field only, and highlighting fails even for this field.

Not sure if this is a bug or intentional. If it is intentional, I think it is a bit counter-intuitive. A person would expect whatever matches in the fields specified for highlighting to be highlighted.

import csv
import manticoresearch
import requests  # needed for requests.session() further down

config = manticoresearch.Configuration(
    host = "http://127.0.0.1:9308"
)

client =  manticoresearch.ApiClient(config)
indexApi = manticoresearch.IndexApi(client)
searchApi = manticoresearch.SearchApi(client)
utilsApi = manticoresearch.UtilsApi(client)

with open('test.csv', 'r', encoding='utf-8', errors='ignore') as f:
    reader = csv.DictReader(f)
    data = list(reader)

print(data)

for row in data:
    resp = indexApi.insert({"index": "products", "doc": row})
    print(resp)

queries = ['clarity replace engm050323']

url = 'http://localhost:9308/search'
session = requests.session()

# Highlight options shared by both searches.
highlight_opts = {"before_match": '<span class="match">', "after_match": '</span>', "limit": 0, "encoder": "default"}

for squery in queries:
    print(squery)
    # Search 1: highlight both fields that the query matches (works as expected).
    resp = searchApi.search({"index": "products", "query": {"query_string": squery}, "highlight": {"fields": ["content", "contentid"], **highlight_opts}})
    print(resp)
    # Search 2: highlight only the content field (highlighting fails even for this field).
    resp = searchApi.search({"index": "products", "query": {"query_string": squery}, "highlight": {"fields": ["content"], **highlight_opts}})
    print(resp)

regstuff, Aug 23 '22 11:08

Just testing this out, recoded this as SQL, which I find easier for testing

RT>CREATE TABLE products (title text, content text, contentid text);
Query OK, 0 rows affected (0.006 sec)

RT>INSERT INTO products (title,content,contentid) VALUES ('Whether you will live joyfully or not should not be subject to anything because this is an inward thing','<p>So do not try to replace your clarity or lack of clarity with confidence</p>','engm050323');
Query OK, 1 row affected (0.002 sec)

RT>SELECT id,HIGHLIGHT({},'content,contentid') FROM products WHERE MATCH('clarity replace engm050323');
+---------------------+--------------------------------------------------------------------------------------------------------------------------+
| id                  | highlight({},'content,contentid')                                                                                        |
+---------------------+--------------------------------------------------------------------------------------------------------------------------+
| 4325207677463429121 | <p>So do not try to <b>replace</b> your <b>clarity</b> or lack of <b>clarity</b> with confidence</p> | <b>engm050323</b> |
+---------------------+--------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.001 sec)

RT>SELECT id,HIGHLIGHT({},'content') FROM products WHERE MATCH('clarity replace engm050323');
+---------------------+---------------------------------------------------------------------------------+
| id                  | highlight({},'content')                                                         |
+---------------------+---------------------------------------------------------------------------------+
| 4325207677463429121 | <p>So do not try to replace your clarity or lack of clarity with confidence</p> |
+---------------------+---------------------------------------------------------------------------------+
1 row in set (0.001 sec)

This is a fundamental limitation of how highlighting works: the content field on its own does not match the whole query. In general, you should always highlight the same fields that the query matches (which, by default, is all fields!).

In SQL at least, you can use the SNIPPET() function instead, which allows specifying a different query for the snippet. You can then specify a query that does not have to match all the words:

RT>SELECT id,SNIPPET(content,'clarity|replace|engm050323') FROM  products WHERE MATCH('clarity replace engm050323');
+---------------------+------------------------------------------------------------------------------------------------------+
| id                  | snippet(content,'clarity|replace|engm050323')                                                        |
+---------------------+------------------------------------------------------------------------------------------------------+
| 4325207677463429121 | <p>So do not try to <b>replace</b> your <b>clarity</b> or lack of <b>clarity</b> with confidence</p> |
+---------------------+------------------------------------------------------------------------------------------------------+
1 row in set (0.001 sec)

barryhunter, Aug 23 '22 12:08

One 'workaround' is to specifically formulate the query so that HIGHLIGHT can work, e.g. allow it to match fields containing any of the words. But you might then need to add extra criteria if you really only want documents that match all the words.

Again it's SQL, but this should be convertible to the Python API (a wrapper around the HTTP API); it still uses vanilla HIGHLIGHT functionality.

RT>SELECT id,WEIGHT() w,HIGHLIGHT({},'content') FROM  products WHERE MATCH('"clarity replace engm050323"/1') AND w=3 OPTION ranker=expr('doc_word_count');
+---------------------+------+------------------------------------------------------------------------------------------------------+
| id                  | w    | highlight({},'content')                                                                              |
+---------------------+------+------------------------------------------------------------------------------------------------------+
| 4325207677463429121 |    3 | <p>So do not try to <b>replace</b> your <b>clarity</b> or lack of <b>clarity</b> with confidence</p> |
+---------------------+------+------------------------------------------------------------------------------------------------------+
1 row in set (0.001 sec)

But this only works for simple queries; if you start adding expressions, it gets more complicated. Plus, in this simplified example you are forgoing normal ranking, because the ranking formula is used to still require all the words.
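
As a rough illustration of how the quorum workaround above might be generated from Python, the sketch below builds the same SQL statement from a plain list of words. The helper name `quorum_highlight_sql` and its parameters are hypothetical. It assumes the query consists only of plain words (no operators), that `doc_word_count` counts unique matched keywords, and that the table and field names come from trusted code. It only builds the string and does not talk to the server.

```python
import re


def quorum_highlight_sql(table, field, squery):
    """Build the quorum-based workaround statement for a plain list of words.

    Hypothetical helper: `table` and `field` are inserted as-is, so they
    must be trusted identifiers, not user input.
    """
    # Keep only word characters, so the result contains no quotes or operators.
    # dict.fromkeys drops repeated words while keeping their order.
    words = list(dict.fromkeys(re.findall(r"\w+", squery.lower())))
    if not words:
        raise ValueError("query contains no words")
    # "w1 w2 w3"/1 is a quorum match: at least one of the words must match.
    match_expr = '"{}"/1'.format(" ".join(words))
    # Assumption: requiring w == number of unique words restores "match all words".
    return (
        f"SELECT id, WEIGHT() w, HIGHLIGHT({{}}, '{field}') FROM {table} "
        f"WHERE MATCH('{match_expr}') AND w={len(words)} "
        "OPTION ranker=expr('doc_word_count')"
    )


sql = quorum_highlight_sql("products", "content", "clarity replace engm050323")
print(sql)
```

The resulting string could then be sent through whichever SQL channel is already in use; the limitations mentioned above (simple queries only, normal ranking given up) still apply.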

barryhunter, Aug 23 '22 12:08

Ah, just found highlight_query

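import re

# Same loop as in the original script, with highlight_query added.
for squery in queries:
    # Replace runs of whitespace with the OR operator: "a b c" becomes "a|b|c".
    highlight_query = re.sub(r"\s+", "|", squery)
    resp = searchApi.search({"index": "products", "query": {"query_string": squery}, "highlight": {"fields": ["content"], "before_match": '<span class="match">', "after_match": '</span>', "limit": 0, "encoder": "default", "highlight_query": highlight_query}})
    print(resp)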

This just uses the re library to replace whitespace with the OR operator. Again, it only copes with very simple queries.
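
A slightly more tolerant variant of the whitespace replacement could look like the sketch below. It is only an illustration: `build_or_query` and `FIELD_OPERATOR` are hypothetical names, and the regular expression covers just the common field-operator forms (`@field`, `@!field`, `@(f1,f2)`, `@*`, `@field[50]`). Other operators such as negation or phrases are not handled, so their words would simply be OR-ed in as well.

```python
import re

# Matches field operators such as @title, @!title, @(title,content), @*, @title[50].
FIELD_OPERATOR = re.compile(r"@!?(?:\([^)]*\)|[\w*]+)(?:\[\d+\])?")


def build_or_query(squery):
    """Turn a search query into 'word1|word2|...' for use as a highlight query."""
    # Drop field operators so field names are not treated as search words.
    without_fields = FIELD_OPERATOR.sub(" ", squery)
    # dict.fromkeys removes repeated words while keeping their order.
    words = dict.fromkeys(re.findall(r"\w+", without_fields))
    return "|".join(words)


print(build_or_query("clarity replace engm050323"))
print(build_or_query("medieval (@contentid engi1)"))
```

The returned string can be used wherever the `re.sub` result is used in the snippet above, or as the query argument of SNIPPET() in SQL.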

barryhunter, Aug 23 '22 12:08

So here are my 2 cents:

By default, the SQL HIGHLIGHT() function and its JSON equivalent "highlight": {} try to highlight the results according to the full-text query. That is, in the general case, when you don't specify which fields to highlight, the highlighting is based on your full-text query.

But if you specify the fields to highlight, it works differently: it highlights only if the full-text query matches within the selected fields.

Perhaps there should be another mode which would NOT change the behaviour when you specify fields. We'll discuss it.
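
The behaviour described in this comment can be mimicked with a small toy model. This is not how Manticore implements highlighting; `toy_highlight` is a hypothetical function that only handles plain space-separated words and assumes every word is required. It wraps matches in `<b>` tags only when the selected fields, taken together, contain every query term, which reproduces the two SQL results shown earlier in the thread.

```python
import re


def toy_highlight(doc, query, fields):
    """Toy model: highlight only if the selected fields together match every term."""
    terms = query.lower().split()
    # Collect all words that occur in the selected fields.
    selected_words = set()
    for name in fields:
        selected_words.update(re.findall(r"\w+", doc[name].lower()))
    if not terms or not all(term in selected_words for term in terms):
        # The query does not match within the selected fields: return them untouched.
        return {name: doc[name] for name in fields}
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    return {name: pattern.sub(r"<b>\1</b>", doc[name]) for name in fields}


# Sample document taken from the SQL example earlier in the thread.
doc = {
    "title": "Whether you will live joyfully or not should not be subject to anything because this is an inward thing",
    "content": "<p>So do not try to replace your clarity or lack of clarity with confidence</p>",
    "contentid": "engm050323",
}
query = "clarity replace engm050323"

both = toy_highlight(doc, query, ["content", "contentid"])
only_content = toy_highlight(doc, query, ["content"])
print(both["content"])
print(only_content["content"])
```

With both fields selected, all three terms are found and the content field is highlighted. With only content selected, engm050323 is missing from the selected text, so nothing is highlighted at all.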

sanikolaev, Aug 24 '22 05:08

@barryhunter @sanikolaev I see your points. But I think this behavior should at least be in the documentation.

EDIT: I have created a new issue for the points below.

I would like to highlight (no pun intended) another issue with highlighting. When there are several text fields (39 in my case), highlighting seems to break if a search term is found in one of the text fields declared towards the end of the config.

Here are two example indexes which are exactly the same, except that in one, the contentid field is declared first and the titleid field last. In the other, it is the opposite.

In the index where the contentid field is declared first, searches including terms found in the contentid field highlight correctly, but searches with terms from the titleid field break. Note that highlighting then fails in every field, not just the titleid field.

I have attached an example test.csv. The Python file to test the MRE is below. Sorry, I'm not very good with SQL, so I'm sticking with the Python API.

Config of the two indexes is also given below.

import csv
import manticoresearch
import requests

config = manticoresearch.Configuration(
    host = "http://127.0.0.1:9308"
)

client =  manticoresearch.ApiClient(config)
indexApi = manticoresearch.IndexApi(client)
searchApi = manticoresearch.SearchApi(client)
utilsApi = manticoresearch.UtilsApi(client)

with open('test.csv', 'r', encoding='utf-8', errors='ignore') as f:
    reader = csv.DictReader(f)
    data = list(reader)

print(data)

for i, row in enumerate(data):
    resp = indexApi.insert({"index" : "titleidfirst", "doc" : row})
    print(resp)
    resp = indexApi.insert({"index" : "contentidfirst", "doc" : row})
    print(resp)


queries = ['medieval', 'medieval (@contentid engi1)', 'medieval (@titleid engt1)']

url = 'http://localhost:9308/search'
session = requests.session()
for ixname in ['titleidfirst', 'contentidfirst']:
    print(ixname)
    for squery in queries:
        print(squery)
        data = {"index":ixname, "query":{"query_string":f"{squery}"},"highlight":{"limit": 0, "encoder": "default"}}
        resp = session.post(url, json=data)
        print(resp.json(strict=False))

Config of the indexes

index contentidfirst {
  charset_table = non_cjk
  type = rt
  path = D:\Code\manticore\indexdir\contentidfirst\contentidfirst
  rt_field = contentid
  rt_field = content
  rt_field = title
  rt_field = collection
  rt_field = sgcontent
  rt_field = access
  rt_field = jira
  rt_field = type
  rt_field = info
  rt_field = transcript
  rt_field = eventtype
  rt_field = published
  rt_field = spublished
  rt_field = language
  rt_field = event
  rt_field = status
  rt_field = child
  rt_field = parent
  rt_field = ppublished
  rt_field = cpublished
  rt_field = tag
  rt_field = summary
  rt_field = speaker
  rt_field = evententities
  rt_field = aboutentities
  rt_field = markup
  rt_field = attachment
  rt_field = tkeyword
  rt_field = mkeyword
  rt_field = venue
  rt_field = location
  rt_field = city
  rt_field = state
  rt_field = country
  rt_field = area
  rt_field = mstatus
  rt_field = markuper
  rt_field = userperm
  rt_field = titleid

  rt_mem_limit = 128M 

  preopen = 1
  min_infix_len = 2
  html_strip = 1
  index_sp = 1
  bigram_index = both_freq
  bigram_freq_words = a, am, an, and, are, as, at, be, but, by, can, did, do, for, i, if, in, is, it, its, no, not, of, on, or, so, to, was
}

index titleidfirst {
  charset_table = non_cjk
  type = rt
  path = D:\Code\manticore\indexdir\titleidfirst\titleidfirst
  rt_field = titleid
  rt_field = content
  rt_field = title
  rt_field = collection
  rt_field = sgcontent
  rt_field = access
  rt_field = jira
  rt_field = type
  rt_field = info
  rt_field = transcript
  rt_field = eventtype
  rt_field = published
  rt_field = spublished
  rt_field = language
  rt_field = event
  rt_field = status
  rt_field = child
  rt_field = parent
  rt_field = ppublished
  rt_field = cpublished
  rt_field = tag
  rt_field = summary
  rt_field = speaker
  rt_field = evententities
  rt_field = aboutentities
  rt_field = markup
  rt_field = attachment
  rt_field = tkeyword
  rt_field = mkeyword
  rt_field = venue
  rt_field = location
  rt_field = city
  rt_field = state
  rt_field = country
  rt_field = area
  rt_field = mstatus
  rt_field = markuper
  rt_field = userperm
  rt_field = contentid

  rt_mem_limit = 128M 

  preopen = 1
  min_infix_len = 2
  html_strip = 1
  index_sp = 1
  bigram_index = both_freq
  bigram_freq_words = a, am, an, and, are, as, at, be, but, by, can, did, do, for, i, if, in, is, it, its, no, not, of, on, or, so, to, was
}

regstuff, Aug 24 '22 12:08