manticoresearch
With fulltext search query across multiple fields but highlight on only one field, highlighting fails
Describe the bug
I have an RT index in plain mode with 3 full-text fields. When my search query includes terms that match in more than one of those fields, but highlighting is requested for only one of them, highlighting fails. I am using the Python API client. The same request sent with curl also fails.
MRE
The attached test.csv contains a sample document.
The Python code below inserts this document and then runs a search twice. The search query includes terms that match in the content and contentid fields. The first search requests highlights in both fields and works as expected. The second search requests highlights in the content field only, and highlighting fails even for that field.
Not sure if this is a bug or intentional. If it is intentional, I think it is a bit counter-intuitive. A person would expect whatever matches in the fields specified for highlighting to be highlighted.
import csv
import manticoresearch

config = manticoresearch.Configuration(
    host = "http://127.0.0.1:9308"
)
client = manticoresearch.ApiClient(config)
indexApi = manticoresearch.IndexApi(client)
searchApi = manticoresearch.SearchApi(client)
utilsApi = manticoresearch.UtilsApi(client)

# Load the sample document(s) and insert them into the products index
with open('test.csv', 'r', encoding='utf-8', errors='ignore') as f:
    reader = csv.DictReader(f)
    data = list(reader)
print(data)
for i, row in enumerate(data):
    resp = indexApi.insert({"index" : "products", "doc" : row})
    print(resp)

queries = ['clarity replace engm050323']
for squery in queries:
    print(squery)
    # Highlight both content and contentid: works as expected
    resp = searchApi.search({"index":"products", "query":{"query_string":f"{squery}"},"highlight":{"fields":["content", "contentid"], "before_match": '<span class="match">', "after_match": '</span>', "limit": 0, "encoder": "default"}})
    print(resp)
    # Highlight content only: nothing is highlighted, even in content
    resp = searchApi.search({"index":"products", "query":{"query_string":f"{squery}"},"highlight":{"fields":["content"], "before_match": '<span class="match">', "after_match": '</span>', "limit": 0, "encoder": "default"}})
    print(resp)
Just testing this out. I recoded it as SQL, which I find easier for testing:
RT>CREATE TABLE products (title text, content text, contentid text);
Query OK, 0 rows affected (0.006 sec)
RT>INSERT INTO products (title,content,contentid) VALUES ('Whether you will live joyfully or not should not be subject to anything because this is an inward thing','<p>So do not try to replace your clarity or lack of clarity with confidence</p>','engm050323');
Query OK, 1 row affected (0.002 sec)
RT>SELECT id,HIGHLIGHT({},'content,contentid') FROM products WHERE MATCH('clarity replace engm050323');
+---------------------+--------------------------------------------------------------------------------------------------------------------------+
| id | highlight({},'content,contentid') |
+---------------------+--------------------------------------------------------------------------------------------------------------------------+
| 4325207677463429121 | <p>So do not try to <b>replace</b> your <b>clarity</b> or lack of <b>clarity</b> with confidence</p> | <b>engm050323</b> |
+---------------------+--------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.001 sec)
RT>SELECT id,HIGHLIGHT({},'content') FROM products WHERE MATCH('clarity replace engm050323');
+---------------------+---------------------------------------------------------------------------------+
| id | highlight({},'content') |
+---------------------+---------------------------------------------------------------------------------+
| 4325207677463429121 | <p>So do not try to replace your clarity or lack of clarity with confidence</p> |
+---------------------+---------------------------------------------------------------------------------+
1 row in set (0.001 sec)
This is a fundamental limitation of how highlighting works: the content field on its own does not match the whole query. In general, you should highlight the same fields that the query matches (which by default is all fields!).
In SQL at least, you can use the SNIPPET() function instead, which lets you specify a different query for the snippet. You can then specify a query that doesn't have to match all the words:
RT>SELECT id,SNIPPET(content,'clarity|replace|engm050323') FROM products WHERE MATCH('clarity replace engm050323');
+---------------------+------------------------------------------------------------------------------------------------------+
| id | snippet(content,'clarity|replace|engm050323') |
+---------------------+------------------------------------------------------------------------------------------------------+
| 4325207677463429121 | <p>So do not try to <b>replace</b> your <b>clarity</b> or lack of <b>clarity</b> with confidence</p> |
+---------------------+------------------------------------------------------------------------------------------------------+
1 row in set (0.001 sec)
One 'workaround' is to specifically formulate the query so that HIGHLIGHT works, e.g. allow it to match fields containing any of the words. You might then need to add extra criteria if you really only want documents that match all the words.
Again it's SQL, but this should be convertible to the Python API (a wrapper around the HTTP API); it still uses vanilla HIGHLIGHT functionality.
RT>SELECT id,WEIGHT() w,HIGHLIGHT({},'content') FROM products WHERE MATCH('"clarity replace engm050323"/1') AND w=3 OPTION ranker=expr('doc_word_count');
+---------------------+------+------------------------------------------------------------------------------------------------------+
| id | w | highlight({},'content') |
+---------------------+------+------------------------------------------------------------------------------------------------------+
| 4325207677463429121 | 3 | <p>So do not try to <b>replace</b> your <b>clarity</b> or lack of <b>clarity</b> with confidence</p> |
+---------------------+------+------------------------------------------------------------------------------------------------------+
1 row in set (0.001 sec)
But this only works for simple queries; once you start adding expressions, it gets more complicated. Also, this simplified example forgoes normal ranking, because the ranking formula is used to still require all the words.
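For plain space-separated queries, the statement above can be generated rather than typed by hand. The sketch below is a hypothetical helper (build_quorum_sql is not part of any Manticore client). It only builds the SQL string, does no quoting or escaping, and assumes that doc_word_count counts each distinct query word once, so duplicate words are dropped before counting.

```python
def build_quorum_sql(table, field, squery):
    """Build the 'match any word, then require all words via WEIGHT()' statement.

    Hypothetical helper for illustration only: nothing is escaped, so it is
    suitable only for plain words without operators or quote characters.
    """
    # Drop duplicate words while keeping their order (assumption: the
    # doc_word_count ranking factor counts each distinct word once).
    words = list(dict.fromkeys(squery.split()))
    # "w1 w2 w3"/1 is a quorum query: at least one of the words must match.
    match = '"' + " ".join(words) + '"/1'
    return (
        f"SELECT id, WEIGHT() w, HIGHLIGHT({{}}, '{field}') "
        f"FROM {table} WHERE MATCH('{match}') "
        f"AND w={len(words)} OPTION ranker=expr('doc_word_count')"
    )


sql = build_quorum_sql("products", "content", "clarity replace engm050323")
print(sql)
```

The returned string can be pasted into a SQL console; sending it through a client is left out here because the exact call depends on the client version.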
Ah, just found highlight_query
import re
highlight_query = re.sub(r"\s+","|",squery)
resp = searchApi.search({"index":"products", "query":{"query_string":f"{squery}"},"highlight":{"fields":["content"], "before_match": '<span class="match">', "after_match": '</span>', "limit": 0, "encoder": "default", "highlight_query":f"{highlight_query}"}})
This just uses the re library to replace whitespace with the OR operator (|). Again, it only copes with very simple queries.
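One small wrinkle with the re.sub approach is that leading or trailing whitespace in the query turns into a stray | at the ends. A minimal sketch that avoids this is below; build_highlight_query and build_search_request are hypothetical helper names, and the request body simply mirrors the shape used in the snippet above (highlight_query passed as a plain string), which is an assumption to verify against your Manticore version.

```python
def build_highlight_query(squery):
    """Join whitespace-separated words with the OR operator (|)."""
    # str.split() with no arguments also drops leading/trailing whitespace,
    # so no empty terms end up in the result.
    return "|".join(squery.split())


def build_search_request(index, squery, fields):
    """Assemble a search body in the same shape as the snippet above."""
    return {
        "index": index,
        "query": {"query_string": squery},
        "highlight": {
            "fields": list(fields),
            "before_match": '<span class="match">',
            "after_match": "</span>",
            "limit": 0,
            "encoder": "default",
            # Assumption: passed as a plain string, mirroring the snippet above.
            "highlight_query": build_highlight_query(squery),
        },
    }


request = build_search_request("products", "  clarity replace   engm050323 ", ["content"])
print(request["highlight"]["highlight_query"])
```

The resulting dict can be passed to searchApi.search() in place of the inline literal. Like the original, it only handles queries made of plain words.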
So here are my 2 cents:
The SQL HIGHLIGHT() function and its JSON equivalent "highlight":{} by default highlight the results according to the full-text query. I.e., in the general case, when you don't specify which fields to highlight, the highlighting is based on your full-text query as it matched the document.
But if you specify the fields to highlight, it works differently: it highlights only if the full-text query matches the selected fields.
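To make the rule concrete, here is a toy model in plain Python. It is not Manticore's implementation, only a sketch that reproduces the three observations from this thread: highlighting content and contentid works, highlighting content alone returns nothing, and an OR-style query on content alone works again. The function name toy_highlight and the require_all flag are made up for the illustration.

```python
import re


def toy_highlight(doc, terms, fields, require_all=True):
    """Toy model only: wrap matching terms in <b>..</b> within the selected fields."""
    selected = {name: doc[name] for name in fields}
    # Collect every word that occurs anywhere in the selected fields.
    present = set()
    for text in selected.values():
        present.update(re.findall(r"\w+", text.lower()))
    wanted = [t.lower() for t in terms]
    if require_all and not all(t in present for t in wanted):
        # The AND query is not satisfied by the selected fields alone,
        # so nothing is highlighted, not even the words that are present.
        return dict(selected)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in wanted) + r")\b", re.IGNORECASE
    )
    return {name: pattern.sub(r"<b>\1</b>", text) for name, text in selected.items()}


doc = {
    "content": "<p>So do not try to replace your clarity or lack of clarity with confidence</p>",
    "contentid": "engm050323",
}
terms = ["clarity", "replace", "engm050323"]

print(toy_highlight(doc, terms, ["content", "contentid"]))      # both fields highlighted
print(toy_highlight(doc, terms, ["content"]))                   # nothing highlighted
print(toy_highlight(doc, terms, ["content"], require_all=False))  # OR-style query
```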
Perhaps there should be another mode which would NOT change the behaviour when you specify fields. We'll discuss it.
@barryhunter @sanikolaev I see your points. But I think this behavior should at least be in the documentation.
EDIT: Have created a new issue for the below points. Would like to highlight (no pun intended) another issue with highlighting. When there are several text fields (39 in my case), highlighting seems to break if a search term is found in the text fields declared towards the end of the config.
Here are two example indexes which are exactly the same, except that in one, the contentid field is declared first and the titleid field last. In the other, it is the opposite.
In the index where contentid field is declared first, searches including terms found in contentid field highlight correctly, but searches with terms from titleid field break. Note that the highlighting does not happen in any field, and not just the titleid field.
I have attached an example test.csv. The Python file to test the MRE is below. Sorry, I'm not very good with SQL, so I'm sticking with the Python API.
Config of the two indexes is also given below.
import csv
import manticoresearch
import requests

config = manticoresearch.Configuration(
    host = "http://127.0.0.1:9308"
)
client = manticoresearch.ApiClient(config)
indexApi = manticoresearch.IndexApi(client)
searchApi = manticoresearch.SearchApi(client)
utilsApi = manticoresearch.UtilsApi(client)

# Insert every row of the sample file into both indexes
with open('test.csv', 'r', encoding='utf-8', errors='ignore') as f:
    reader = csv.DictReader(f)
    data = list(reader)
print(data)
for i, row in enumerate(data):
    resp = indexApi.insert({"index" : "titleidfirst", "doc" : row})
    print(resp)
    resp = indexApi.insert({"index" : "contentidfirst", "doc" : row})
    print(resp)

queries = ['medieval', 'medieval (@contentid engi1)', 'medieval (@titleid engt1)']
url = 'http://localhost:9308/search'
session = requests.session()
# Run the same queries against both indexes and compare the highlighting
for ixname in ['titleidfirst', 'contentidfirst']:
    print(ixname)
    for squery in queries:
        print(squery)
        data = {"index":ixname, "query":{"query_string":f"{squery}"},"highlight":{"limit": 0, "encoder": "default"}}
        resp = session.post(url, json=data)
        print(resp.json(strict=False))
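Reading the printed responses by eye gets tedious with this many fields, so a small check can flag the hits that came back without any highlighting. This is only a sketch: hits_missing_highlight is a hypothetical helper, and it assumes the JSON response keeps hits under hits.hits, each with an _id and a highlight mapping of field name to a list of snippets. The sample response is invented for the demonstration.

```python
def hits_missing_highlight(response):
    """Return the _id of every hit whose highlight section is empty or absent.

    Assumes a response shaped like:
    {"hits": {"hits": [{"_id": ..., "highlight": {"field": ["snippet", ...]}}]}}
    """
    missing = []
    for hit in response.get("hits", {}).get("hits", []):
        highlight = hit.get("highlight") or {}
        # A hit counts as highlighted if any field has at least one non-empty snippet.
        has_snippet = any(
            snippet for snippets in highlight.values() for snippet in snippets
        )
        if not has_snippet:
            missing.append(hit.get("_id"))
    return missing


# Invented sample response, only to show the helper in action.
sample = {
    "hits": {
        "hits": [
            {"_id": "1", "highlight": {"content": ["a <b>medieval</b> town"]}},
            {"_id": "2", "highlight": {}},
            {"_id": "3"},
        ]
    }
}
print(hits_missing_highlight(sample))
```

In the loop above it could be called as hits_missing_highlight(resp.json(strict=False)) for each index and query, which makes the difference between titleidfirst and contentidfirst easier to spot.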
Config of the indexes
index contentidfirst {
charset_table = non_cjk
type = rt
path = D:\Code\manticore\indexdir\contentidfirst\contentidfirst
rt_field = contentid
rt_field = content
rt_field = title
rt_field = collection
rt_field = sgcontent
rt_field = access
rt_field = jira
rt_field = type
rt_field = info
rt_field = transcript
rt_field = eventtype
rt_field = published
rt_field = spublished
rt_field = language
rt_field = event
rt_field = status
rt_field = child
rt_field = parent
rt_field = ppublished
rt_field = cpublished
rt_field = tag
rt_field = summary
rt_field = speaker
rt_field = evententities
rt_field = aboutentities
rt_field = markup
rt_field = attachment
rt_field = tkeyword
rt_field = mkeyword
rt_field = venue
rt_field = location
rt_field = city
rt_field = state
rt_field = country
rt_field = area
rt_field = mstatus
rt_field = markuper
rt_field = userperm
rt_field = titleid
rt_mem_limit = 128M
preopen = 1
min_infix_len = 2
html_strip = 1
index_sp = 1
bigram_index = both_freq
bigram_freq_words = a, am, an, and, are, as, at, be, but, by, can, did, do, for, i, if, in, is, it, its, no, not, of, on, or, so, to, was
}
index titleidfirst {
charset_table = non_cjk
type = rt
path = D:\Code\manticore\indexdir\titleidfirst\titleidfirst
rt_field = titleid
rt_field = content
rt_field = title
rt_field = collection
rt_field = sgcontent
rt_field = access
rt_field = jira
rt_field = type
rt_field = info
rt_field = transcript
rt_field = eventtype
rt_field = published
rt_field = spublished
rt_field = language
rt_field = event
rt_field = status
rt_field = child
rt_field = parent
rt_field = ppublished
rt_field = cpublished
rt_field = tag
rt_field = summary
rt_field = speaker
rt_field = evententities
rt_field = aboutentities
rt_field = markup
rt_field = attachment
rt_field = tkeyword
rt_field = mkeyword
rt_field = venue
rt_field = location
rt_field = city
rt_field = state
rt_field = country
rt_field = area
rt_field = mstatus
rt_field = markuper
rt_field = userperm
rt_field = contentid
rt_mem_limit = 128M
preopen = 1
min_infix_len = 2
html_strip = 1
index_sp = 1
bigram_index = both_freq
bigram_freq_words = a, am, an, and, are, as, at, be, but, by, can, did, do, for, i, if, in, is, it, its, no, not, of, on, or, so, to, was
}