Hits returned is Nonetype object even for searches that should return multiple results

Open regstuff opened this issue 1 year ago • 4 comments

Describe the bug

I'm using the Python API client. For some text searches, resp.hits comes back as a NoneType object, as shown below.

{'aggregations': None,
 'hits': None,
 'profile': None,
 'timed_out': None,
 'took': None,
 'warning': None}

If there were zero results, I'd expect something like:
{'aggregations': None,
 'hits': {'hits': []},

But that's not the only problem. I'm getting the NoneType result for searches that I know should return multiple results. For example, my RT index has three fields, all of which are text: title, content and filename. The words human, quickly and choice are all present in the content, but searching for them like so:

@* human quickly choice

returns zero results. However, the strange thing is that if I search for

@* human quickly choice @filename engc09098

I actually get the expected result.

This behaviour is NOT present in all searches. Many text searches work fine. It seems that documents containing a certain subset of words in the content field have this issue. The index has 41659 documents according to the index status command, which is the expected number. I'd say maybe 200 or so documents are affected.

In fact, I took the above engc09098 document, reindexed it separately and added a random word, 'xibalba', to the content field. The search 'human quickly choice' again gave NoneType, but 'human quickly choice xibalba' showed one result as expected. I have even tried indexing just a single sentence, 'human quickly choice', and again I get NoneType for the above search.

I'm using Manticore 5.0.2 348514c86@220530 dev on Windows 10 10.0.18362.30 with Python 3.9.0. I'm running it in console mode, which is where I get the above messages when I print the search response. The Manticore config file is below, but I've tried changing the config, running in RT mode rather than plain mode, etc. The result is the same in every case.

common {
    #plugin_dir = /usr/local/lib/manticore
}

searchd {
    listen = 127.0.0.1:9312
    listen = 127.0.0.1:9306:mysql
    listen = 127.0.0.1:9308:http
    listen = 127.0.0.1:9318:_vip
    log = logs/searchd.log
    query_log = logs/query.log
    pid_file = run/searchd.pid
    #data_dir = indexdir/products/
    query_log_format = sphinxql
    query_log_min_msec  = 1
    binlog_max_log_size = 16M
    not_terms_only_allowed = 1
    max_packet_size = 128M
}

index products {
  charset_table = 0..9, english, _
  type = rt
  path = D:\Code\manticore\indexdir\products\products
  rt_field = content
  rt_field = title
  rt_field = contentid
  rt_mem_limit = 128M
  preopen = 1
  min_infix_len = 2
  html_strip = 1
  index_sp = 1
  index_zones = h*, title
  bigram_index = both_freq
  bigram_freq_words = a, am, an, and, are, as, at, be, but, by, can, did, do, for, i, if, in, is, it, its, no, not, isn, t, of, on, or, so, to, was
}

The index status command output is shown below:

[{'columns': [{'Variable_name': {'type': 'string'}}, {'Value': {'type': 'string'}}], 'data': [{'Variable_name': 'index_type', 'Value': 'rt'}, {'Variable_name': 'indexed_documents', 'Value': '41659'}, {'Variable_name': 'indexed_bytes', 'Value': '174271306'}, {'Variable_name': 'ram_bytes', 'Value': '6832'}, {'Variable_name': 'disk_bytes', 'Value': '195644760'}, {'Variable_name': 'disk_mapped', 'Value': '4468111'}, {'Variable_name': 'disk_mapped_cached', 'Value': '0'}, {'Variable_name': 'disk_mapped_doclists', 'Value': '0'}, {'Variable_name': 'disk_mapped_cached_doclists', 'Value': '0'}, 
{'Variable_name': 'disk_mapped_hitlists', 'Value': '0'}, {'Variable_name': 'disk_mapped_cached_hitlists', 'Value': '0'}, {'Variable_name': 'killed_documents', 'Value': '0'}, {'Variable_name': 'killed_rate', 'Value': '0.00%'}, {'Variable_name': 'ram_chunk', 'Value': '0'}, {'Variable_name': 'ram_chunk_segments_count', 'Value': '0'}, {'Variable_name': 'disk_chunks', 'Value': '1'}, {'Variable_name': 'mem_limit', 'Value': '134217728'}, {'Variable_name': 'mem_limit_rate', 'Value': '95.00%'}, {'Variable_name': 'ram_bytes_retired', 'Value': '0'}, {'Variable_name': 'tid', 'Value': '0'}, {'Variable_name': 'tid_saved', 'Value': '0'}, {'Variable_name': 'query_time_1min', 'Value': '{"queries":3, "avg_sec":0.019, "min_sec":0.011, "max_sec":0.024, "pct95_sec":0.024, "pct99_sec":0.024}'}, {'Variable_name': 'query_time_5min', 'Value': '{"queries":5, "avg_sec":0.021, "min_sec":0.011, "max_sec":0.025, "pct95_sec":0.025, "pct99_sec":0.025}'}, {'Variable_name': 'query_time_15min', 'Value': '{"queries":5, "avg_sec":0.021, "min_sec":0.011, "max_sec":0.025, "pct95_sec":0.025, "pct99_sec":0.025}'}, {'Variable_name': 'query_time_total', 'Value': '{"queries":5, "avg_sec":0.021, "min_sec":0.011, "max_sec":0.025, "pct95_sec":0.025, "pct99_sec":0.025}'}, {'Variable_name': 'found_rows_1min', 'Value': '{"queries":3, "avg":63, "min":36, "max":80, "pct95":80, "pct99":80}'}, {'Variable_name': 'found_rows_5min', 'Value': '{"queries":5, "avg":40, "min":6, "max":80, "pct95":80, "pct99":80}'}, 
{'Variable_name': 'found_rows_15min', 'Value': '{"queries":5, "avg":40, "min":6, "max":80, "pct95":80, "pct99":80}'}, {'Variable_name': 'found_rows_total', 'Value': '{"queries":5, "avg":40, "min":6, "max":80, "pct95":80, "pct99":80}'}], 'total': 29, 'error': '', 'warning': ''}]

Aug 21 '22 05:08 regstuff

Some further notes on the possible cause: it seems that searches returning documents that contain \x0b (the vertical tab) trigger this error. There may be other characters that cause it too, but this is the one I have identified so far. Running the same search with curl works even with this character. Using the Python requests library also works, but the response fails to parse as JSON because of the character. Passing strict=False when loading the JSON response, however, allows the result to be parsed. Perhaps there is a similar issue in the API client.
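
The strict=False behaviour described above can be shown with the standard library alone. This is a minimal sketch, not the API client's code: raw_body is a made-up stand-in for a server reply that contains a raw vertical tab, and parse_reply is a hypothetical helper name.

```python
import json

# Hypothetical reply body: the vertical tab (\x0b) sits raw inside a JSON
# string, which the JSON grammar does not allow (it must be escaped).
raw_body = '{"hits": {"total": 1, "hits": [{"_source": {"content": "a\x0bb"}}]}}'


def parse_reply(body):
    """Try strict parsing first, then fall back to strict=False."""
    try:
        return json.loads(body), "strict"
    except json.JSONDecodeError:
        # strict=False allows control characters inside strings
        return json.loads(body, strict=False), "lenient"


parsed, mode = parse_reply(raw_body)
print(mode)
print(repr(parsed["hits"]["hits"][0]["_source"]["content"]))
```

The fallback only works around the symptom on the client side. The reply itself is still not valid JSON, so other parsers may keep rejecting it.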

Aug 21 '22 09:08 regstuff

Could you provide source data with a document that reproduces this issue locally, along with the query and the reply for that query?

It is better to test a complete reproducible example than to try to figure out what could be wrong from a text description of a bad result set.

Aug 21 '22 10:08 tomatolog

test.csv

Here's one line from the document that causes the issue and contains the vertical tab. The code below reads the CSV file, writes it to an RT index in plain mode, and runs the search three ways: with the Python API, which gives NoneType; with Python requests, parsing the response with JSON strict=False, which works; and with strict=True, which raises the error. I have reproduced the console output of all three below the code for your reference.

import csv

import manticoresearch
import requests

# Connect to the Manticore HTTP endpoint
config = manticoresearch.Configuration(
    host="http://127.0.0.1:9308"
)

client = manticoresearch.ApiClient(config)
indexApi = manticoresearch.IndexApi(client)
searchApi = manticoresearch.SearchApi(client)
utilsApi = manticoresearch.UtilsApi(client)

# Load the sample rows; one of them contains the vertical tab (\x0b)
with open('test.csv', 'r', encoding='utf-8', errors='ignore') as f:
    reader = csv.DictReader(f)
    data = list(reader)

print(data)

# Insert every row into the RT index and print the last response
for row in data:
    resp = indexApi.insert({"index": "products", "doc": row})
print(resp)

queries = ['wherever whatever']

url = 'http://localhost:9308/search'
session = requests.Session()
for squery in queries:
    print(squery)
    # Same request body for the API client and the raw HTTP call
    # (named payload so it does not shadow the json module)
    payload = {"index": "products", "query": {"query_string": squery}, "highlight": {"fields": ["title", "content"]}}
    # 1. Python API client: hits come back as None
    resp = searchApi.search(payload)
    print(resp)
    # 2. Raw HTTP with lenient parsing: works
    resp = session.post(url, json=payload)
    print(resp.json(strict=False))
    # 3. Raw HTTP with default strict parsing: raises JSONDecodeError
    print(resp.json())

NoneType with API

{'aggregations': None,
 'hits': None,
 'profile': None,
 'timed_out': None,
 'took': None,
 'warning': None}

With JSON strict=False

{'took': 1, 'timed_out': False, 'hits': {'total': 1, 'total_relation': 'eq', 'hits': [{'_id': '1677721600001', '_score': 1500, '_source': {'content': 'Wherever you are, whatever you may be in your life, you want to be something more.</p><p>Participants: Yes.</p><p>\x0bSadhguru: If that something more happens, what? ', 'title': 'Whether you will live joyfully or not should not be subject to anything because this is an inward thing', 'contentid': 'engm050323'}, 'highlight': {'title': ['Whether you will live joyfully or not should not be subject to anything because this is an inward thing'], 'content': ['<b>Wherever</b> you are, <b>whatever</b> you may be in your life, you want to be something more. Participants: Yes. Sadhguru: If that something more happens, what? ']}}]}}

With JSON strict=True

Traceback (most recent call last):
  File "D:\Code\manticore\manticore\lib\site-packages\requests\models.py", line 971, in json
    return complexjson.loads(self.text, **kwargs)
  File "c:\program files\python39\lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "c:\program files\python39\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "c:\program files\python39\lib\json\decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 1 column 250 (char 249)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Code\manticore\test.py", line 37, in <module>
    print(resp.json())
  File "D:\Code\manticore\manticore\lib\site-packages\requests\models.py", line 975, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Invalid control character at: line 1 column 250 (char 249)

Aug 21 '22 12:08 regstuff

MRE

Use vert_tab.sql.tgz

➜  ~ mysql -P9306 -h0 < vert_tab.sql
➜  ~ mysql -P9306 -h0 -e "select * from t"
+---------------------+------+
| id                  | f    |
+---------------------+------+
| 1514956542686789635 | a
                         b  |
+---------------------+------+
➜  ~ curl -sX POST http://localhost:9308/search  -d '{"index": "t"}'|jq .
parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 1, column 136
➜  ~ curl -sX POST http://localhost:9308/search  -d '{"index": "t"}'
{"took":0,"timed_out":false,"hits":{"total":1,"total_relation":"eq","hits":[{"_id":"1514956542686789635","_score":1,"_source":{"f":"a
                                                                                                                                     b"}}]}}%

I.e. the special character doesn't get escaped, which causes jq to fail. The Python client has the same issue.
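
For comparison, here is a minimal Python sketch of what a compliant encoder produces for the same value. It uses the standard json module as a stand-in for a correct serializer and is not Manticore's code. The field name f comes from the MRE, and the unescaped string is an assumed approximation of the reply shown above.

```python
import json

doc = {"f": "a\x0bb"}  # same shape as the MRE row: 'a', vertical tab, 'b'

# A compliant encoder escapes the control character as \u000b
escaped = json.dumps(doc)
print(escaped)

# Approximation of the unfixed reply: the raw character inside the string
unescaped = '{"f": "a\x0bb"}'

# The escaped form survives a strict round trip
round_trip = json.loads(escaped)

# The unescaped form is rejected by a strict parser
try:
    json.loads(unescaped)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

print(round_trip == doc, strict_ok)
```

Escaping on the server side fixes every client at once, so no consumer needs a lenient-parsing workaround.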

Aug 22 '22 06:08 sanikolaev

Fixed in https://github.com/manticoresoftware/manticoresearch/commit/0c04edf625db673a93c211a3517e2a633d70497e

snikolaev@dev:~$ mysql -P9315 -h0 -e "status"|grep vers
Server version:		5.0.3 c53b5f1@220918 dev git branch master...origin/master
Protocol version:	10

snikolaev@dev:~$ mysql -P9315 -h0 < vert_tab.sql

snikolaev@dev:~$ mysql -P9315 -h0 -e "select * from t"
+---------------------+------+
| id                  | f    |
+---------------------+------+
| 2812039653278875649 | a
                         b  |
+---------------------+------+

snikolaev@dev:~$ curl -sX POST http://localhost:9316/search  -d '{"index": "t"}'|jq .
{
  "took": 0,
  "timed_out": false,
  "hits": {
    "total": 1,
    "total_relation": "eq",
    "hits": [
      {
        "_id": "2812039653278875649",
        "_score": 1,
        "_source": {
          "f": "a\u000bb"
        }
      }
    ]
  }
}

snikolaev@dev:~$ curl -sX POST http://localhost:9316/search  -d '{"index": "t"}'
{"took":0,"timed_out":false,"hits":{"total":1,"total_relation":"eq","hits":[{"_id":"2812039653278875649","_score":1,"_source":{"f":"a\u000bb"}}]}}

Sep 19 '22 06:09 sanikolaev
