
Empty results

Open changhiskhan opened this issue 2 years ago • 6 comments

Is it expected that tantivy sometimes returns empty results (no hits)?

The data is private, but the code roughly looks like the following:

  1. Create index:
import tantivy

schema_builder = tantivy.SchemaBuilder()
for name in text_fields:
    schema_builder.add_text_field(name, stored=True)
schema = schema_builder.build()

index = tantivy.Index(schema, path=index_path)
  2. Populate index:
writer = index.writer()
# Add one document per row of each batch.
for b in rows:
    for i in range(b.num_rows):
        doc = tantivy.Document()
        for name in text_fields:
            doc.add_text(name, b[name][i].as_py())
        writer.add_document(doc)
writer.commit()
  3. Search:
searcher = index.searcher()
query = index.parse_query(query)
results = searcher.search(query, limit)

I'm passing in a natural-language question. Sometimes, if I pass in a whole question, it returns something, but if I pass in only a few words, it can return nothing. I would have thought it always returns something, even if the scores are bad?

changhiskhan avatar Jun 06 '23 00:06 changhiskhan

This is a tantivy-py question: https://github.com/quickwit-oss/tantivy-py.

The tantivy query parser lets you choose whether queries are treated as a disjunction of terms (OR) or a conjunction (AND). By default it is supposed to be a disjunction (which is what you want), so I don't know what problem you are experiencing.
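To make the two modes concrete, here is a toy sketch in plain Python. It is not tantivy's implementation, and `tokenize`, `build_index`, and `search` are hypothetical names made up for illustration. It also shows one case where a disjunction legitimately returns nothing: when none of the query terms occurs in any indexed document.

```python
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on whitespace; a stand-in for a real tokenizer.
    return text.lower().split()

def build_index(docs):
    # Map each term to the set of document ids that contain it.
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return postings

def search(postings, query, conjunction=False):
    # One posting set per query term (empty if the term is not indexed).
    term_sets = [postings.get(term, set()) for term in tokenize(query)]
    if not term_sets:
        return set()
    if conjunction:
        # AND: a document must contain every query term.
        return set.intersection(*term_sets)
    # OR: a document must contain at least one query term.
    return set.union(*term_sets)

docs = ["fashionable blue shirt", "red wool sweater"]
postings = build_index(docs)

or_hits = search(postings, "blue sweater")                     # both documents
and_hits = search(postings, "blue sweater", conjunction=True)  # no document has both terms
no_hits = search(postings, "green trousers")                   # no term is indexed at all
```

In this model a disjunction never returns "bad but non-empty" results for a query whose terms are all absent from the index; a document needs at least one matching term to be a hit.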

Could you share the few keywords and the faulty query maybe?

Also, is the query in Chinese or any other CJK language?

fulmicoton avatar Jun 06 '23 01:06 fulmicoton

It's in English. The query is a natural-language question, something like "what is the right product for users who wants a fashionable blue shirt?"

The observed behavior is that most of these questions return answers but some of them don't, or the full question returns answers but just the first 4 words won't.

I would have expected it to return non-relevant answers rather than no answers. Is this a knob I set wrong?

changhiskhan avatar Jun 06 '23 19:06 changhiskhan

Thanks for answering, by the way. I'm continuing here, but let me know if I should create a new issue in tantivy-py instead to follow up.

changhiskhan avatar Jun 06 '23 19:06 changhiskhan

The "Natural query language" bullet point in the feature list is probably misleading. You can't phrase questions in natural language like that in tantivy (at least not without customization or additional tools). What the query language does provide is easy access to search on structured data, e.g. color:blue AND style:fashionable AND category:shirt
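As a rough illustration of what such a structured query means, here is a toy sketch in plain Python. It is not tantivy's query parser: `parse_fielded_query` and `matches` are hypothetical helpers that only handle the simple `field:term AND field:term` form, and the product records are made up.

```python
def parse_fielded_query(query):
    # Split "a:b AND c:d" into [("a", "b"), ("c", "d")].
    # Handles only this simple form; a real parser supports much more.
    clauses = []
    for clause in query.split(" AND "):
        field, _, term = clause.strip().partition(":")
        clauses.append((field, term))
    return clauses

def matches(doc, clauses):
    # Conjunction: every field must contain its requested term.
    return all(
        term in doc.get(field, "").lower().split()
        for field, term in clauses
    )

# Made-up records, for illustration only.
products = [
    {"color": "blue", "style": "fashionable", "category": "shirt"},
    {"color": "blue", "style": "casual", "category": "shirt"},
]

clauses = parse_fielded_query("color:blue AND style:fashionable AND category:shirt")
hits = [p for p in products if matches(p, clauses)]  # only the first record matches
```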

PSeitz avatar Jun 07 '23 01:06 PSeitz

@PSeitz But by default it should be parsed as a disjunction... You should get some results?

@changhiskhan Is this something where you used quotation marks maybe?

"what is the right product for users who wants a fashionable blue shirt?"

is a phrase query

what is the right product for users who wants a fashionable blue shirt?

should be a disjunction.
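The difference between the two forms can be sketched with a toy model in plain Python. This is not how tantivy evaluates queries, and `matches_any_term` and `matches_phrase` are hypothetical names: a phrase only matches when the terms appear consecutively and in order, while a disjunction matches as soon as any single term appears.

```python
def tokenize(text):
    # Lowercase and split on whitespace; a stand-in for a real tokenizer.
    return text.lower().split()

def matches_any_term(doc, query):
    # Disjunction: at least one query term occurs in the document.
    doc_terms = set(tokenize(doc))
    return any(term in doc_terms for term in tokenize(query))

def matches_phrase(doc, phrase):
    # Phrase: the query terms occur consecutively, in the same order.
    doc_terms = tokenize(doc)
    phrase_terms = tokenize(phrase)
    n = len(phrase_terms)
    if n == 0:
        return False
    return any(
        doc_terms[i:i + n] == phrase_terms
        for i in range(len(doc_terms) - n + 1)
    )

doc = "a fashionable shirt in blue"
query = "fashionable blue shirt"

disjunction_match = matches_any_term(doc, query)  # True: shares terms with the document
phrase_match = matches_phrase(doc, query)         # False: terms are not adjacent in this order
```

So a long question wrapped in quotation marks would only hit a document containing that exact word sequence, which makes empty results very likely.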

Maybe the query parser does not like the question mark at the end? Do you still get 0 results without the question mark? If so, tantivy-py should have raised an exception, and this would be the bug.

Can you tell us what it looks like when you print(query)?

fulmicoton avatar Jun 07 '23 02:06 fulmicoton

Sorry, the actual query is private, but thanks for the pointers on quoting. I already tried taking out punctuation, but that didn't seem to affect it.

Will poke and report back. Thanks!

changhiskhan avatar Jun 07 '23 20:06 changhiskhan