tantivy
Questions about how to implement exact text match search.
I find tantivy very useful for fuzzy text queries, but now I have a scenario where I need to implement an exact text match. Here's a specific example:
I have a schema with two fields: row_id and text. The first 5 rows of data are as follows:
0 "The cat sleeps as the sun sets."
1 "Leaves rustle while the cat sleeps."
2 "The gentle breeze moves the leaves."
3 "Morning light shines as the breeze blows."
4 "In the morning light, stars twinkle and fade."
Now, I want to search for "Morning light shines as the breeze blows." so that tantivy only returns the fourth row. Is there a way I can achieve this during the search?
You can use phrase search for this; it's not an exact match, though. You could use the raw tokenizer for exact match.
@PSeitz Thank you for your response. I tried using the raw tokenizer locally, but I encountered a problem: the raw tokenizer can only be applied to text columns without spaces. If the text column indexed with raw contains spaces, it cannot be searched.
Here is an example:
Test 1.
schema: row_id: u64, raw_text: String
0 "The cat sleeps as the sun sets."
1 "Leaves rustle while the cat sleeps."
2 "The gentle breeze moves the leaves."
3 "Morning light shines as the breeze blows."
4 "In the morning light, stars twinkle and fade."
If I try to search for "Morning light shines as the breeze blows.", then no results are found.
Test 2.
schema: row_id: u64, raw_text: String
0 "The_cat_sleeps_as_the_sun_sets."
1 "Leaves_rustle_while_the_cat_sleeps."
2 "The_gentle_breeze_moves_the_leaves."
3 "Morning_light_shines_as_the_breeze_blows."
4 "In_the_morning_light,_stars_twinkle_and_fade."
If I try to search for "Morning_light_shines_as_the_breeze_blows.", I am able to successfully find the fourth result.
I added some println statements in the tokenizer of tantivy, and I noticed that the index_writer indeed does not tokenize the string during the writing process. However, when executing a query, the query string is tokenized. I suspect that this is the reason why strings containing spaces cannot be searched. I would like to know how to resolve this issue.
index_writer.add_document(doc! {
    row_id => 0 as u64,
    raw_text => "Alick a01",
    text => "Alick a01"
}).unwrap();
==========================
new raw tokenizer: "Alick a01"
RawTokenStream token: "Alick a01"
let raw_query_parser = QueryParser::for_index(&index, vec![schema.get_field("raw_text").unwrap()]);
let raw_text_query = raw_query_parser.parse_query("Alick a01").unwrap();
let raw_query_docs = searcher.search(&raw_text_query, &TopDocs::with_limit(10000)).expect("failed to search");
==========================
new raw tokenizer: "Alick"
RawTokenStream token: "Alick"
new raw tokenizer: "a01"
RawTokenStream token: "a01"
I think you need to make it a phrase query:
let raw_text_query = raw_query_parser.parse_query("\"Alick a01\"").unwrap();
Ideally the query parser would handle this use case.
Thank you very much for your answer; my problem has been solved. 💗❤️
I want to know the difference between these two pieces of code: parse_query("\"Alick a01\"") and parse_query("Alick a01"). Can it be understood that, under any tokenizer, parse_query("\"Alick a01\"") will always attempt an exact match?
@PSeitz Additionally, I've noticed that when indexing the following two sentences, searching for one of them returns results for both sentence 1 and sentence 2. However, if the word "a" is removed from both sentences, searching for one of them yields only one result.
- sentence 1:
Distant thunder rolls over a prairie.
- sentence 2:
Hummingbirds flutter in a flower garden.
I believe that although the word "a" appears in both sentences, there is actually no relevance between the two phrases. Therefore, searching for one of them should not return both sentences.
Are you referring to the MoreLikeThisQuery?
Here, I have provided a reproducible code snippet, where str_vec stores two sentences, both containing the word 'a'. During the search, when searching for the first sentence, the query count output is 2. If I remove the word 'a' from both sentences in str_vec, and also remove it from the searched sentence, the query count output changes to 1.
I believe this behavior is unreasonable, as the word 'a' should not indicate any association between these two sentences.
use tantivy::collector::Count;
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, FAST, INDEXED, TEXT};
use tantivy::{Document, Index};

#[test]
fn test_single_query() {
    let mut schema_builder = Schema::builder();
    let row_id = schema_builder.add_u64_field("row_id", FAST | INDEXED);
    let text = schema_builder.add_text_field("text", TEXT);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);
    let mut index_writer = index.writer_with_num_threads(8, 1024 * 1024 * 1024).unwrap();
    // Two English sentences, each containing the word 'a'.
    let str_vec: Vec<String> = vec![
        "Distant thunder rolls over a prairie.".to_string(),
        "Hummingbirds flutter in a flower garden.".to_string(),
    ];
    for i in 0..str_vec.len() {
        let mut temp = Document::default();
        temp.add_u64(row_id, i as u64);
        temp.add_text(text, &str_vec[i]);
        let _ = index_writer.add_document(temp);
    }
    index_writer.commit().unwrap();
    let reader = index.reader().unwrap();
    let searcher = reader.searcher();
    let schema = index.schema();
    let query_parser = QueryParser::for_index(&index, vec![schema.get_field("text").unwrap()]);
    // Search for the first sentence indexed above.
    let normal_text_query = query_parser.parse_query("Distant thunder rolls over a prairie.").unwrap();
    let normal_query_count = searcher.search(&normal_text_query, &Count).expect("failed to search");
    println!("normal query count:{:?}", normal_query_count);
    // Fails: the count is 2, because "a" also matches the second sentence.
    assert_eq!(normal_query_count, 1);
}
a is a term that gets indexed. The default behavior of the query parser is to OR-connect all terms, which also includes a. Ignoring words like a is done via stop words.
So how should I apply stop words during the indexing stage and the searching stage? I have tried using the en_stem tokenizer for indexing, but the results are still consistent with what was described above.
You need to add the StopWordFilter to your tokenizer.