tantivy
Questions about how to implement exact text match search.
I find tantivy very useful for fuzzy text queries, but now I have a scenario where I need to implement an exact text match. Here's a specific example:
I have a schema with two fields: row_id and text. The first 5 rows of data are as follows:
0 "The cat sleeps as the sun sets."
1 "Leaves rustle while the cat sleeps."
2 "The gentle breeze moves the leaves."
3 "Morning light shines as the breeze blows."
4 "In the morning light, stars twinkle and fade."
Now, I want to search for "Morning light shines as the breeze blows." so that tantivy only returns the fourth row. Is there a way I can achieve this during the search?
You can use phrase search for this; it's not an exact match, though. You could use the raw tokenizer for exact match.
@PSeitz Thank you for your response. I tried using the raw tokenizer locally, but I encountered a problem: the raw tokenizer can only be applied to text columns without spaces. If the text column indexed with raw contains spaces, it cannot be searched.
Here is an example:
Test 1.
schema: row_id: u64, raw_text: String
0 "The cat sleeps as the sun sets."
1 "Leaves rustle while the cat sleeps."
2 "The gentle breeze moves the leaves."
3 "Morning light shines as the breeze blows."
4 "In the morning light, stars twinkle and fade."
If I try to search for "Morning light shines as the breeze blows.", then no results are found.
Test 2.
schema: row_id: u64, raw_text: String
0 "The_cat_sleeps_as_the_sun_sets."
1 "Leaves_rustle_while_the_cat_sleeps."
2 "The_gentle_breeze_moves_the_leaves."
3 "Morning_light_shines_as_the_breeze_blows."
4 "In_the_morning_light,_stars_twinkle_and_fade."
If I try to search for "Morning_light_shines_as_the_breeze_blows.", I am able to successfully find the fourth result.
I added some println statements in the tokenizer of tantivy, and I noticed that the index_writer indeed does not tokenize the string during the writing process. However, when executing a query, the query string is tokenized. I suspect that this is the reason why strings containing spaces cannot be searched. I would like to know how to resolve this issue.
index_writer.add_document(doc! {
    row_id => 0 as u64,
    raw_text => "Alick a01",
    text => "Alick a01"
}).unwrap();
==========================
new raw tokenizer: "Alick a01"
RawTokenStream token: "Alick a01"
let raw_query_parser = QueryParser::for_index(&index, vec![schema.get_field("raw_text").unwrap()]);
let raw_text_query = raw_query_parser.parse_query("Alick a01").unwrap();
let raw_query_docs = searcher.search(&raw_text_query, &TopDocs::with_limit(10000)).expect("failed to search");
==========================
new raw tokenizer: "Alick"
RawTokenStream token: "Alick"
new raw tokenizer: "a01"
RawTokenStream token: "a01"
I think you need to make it a phrase query:
let raw_text_query = raw_query_parser.parse_query("\"Alick a01\"").unwrap();
Ideally the query parser would handle this use case.
Thank you very much for your answer; my problem has been solved. 💗❤️
I want to know the difference between these two pieces of code: parse_query("\"Alick a01\"") and parse_query("Alick a01"). Can it be understood that, under any tokenizer, parse_query("\"Alick a01\"") will always attempt an exact match?
@PSeitz Additionally, I've noticed that when indexing the following two sentences, searching for one of them returns results for both sentence 1 and sentence 2. However, if the word "a" is removed from both sentences, searching for one of them yields only one result.
- sentence 1:
Distant thunder rolls over a prairie.
- sentence 2:
Hummingbirds flutter in a flower garden.
I believe that although the word "a" appears in both sentences, there is actually no relevance between the two phrases. Therefore, searching for one of them should not return both sentences.
Are you referring to the MoreLikeThisQuery?
Here, I have provided a reproducible code snippet, where str_vec stores two sentences, both containing the word 'a'. During the search, when searching for the first sentence, the query count output is 2. If I remove the word 'a' from both sentences in str_vec, and also remove it from the searched sentence, the query count output changes to 1.
I believe this behavior is unreasonable, as the word 'a' should not indicate any association between these two sentences.
use tantivy::collector::Count;
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, FAST, INDEXED, TEXT};
use tantivy::{Document, Index};

#[test]
fn test_single_query() {
    let mut schema_builder = Schema::builder();
    let row_id = schema_builder.add_u64_field("row_id", FAST | INDEXED);
    let text = schema_builder.add_text_field("text", TEXT);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);
    let mut index_writer = index.writer_with_num_threads(8, 1024 * 1024 * 1024).unwrap();
    // Two English sentences, each containing the word 'a'.
    let str_vec: Vec<String> = vec![
        "Distant thunder rolls over a prairie.".to_string(),
        "Hummingbirds flutter in a flower garden.".to_string(),
    ];
    for i in 0..str_vec.len() {
        let mut temp = Document::default();
        temp.add_u64(row_id, i as u64);
        temp.add_text(text, &str_vec[i]);
        let _ = index_writer.add_document(temp);
    }
    index_writer.commit().unwrap();
    let reader = index.reader().unwrap();
    let searcher = reader.searcher();
    let schema = index.schema();
    let query_parser = QueryParser::for_index(&index, vec![schema.get_field("text").unwrap()]);
    // Search for the first sentence indexed above.
    let normal_text_query = query_parser.parse_query("Distant thunder rolls over a prairie.").unwrap();
    let normal_query_count = searcher.search(&normal_text_query, &Count).expect("failed to search");
    println!("normal query count:{:?}", normal_query_count);
    // Fails: the count is 2, because "a" also matches the second sentence.
    assert_eq!(normal_query_count, 1);
}
a is a term that gets indexed. The default behavior of the query parser is to OR-connect all terms, which also includes a. Ignoring words like a is done via stop words.
So how should I apply stop words during the indexing stage and the searching stage? I have tried using the en_stem tokenizer for indexing, but the results are still consistent with what was described above.
You need to add the StopWordFilter to your tokenizer.