Best approach to handling searches across multiple object schemas

Open Advait-M opened this issue 6 months ago • 3 comments

Hey! We have several different object types and a unified search across them - namely we have objects like:

  • Workflows - with names, descriptions, folders
  • Notebooks - with titles and content
  • Actions - with just the action name
  • Launch configs - with just the launch config name
  • etc

Each of these has a different set of field weightings. For workflows, for example, the fields in order of importance are name, description, and then the enclosing folder. We implement these weightings at query time with BoostQuery (snippet below).

        // Add term queries for all words except the last one.
        if words.len() > 1 {
            for word in &words[0..words.len() - 1] {
                for (field, weight) in self.weighted_search_fields.values() {
                    let term = Term::from_field_text(*field, word);
                    let term_query = build_term_query(term);
                    let weighted_query = Box::new(BoostQuery::new(
                        term_query,
                        // Boost the term query by the field weight, normalized by the total weight so the final
                        // score is in the range of roughly 0-5. Complex queries might have a score exceeding 5.
                        *weight * SCORE_BOOST_FACTOR / self.normalizing_factor,
                    ));
                    subqueries.push((Occur::Should, weighted_query));
                }
            }
        }
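To make the boost arithmetic in the snippet concrete, here is a minimal std-only sketch. It assumes that `normalizing_factor` is the sum of all field weights and that `SCORE_BOOST_FACTOR` is 5.0; neither value is shown in the snippet, and the weights and the `normalized_boosts` helper are hypothetical.

```rust
// Assumed value: the snippet above does not show the real constant.
const SCORE_BOOST_FACTOR: f32 = 5.0;

/// Returns (field name, boost) pairs. Assumes the normalizing factor is the
/// sum of all field weights, which the original snippet does not confirm.
fn normalized_boosts(weights: &[(&str, f32)]) -> Vec<(String, f32)> {
    let normalizing_factor: f32 = weights.iter().map(|(_, w)| *w).sum();
    weights
        .iter()
        .map(|(field, w)| {
            (field.to_string(), *w * SCORE_BOOST_FACTOR / normalizing_factor)
        })
        .collect()
}

fn main() {
    // Hypothetical workflow weights: name > description > folder.
    let boosts =
        normalized_boosts(&[("name", 3.0), ("description", 2.0), ("folder", 1.0)]);
    for (field, boost) in &boosts {
        println!("{field}: {boost:.2}");
    }
}
```

Under these assumptions the boosts for one word sum to `SCORE_BOOST_FACTOR`, which would explain why single-word scores stay in a bounded range while multi-word queries can exceed it.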

Currently, we've structured this as multiple Tantivy full-text searchers, one per data source, with a schema defined for each object type. When the user enters a search term in the command palette, we run the search across these searchers asynchronously and return an aggregated, ranked set of results.
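The aggregation step can be sketched roughly as below. This is a std-only illustration, not our actual code: the `Hit` struct and `merge_ranked` function are hypothetical, and the merge is only meaningful if scores from different searchers are on a comparable scale.

```rust
/// Hypothetical hit type: which searcher produced it, a document id, and a score.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    source: &'static str,
    id: u64,
    score: f32,
}

/// Merges per-source result lists into one list sorted by descending score,
/// keeping at most `limit` hits.
fn merge_ranked(per_source: Vec<Vec<Hit>>, limit: usize) -> Vec<Hit> {
    let mut all: Vec<Hit> = per_source.into_iter().flatten().collect();
    // total_cmp is a total order on f32, so a NaN score cannot panic the sort.
    all.sort_by(|a, b| b.score.total_cmp(&a.score));
    all.truncate(limit);
    all
}

fn main() {
    let workflows = vec![
        Hit { source: "workflows", id: 1, score: 4.2 },
        Hit { source: "workflows", id: 2, score: 1.1 },
    ];
    let notebooks = vec![Hit { source: "notebooks", id: 7, score: 3.0 }];
    for hit in merge_ranked(vec![workflows, notebooks], 2) {
        println!("{} #{} ({:.1})", hit.source, hit.id, hit.score);
    }
}
```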

However, we've seen that this scales the number of threads we spin up proportionally to the number of data sources, which isn't great (related to https://github.com/quickwit-oss/tantivy/issues/702).

An approach we're considering is the following:

  • Define a unified schema with all possible fields from every object type, with no inherent weightings/boosts
  • Objects like Actions would just have empty fields for any that aren't relevant for that object type
  • Extend the query-time piece to filter by object type first, and then use type-conditional BoostQuery instances to account for the weights

This would result in a single searcher running async.
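As a rough picture of the type-conditional part, the sketch below plans one clause group per object type: a required type filter plus optional boosted terms. It is std-only and does not call Tantivy; the weights, `ObjectType`, `TypeGroup`, and `plan_query` are all hypothetical names. In a real implementation each group would presumably map to a boolean query with an `Occur::Must` term on a type field and `Occur::Should` boosted terms.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ObjectType {
    Workflow,
    Notebook,
    Action,
}

/// Hypothetical per-type field weights; the real values are not given here.
fn field_weights(ty: ObjectType) -> &'static [(&'static str, f32)] {
    match ty {
        ObjectType::Workflow => &[("name", 3.0), ("description", 2.0), ("folder", 1.0)],
        ObjectType::Notebook => &[("title", 3.0), ("content", 1.0)],
        ObjectType::Action => &[("name", 1.0)],
    }
}

/// One group per object type: a required type filter plus optional boosted terms.
#[derive(Debug)]
struct TypeGroup {
    must_type: ObjectType,
    /// (field, word, boost) triples that should match.
    should_terms: Vec<(&'static str, String, f32)>,
}

fn plan_query(words: &[&str], types: &[ObjectType]) -> Vec<TypeGroup> {
    types
        .iter()
        .map(|&ty| {
            let should_terms: Vec<(&'static str, String, f32)> = words
                .iter()
                .flat_map(|word| {
                    field_weights(ty)
                        .iter()
                        .map(move |(field, w)| (*field, word.to_string(), *w))
                })
                .collect();
            TypeGroup { must_type: ty, should_terms }
        })
        .collect()
}

fn main() {
    let plan = plan_query(&["deploy"], &[ObjectType::Workflow, ObjectType::Action]);
    for group in &plan {
        println!("{group:?}");
    }
}
```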

Wanted to check if this is the recommended approach for this sort of search across different object types w/ different schemas? Thanks!

Advait-M avatar Jul 01 '25 19:07 Advait-M

I think the alternative approach you are describing takes the right direction.

> Objects like Actions would just have empty fields for any that aren't relevant for that object type

That is ok. Empty fields just don't appear in an inverted index (except by increasing deltas in posting lists).

If you plan to always apply different scoring logic to different object types, then instead of sharing fields between object types (e.g. a field called name that appears in both workflows and launch configs), you could also separate them entirely (e.g. have workflow.name and launch_config.name). This way you wouldn't need to filter by object type first. I'm not sure which one would be more efficient.
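A minimal sketch of what the separated-field variant could look like on the query-planning side, again std-only and with hypothetical weights and helper names. Each object type gets its own qualified field name, so the boost is attached to the field itself and no type filter is needed.

```rust
/// Hypothetical (object type, field, weight) table.
const WEIGHTS: &[(&str, &str, f32)] = &[
    ("workflow", "name", 3.0),
    ("workflow", "description", 2.0),
    ("workflow", "folder", 1.0),
    ("launch_config", "name", 1.0),
];

/// Builds qualified field names such as `workflow.name`, each with its own boost.
fn qualified_boosts() -> Vec<(String, f32)> {
    WEIGHTS
        .iter()
        .map(|(ty, field, w)| (format!("{ty}.{field}"), *w))
        .collect()
}

/// Expands one search word into (qualified field, word, boost) triples.
/// No object-type filter is involved: the field name already implies the type.
fn boosted_terms(word: &str) -> Vec<(String, String, f32)> {
    qualified_boosts()
        .into_iter()
        .map(|(field, boost)| (field, word.to_string(), boost))
        .collect()
}

fn main() {
    for (field, word, boost) in boosted_terms("deploy") {
        println!("{field}:{word} boost={boost}");
    }
}
```

The trade-off is a wider schema (one field per type and attribute) in exchange for simpler queries.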

rdettai avatar Jul 02 '25 08:07 rdettai

Got it - thanks for weighing in.

Makes sense re unique fields entirely too - will explore!

Advait-M avatar Jul 11 '25 18:07 Advait-M

Searchers have a search_with_executor method, by the way.

fulmicoton-dd avatar Aug 01 '25 05:08 fulmicoton-dd