How do I handle documents with different types of content?

Open mainrs opened this issue 2 years ago • 6 comments

I am using tantivy to build a local indexer for files on my system. Different file formats have different properties.

Audio files contain metadata such as the composer, the album, the year, and the title. Pictures might contain metadata such as camera, location, and image height and width.

tantivy uses an immutable schema. What approaches can I take to build my schema? Should I build a single struct that holds every possible value, or build different schemas for different file types?

(I think the second option requires me to handle the querying of different indexes simultaneously, right?)

mainrs, Oct 18 '23 09:10

Currently you can either:

  1. Create multiple indices
  2. Create a schema that contains all the fields (not great if there is a type conflict)
  3. Move your data into a JSON field (only one config for all nested fields, supports mixed types)
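
As a rough illustration of option 2, the sketch below declares every field of every file type once and lets each document fill in only the fields it has. It uses the standard library only and makes no tantivy calls; `FileMeta`, `FieldValue`, `ALL_FIELDS`, and `to_fields` are hypothetical names, and the field set is just the metadata mentioned in the question.

```rust
/// Hypothetical metadata for the two file kinds mentioned in the question.
#[derive(Debug, Clone, PartialEq)]
enum FileMeta {
    Audio { composer: String, album: String, year: u32, title: String },
    Picture { camera: String, location: String, width: u32, height: u32 },
}

/// A value for one field of the combined schema.
#[derive(Debug, Clone, PartialEq)]
enum FieldValue {
    Text(String),
    Number(u64),
}

/// Every field of every file kind, declared once up front (option 2).
const ALL_FIELDS: [&str; 8] = [
    "composer", "album", "year", "title", "camera", "location", "width", "height",
];

/// Emit only the fields this file actually has; all other fields stay absent.
fn to_fields(meta: &FileMeta) -> Vec<(&'static str, FieldValue)> {
    match meta {
        FileMeta::Audio { composer, album, year, title } => vec![
            ("composer", FieldValue::Text(composer.clone())),
            ("album", FieldValue::Text(album.clone())),
            ("year", FieldValue::Number(u64::from(*year))),
            ("title", FieldValue::Text(title.clone())),
        ],
        FileMeta::Picture { camera, location, width, height } => vec![
            ("camera", FieldValue::Text(camera.clone())),
            ("location", FieldValue::Text(location.clone())),
            ("width", FieldValue::Number(u64::from(*width))),
            ("height", FieldValue::Number(u64::from(*height))),
        ],
    }
}

fn main() {
    let song = FileMeta::Audio {
        composer: "Example Composer".to_string(),
        album: "Example Album".to_string(),
        year: 2023,
        title: "Example Title".to_string(),
    };
    let photo = FileMeta::Picture {
        camera: "Example Cam".to_string(),
        location: "Example Place".to_string(),
        width: 4000,
        height: 3000,
    };
    for meta in [song, photo].iter() {
        let fields = to_fields(meta);
        // Every emitted field must be one that the combined schema declares.
        assert!(fields.iter().all(|(name, _)| ALL_FIELDS.contains(name)));
        println!("{:?}", fields);
    }
}
```

The type conflict mentioned in option 2 would show up here if two file kinds wanted the same field name with different `FieldValue` variants; in that case one of them has to be renamed or moved to another option.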

With the next release we'll maybe have dynamic schema support (basically JSON field type on the root level)

(I think the second option requires me to handle the querying of different indexes simultaneously, right?)

Yes

PSeitz, Oct 18 '23 09:10

With the next release we'll maybe have dynamic schema support (basically JSON field type on the root level)

Could you elaborate a little bit more on how this would work then?

mainrs, Oct 18 '23 09:10

Could you elaborate a little bit more on how this would work then?

The current design proposal is here: https://github.com/quickwit-oss/tantivy/issues/2215

PSeitz, Oct 18 '23 10:10

I am unsure whether dynamic schema support really changes the equation here: it would still be a single schema that contains all (optional) fields, just that the fields are not enumerated up front but detected at runtime.

And from purely a search perspective, I don't think having a schema that contains all fields of all file types really is a problem. It is only when you want to recreate e.g. an enumerated type from what is in the index that you need to consider what original file type created the found entry.

One approach I personally like is to use the index fields really only for searching and put a "lossless" representation of the Rust data type into the document store using a simple serialization format like bincode. This way, one can use the overlapping optional fields to find documents and deserialize exactly what was put into the index, no matter how complicated that Rust type was (as long as it can be serialized, e.g. via serde's Serialize and Deserialize traits).
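
A minimal sketch of that idea, under loud assumptions: it uses the standard library only, so a small hand-written byte encoding stands in for bincode, and a plain struct stands in for the tantivy document. `FileMeta`, `IndexedDoc`, `encode`, `decode`, and `to_document` are hypothetical names. In a real index the payload would presumably live in a stored bytes field next to the searchable fields.

```rust
/// Hypothetical record type; stands in for an arbitrarily complicated Rust type.
#[derive(Debug, Clone, PartialEq)]
enum FileMeta {
    Audio { title: String, year: u32 },
    Picture { camera: String, width: u32, height: u32 },
}

/// Stand-in for an indexed document: searchable fields plus an opaque payload.
struct IndexedDoc {
    search_text: Vec<(&'static str, String)>,
    payload: Vec<u8>,
}

/// Append a string as a little-endian u32 length followed by its UTF-8 bytes.
fn put_str(buf: &mut Vec<u8>, s: &str) {
    buf.extend_from_slice(&(s.len() as u32).to_le_bytes());
    buf.extend_from_slice(s.as_bytes());
}

/// Read a little-endian u32 and return it with the remaining input.
fn take_u32(input: &[u8]) -> Option<(u32, &[u8])> {
    if input.len() < 4 {
        return None;
    }
    let (head, rest) = input.split_at(4);
    Some((u32::from_le_bytes([head[0], head[1], head[2], head[3]]), rest))
}

/// Read a length-prefixed string and return it with the remaining input.
fn take_str(input: &[u8]) -> Option<(String, &[u8])> {
    let (len, rest) = take_u32(input)?;
    let len = len as usize;
    if rest.len() < len {
        return None;
    }
    let (bytes, rest) = rest.split_at(len);
    Some((String::from_utf8(bytes.to_vec()).ok()?, rest))
}

/// Lossless encoding: one tag byte for the variant, then its fields in order.
fn encode(meta: &FileMeta) -> Vec<u8> {
    let mut buf = Vec::new();
    match meta {
        FileMeta::Audio { title, year } => {
            buf.push(0);
            put_str(&mut buf, title);
            buf.extend_from_slice(&year.to_le_bytes());
        }
        FileMeta::Picture { camera, width, height } => {
            buf.push(1);
            put_str(&mut buf, camera);
            buf.extend_from_slice(&width.to_le_bytes());
            buf.extend_from_slice(&height.to_le_bytes());
        }
    }
    buf
}

/// Inverse of `encode`; returns None for truncated or unknown payloads.
fn decode(payload: &[u8]) -> Option<FileMeta> {
    let (tag, rest) = payload.split_first()?;
    match *tag {
        0 => {
            let (title, rest) = take_str(rest)?;
            let (year, _) = take_u32(rest)?;
            Some(FileMeta::Audio { title, year })
        }
        1 => {
            let (camera, rest) = take_str(rest)?;
            let (width, rest) = take_u32(rest)?;
            let (height, _) = take_u32(rest)?;
            Some(FileMeta::Picture { camera, width, height })
        }
        _ => None,
    }
}

/// Searchable fields are only for finding the document; the payload restores it.
fn to_document(meta: &FileMeta) -> IndexedDoc {
    let search_text = match meta {
        FileMeta::Audio { title, .. } => vec![("title", title.clone())],
        FileMeta::Picture { camera, .. } => vec![("camera", camera.clone())],
    };
    IndexedDoc { search_text, payload: encode(meta) }
}

fn main() {
    let original = FileMeta::Picture {
        camera: "Example Cam".to_string(),
        width: 4000,
        height: 3000,
    };
    let doc = to_document(&original);
    let restored = decode(&doc.payload).expect("payload was written by encode");
    assert_eq!(restored, original);
    println!("searchable: {:?}, restored: {:?}", doc.search_text, restored);
}
```

The searchable fields can be lossy (lowercased, tokenized, or missing altogether) because the hit is rebuilt from the payload, not from the indexed fields.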

So in summary, I would recommend to go with option 2 if your fields are known upfront (even if different from item to item) and do not change dynamically during runtime.

adamreichold, Oct 23 '23 06:10

So in summary, I would recommend to go with option 2 if your fields are known upfront (even if different from item to item) and do not change dynamically during runtime.

So if my use case might require dynamic fields, it would be better to use the JSON field? Is it possible to query that JSON field in a general way without specifying fields?

{
    "artist": "mainrs",
    "year": 2023
}

Could I simply query "mainrs" and get hits for the JSON above, or would I have to specifically query with artist:mainrs?

mainrs, Dec 05 '23 18:12

Yes, dynamic fields are supported with a JSON field.

You can query via mainrs or yourjsonfield.artist:mainrs

PSeitz, Dec 18 '23 03:12