tantivy About different indices and schemas

Open mainrs opened this issue 2 years ago • 2 comments

My use-case is an indexing agent that indexes certain websites that a user specifies. The websites are grouped into similar technology stacks.

For example, all ***.stackoverflow.com websites share the same structure, all websites running Wikimedia software behave the same, and so on. Each of these families might have different features and properties I would like to index. Depending on the family, it might be possible to extract more knowledge, or more structured knowledge, than from a plain website.

Other families could be audio files with a lot of metadata I'd like to be able to query: lyrics, year, artist. Maybe images too, with their size, timestamps, primary colors, aspect ratio, etc.

This question is a follow-up to #2221. Performance-wise, is it better to have one single, big index with sparse entries (most fields empty for any given document)? Or would it be better to have a separate index for each family mentioned above, with multiple readers accessing the index files simultaneously?
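To make the two layouts concrete, here is a rough, library-agnostic sketch in plain Rust. It does not use tantivy's API at all: `Family`, `SingleIndex`, `PerFamilyIndexes`, and `sample_data` are hypothetical names, the documents are made-up placeholders, and a linear scan stands in for a real inverted index. It only illustrates how queries are routed in each design.

```rust
use std::collections::HashMap;

// A document is just field name -> value here; a real index would use typed fields.
type Doc = HashMap<String, String>;

// Hypothetical families, named after the examples in this post.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Family {
    Forum,
    Wiki,
    Audio,
}

// Build a document from (field, value) pairs.
fn doc(fields: &[(&str, &str)]) -> Doc {
    fields.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}

// True if the document has `field` and its value equals `value`.
fn matches(d: &Doc, field: &str, value: &str) -> bool {
    d.get(field).map_or(false, |v| v.as_str() == value)
}

// Layout A: one collection; each document carries a family tag and only
// the fields it actually has (sparse entries).
struct SingleIndex {
    docs: Vec<(Family, Doc)>,
}

impl SingleIndex {
    // `family == None` searches across all families in a single pass.
    fn search(&self, family: Option<Family>, field: &str, value: &str) -> Vec<&Doc> {
        self.docs
            .iter()
            .filter(|(f, d)| family.map_or(true, |want| *f == want) && matches(d, field, value))
            .map(|(_, d)| d)
            .collect()
    }
}

// Layout B: one collection per family; the caller picks which one to query.
struct PerFamilyIndexes {
    by_family: HashMap<Family, Vec<Doc>>,
}

impl PerFamilyIndexes {
    // Only the chosen family's documents are touched.
    fn search(&self, family: Family, field: &str, value: &str) -> Vec<&Doc> {
        self.by_family
            .get(&family)
            .into_iter()
            .flatten()
            .filter(|d| matches(d, field, value))
            .collect()
    }

    // A cross-family query has to fan out over every per-family collection.
    fn search_all(&self, field: &str, value: &str) -> Vec<&Doc> {
        self.by_family
            .values()
            .flatten()
            .filter(|d| matches(d, field, value))
            .collect()
    }
}

// Placeholder data: the same three documents stored in both layouts.
fn sample_data() -> (SingleIndex, PerFamilyIndexes) {
    let rows = vec![
        (Family::Forum, doc(&[("title", "Example question"), ("tags", "rust")])),
        (Family::Wiki, doc(&[("title", "Example article")])),
        (
            Family::Audio,
            doc(&[("title", "Example Song"), ("artist", "Example Artist"), ("year", "1999")]),
        ),
    ];
    let mut by_family: HashMap<Family, Vec<Doc>> = HashMap::new();
    for (family, d) in &rows {
        by_family.entry(*family).or_default().push(d.clone());
    }
    (SingleIndex { docs: rows }, PerFamilyIndexes { by_family })
}
```

With this toy data, `single.search(Some(Family::Audio), "artist", "Example Artist")` and `split.search(Family::Audio, "artist", "Example Artist")` return the same document. The difference is where the routing happens: Layout A filters on a family tag inside one collection, while Layout B never looks at the other families but needs `search_all` for anything cross-family.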

It is hard to test this without building the index first, and prototyping it takes a lot of time. So I was hoping that people with more insight could give me a little help and advice.

I personally feel like the second approach might explode quickly if one has too many families.

Dec 05 '23 18:12 mainrs

Indexing or search performance? What type of query?

Dec 18 '23 04:12 PSeitz

@mainrs did you figure out an optimal solution for your problem? I am dealing with something quite similar and would appreciate any insight you may have to offer!

Jul 14 '24 09:07 gsidhu
