
Storing arbitrary attributes for segments

Open ppodolsky opened this issue 4 years ago • 4 comments

In my project, search index files are distributed through something like torrent technology. New replicas broadcast a request asking who can seed the segments identified by X, Y, Z, etc., and then leech all of those segments.

Merging removes old segments, so I have to choose between enabling merging and being able to seed old segments.

My proposal is to provide a way to store extra key-value data for segments. It would allow marking some segments as, e.g., "frozen" = true and omitting these segments when the merge policy selects merge candidates.
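A minimal sketch of the idea, assuming hypothetical standalone types (`SegmentInfo`, `merge_candidates`) rather than tantivy's actual segment metadata or merge policy API. It only illustrates how a "frozen" attribute could exclude segments from merge consideration:

```rust
use std::collections::HashMap;

/// Hypothetical per-segment metadata; NOT tantivy's real segment type.
#[derive(Debug, Clone)]
pub struct SegmentInfo {
    pub id: String,
    pub num_docs: u32,
    /// Arbitrary user-defined key-value attributes (the proposed addition).
    pub attributes: HashMap<String, String>,
}

impl SegmentInfo {
    pub fn new(id: &str, num_docs: u32) -> Self {
        SegmentInfo {
            id: id.to_string(),
            num_docs,
            attributes: HashMap::new(),
        }
    }

    /// Builder-style helper that attaches one attribute.
    pub fn with_attribute(mut self, key: &str, value: &str) -> Self {
        self.attributes.insert(key.to_string(), value.to_string());
        self
    }

    /// A segment counts as frozen only if "frozen" is exactly "true".
    pub fn is_frozen(&self) -> bool {
        self.attributes.get("frozen").map(String::as_str) == Some("true")
    }
}

/// Returns the ids of the segments a merge policy would be allowed to
/// consider, i.e. every segment that is not marked as frozen.
pub fn merge_candidates(segments: &[SegmentInfo]) -> Vec<&str> {
    segments
        .iter()
        .filter(|segment| !segment.is_frozen())
        .map(|segment| segment.id.as_str())
        .collect()
}
```

With this shape, a frozen segment stays on disk and remains seedable, while unfrozen segments are still merged as usual.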

There is probably a better way. What are your thoughts on it? I could try to implement it myself, so feel free to choose whichever option is best suited for further development.

ppodolsky avatar Mar 28 '21 08:03 ppodolsky

Attaching metadata to segments and exposing it to the merge policy seems like a great idea.

Can you work on an actual spec?

fulmicoton avatar Mar 29 '21 01:03 fulmicoton

Sure, but it seems we need to work out a solution in #971 first.

ppodolsky avatar Mar 29 '21 09:03 ppodolsky

@fulmicoton I've come back to this issue as I need it now :) I've looked at various approaches, and here are some initial thoughts:

  • meta.json already stores some attributes for each segment
  • Another way is to store attributes in a separate file within the segment

I'm slightly leaning toward having a separate file (something like .props). But it could be confusing, given that some attributes are already stored in meta.json.

The main advantage of moving all attributes (max_doc, deletes, and others we would like to have in the future) to a separate file is that segments could be distributed between servers more easily. Having all segment metadata inside the .props file means that you could attach a new segment to the index merely by putting its files into the index directory and adding its segment_id to the list of active index segments in meta.json.
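A rough sketch of that attach flow, under loud assumptions: the `key = value` line format for .props, the `parse_props` helper, and the `IndexMeta` struct are all hypothetical and exist only for illustration. Nothing here reflects tantivy's real meta.json layout or API:

```rust
use std::collections::BTreeMap;

/// Parses a hypothetical `.props` file made of `key = value` lines.
/// Blank lines and lines starting with `#` are ignored, as are lines
/// without an `=` separator.
pub fn parse_props(contents: &str) -> BTreeMap<String, String> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .filter_map(|line| line.split_once('='))
        .map(|(key, value)| (key.trim().to_string(), value.trim().to_string()))
        .collect()
}

/// Hypothetical stand-in for the list of segments kept in meta.json.
pub struct IndexMeta {
    pub segment_ids: Vec<String>,
}

impl IndexMeta {
    /// Registers a segment whose files were already copied into the
    /// index directory. Returns false if the id was already listed.
    pub fn attach_segment(&mut self, segment_id: &str) -> bool {
        if self.segment_ids.iter().any(|id| id.as_str() == segment_id) {
            return false;
        }
        self.segment_ids.push(segment_id.to_string());
        true
    }
}
```

The point of the split is that meta.json only needs to know which segment ids exist, while everything describing a segment travels with the segment's own files.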

ppodolsky avatar Aug 07 '21 16:08 ppodolsky

One thing that would be interesting for the quickwit use case is to store, per segment, the values a field takes when that field has a limited number of distinct values.

This information can be used for pruning, or in merge options that prefer to merge segments with the same values together. This wouldn't be arbitrary information, but it should probably be stored in the same file.
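To illustrate the pruning part, here is a small sketch with hypothetical types (`SegmentFieldValues`, `segments_to_search`) and a made-up field name; it is not quickwit's or tantivy's API. A segment is skipped only when it has recorded values for the field and the queried value is not among them:

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical per-segment record of the distinct values seen for
/// low-cardinality fields.
pub struct SegmentFieldValues {
    pub segment_id: String,
    /// Field name -> distinct values present in this segment.
    pub values: HashMap<String, BTreeSet<String>>,
}

impl SegmentFieldValues {
    pub fn new(segment_id: &str) -> Self {
        SegmentFieldValues {
            segment_id: segment_id.to_string(),
            values: HashMap::new(),
        }
    }

    /// Builder-style helper that records the distinct values of one field.
    pub fn with_values(mut self, field: &str, values: &[&str]) -> Self {
        self.values.insert(
            field.to_string(),
            values.iter().map(|value| value.to_string()).collect(),
        );
        self
    }
}

/// Returns the ids of the segments that may contain `field == value`.
/// Segments with no recorded values for the field are kept, because
/// nothing is known about them and they cannot be pruned safely.
pub fn segments_to_search<'a>(
    segments: &'a [SegmentFieldValues],
    field: &str,
    value: &str,
) -> Vec<&'a str> {
    segments
        .iter()
        .filter(|segment| match segment.values.get(field) {
            Some(recorded) => recorded.contains(value),
            None => true,
        })
        .map(|segment| segment.segment_id.as_str())
        .collect()
}
```

The same per-segment value sets could also feed a merge policy that groups segments sharing the same values, though that part is not sketched here.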

PSeitz avatar Aug 08 '21 08:08 PSeitz