tantivy
Generate meaningful SegmentIDs instead of pure random
Is your feature request related to a problem? Please describe. Related to #969. I would like to suggest a cheap feature that will help with debugging in the future. SegmentIDs are currently generated randomly, which wastes 16 bytes that could otherwise carry debugging information.
Describe the solution you'd like. Generate a SegmentID containing the following info:
- timestamp of segment creation
- segment origin (merging or writing new data)
- hash of hostname, probably useful for those who will implement sharding/replication paired with Tantivy.
In addition, we should make sure that enough randomness is left to rule out collisions, and that names do not grow long enough to bloat the metadata (I am not sure this is a real concern, since the number of segments is supposed to be relatively low by design, afaik).
I would be interested in working on this feature, but I am uncertain how to move forward.
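To make the proposal concrete, here is a minimal sketch of how the three fields could share the existing 16 bytes with a random tail. The layout, the `SegmentOrigin` enum, and the `build_segment_id` and `to_hex` helpers are all hypothetical and not part of Tantivy; the entropy is passed in by the caller so the sketch stays free of external crates.

```rust
/// Hypothetical origin tag; Tantivy does not define this type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SegmentOrigin {
    /// Segment produced by indexing new documents.
    Write = 0,
    /// Segment produced by merging existing segments.
    Merge = 1,
}

/// Packs debugging info into 16 bytes using an assumed layout:
///   bytes 0..6   creation time, milliseconds since the Unix epoch, big-endian
///   byte  6      origin tag
///   bytes 7..9   truncated hostname hash
///   bytes 9..16  caller-supplied random bytes (56 bits)
pub fn build_segment_id(
    unix_millis: u64,
    origin: SegmentOrigin,
    host_hash: u16,
    entropy: [u8; 7],
) -> [u8; 16] {
    let mut id = [0u8; 16];
    // Keep the low 48 bits of the timestamp; big-endian so that byte order
    // matches chronological order.
    id[0..6].copy_from_slice(&unix_millis.to_be_bytes()[2..8]);
    id[6] = origin as u8;
    id[7..9].copy_from_slice(&host_hash.to_be_bytes());
    id[9..16].copy_from_slice(&entropy);
    id
}

/// Renders the ID as 32 lowercase hex characters.
pub fn to_hex(id: &[u8; 16]) -> String {
    id.iter().map(|b| format!("{:02x}", b)).collect()
}
```

With this layout the name length is unchanged from a 16-byte random ID, and 56 bits of randomness remain. Whether that is enough to rule out collisions depends on how many segments can be created within the same millisecond on the same host.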
hash of hostname, probably useful for those who will implement sharding/replication paired with Tantivy.
This sounds to me like a feature outside of the scope of Tantivy and better suited to projects that implement a distributed search engine over it.
segment origin (merging or writing new data)
From what I understand of segment merging, I would have the following layout most of the time:
segment-1-merge
segment-2-new
segment-3-new
...
segment-20-new
It feels like this information could be deduced from the creation timestamp of a file, which ls -l would show: the oldest file is probably one that got merged. However, with this alone I can't be sure that the oldest file results from a merge.
I had a look at Lucene for some inspiration: from what I can tell, a segment name is created by IndexWriter#newSegmentName which increases a segment counter.
It seems that a counter would resolve the previous shortcoming: if the counter is 1, then for sure the oldest file is a newly created segment.
What about a segment ID that is just a generation counter? The nice aspect of the current UUID approach, however, is that a segment can be created by any thread with no synchronization needed.
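The trade-off can be sketched as follows, assuming a hypothetical `SegmentCounter` type that is not part of Tantivy or Lucene. A counter needs state shared by every thread that creates segments, whereas a random ID needs none.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical shared counter that hands out sequential segment names.
pub struct SegmentCounter {
    next: AtomicU64,
}

impl SegmentCounter {
    /// `start` is the first number that will be handed out.
    pub fn new(start: u64) -> Self {
        SegmentCounter {
            next: AtomicU64::new(start),
        }
    }

    /// Returns the next name. `fetch_add` is atomic, so concurrent callers
    /// never receive the same number, but they all contend on one value.
    pub fn next_name(&self) -> String {
        let n = self.next.fetch_add(1, Ordering::Relaxed);
        format!("segment-{}", n)
    }
}
```

An atomic increment is cheap, so the synchronization cost is small in this sketch. The harder part is probably that the counter would also have to be persisted and restored across restarts to keep names unique, which a random ID avoids.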
This sounds to me like a feature outside of the scope of Tantivy and better suited to projects that implement a distributed search engine over it.
Well, we could consider passing an optional string to the IndexWriter, but I agree #999 would be a much better fit. Let's try to properly flesh out an exhaustive list of what we could want to see in the segment name and in the segment metas, and then arbitrate.
@ppodolsky @scampi @guilload @fmassot
- meaningful lexicographical sort (sort by commit timestamp)
- a label given to the IndexWriter (hostname)... But then what happens when merging two segments with different labels? Do we really want to enforce siloing these labels?
- segment generation
- previous generation ids
- commit id (due to multithreading... a commit gives birth to more than one segment)
- obviously we want it to be universally unique
- a not-so-unique, human-friendly label to help humans memorize and communicate labels when investigating a bug
What else?
Is it worth including the Tantivy version? Not the footer version, but the version of Tantivy itself.
But then what happens when merging two segments with different labels? Do we really want to enforce siloing these labels?
What do you mean by siloing? Keeping isolated sets of labels of all ancestors of the segment?
Options:
- Erase labels
- Overwrite labels with whatever is set on the index_writer performing the merge
- Allow users to define a policy of label merging
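The three options above could be modelled roughly as follows. `LabelMergePolicy` and `merged_label` are hypothetical names used only for illustration; nothing like them exists in Tantivy as far as this thread shows.

```rust
/// Hypothetical policy deciding which label a merged segment receives.
pub enum LabelMergePolicy {
    /// Drop all labels on merge.
    Erase,
    /// Use the label of the writer that performs the merge.
    Overwrite,
    /// Let the user combine the labels of the source segments.
    Custom(fn(&[&str]) -> Option<String>),
}

/// Computes the label of the merged segment under the given policy.
pub fn merged_label(
    policy: &LabelMergePolicy,
    writer_label: Option<&str>,
    source_labels: &[&str],
) -> Option<String> {
    match policy {
        LabelMergePolicy::Erase => None,
        LabelMergePolicy::Overwrite => writer_label.map(str::to_owned),
        LabelMergePolicy::Custom(combine) => combine(source_labels),
    }
}
```

A custom policy that keeps every ancestor label, for example by joining them, would correspond to the "isolated sets of labels of all ancestors" reading of siloing.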
This sounds to me like a feature outside of the scope of Tantivy and better suited to projects that implement a distributed search engine over it.
The naming policy of segments is fully managed by Tantivy, and there is no way for any other project to hook into it. It should be considered from the point of view of making debugging easier; details can be seen in #969.
So I'd like to suggest that we make a decision.
@fulmicoton said that segment attributes are a better approach, so what should be left to embed in segment names? It is probably worth shortening the list to the most important items:
- Unixtime of generation
- Natural or merging origin
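As a rough illustration of the shortened list, a human-readable name could carry just these two fields plus a random suffix for uniqueness. The format and the `segment_name` and `parse_segment_name` helpers below are assumptions made for this sketch, not an agreed design.

```rust
/// Builds a name such as "1600000000-merge-9f3a" (hypothetical format).
/// The timestamp is zero-padded to 10 digits so that plain string sorting
/// matches chronological order. `suffix` stands in for the random part.
pub fn segment_name(unix_secs: u64, merged: bool, suffix: &str) -> String {
    let origin = if merged { "merge" } else { "new" };
    format!("{:010}-{}-{}", unix_secs, origin, suffix)
}

/// Recovers (creation time, merged?) from a name built by `segment_name`.
/// Returns `None` if the name does not follow the assumed format.
pub fn parse_segment_name(name: &str) -> Option<(u64, bool)> {
    let mut parts = name.splitn(3, '-');
    let unix_secs = parts.next()?.parse().ok()?;
    let merged = match parts.next()? {
        "merge" => true,
        "new" => false,
        _ => return None,
    };
    Some((unix_secs, merged))
}
```

With names like these, a sorted directory listing would show segments in creation order with their origin visible at a glance.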
This sounds to me like a feature outside of the scope of Tantivy and better suited to projects that implement a distributed search engine over it.
The naming policy of segments is fully managed by Tantivy, and there is no way for any other project to hook into it. It should be considered from the point of view of making debugging easier; details can be seen in #969.
@ppodolsky I agree. My comment was about hashing the hostname, not about the naming of segments.