tantivy
tantivy copied to clipboard
[feat] SegmentAttributes
Related to #999 This PR is a PoC. I've implemented following approach:
-
SegmentAttribute
enum, different options represent different types + different strategies of merging while merging segments - Every segment has
HashMap<String, SegmentAttribute>
akaSegmentAttributes
in its metadata -
IndexSettings
hasSegmentAttributesConfig
that is used to createSegmentAttributes
for inserting into every segment on segment creation. - While merging we additionally merge all
SegmentAttributes
and set them to a new segment.
@PSeitz @fulmicoton Could you review? I haven't implemented tests or helper functions because not sure if Tantivy still needs this feature at all.
Codecov Report
Merging #1512 (7c3df74) into main (96c93a6) will decrease coverage by
0.01%
. The diff coverage is85.41%
.
@@ Coverage Diff @@
## main #1512 +/- ##
==========================================
- Coverage 94.07% 94.05% -0.02%
==========================================
Files 259 259
Lines 49637 49716 +79
==========================================
+ Hits 46694 46760 +66
- Misses 2943 2956 +13
Impacted Files | Coverage Δ | |
---|---|---|
src/core/mod.rs | 100.00% <ø> (ø) |
|
src/core/index.rs | 90.80% <76.00%> (-0.69%) |
:arrow_down: |
src/core/index_meta.rs | 94.40% <76.92%> (-0.77%) |
:arrow_down: |
src/indexer/segment_updater.rs | 94.89% <87.17%> (-0.56%) |
:arrow_down: |
src/indexer/index_writer.rs | 96.79% <100.00%> (+0.01%) |
:arrow_up: |
src/indexer/log_merge_policy.rs | 98.12% <100.00%> (ø) |
|
src/indexer/merge_operation.rs | 100.00% <100.00%> (ø) |
|
src/indexer/segment_register.rs | 84.84% <100.00%> (ø) |
|
src/postings/stacker/expull.rs | 98.57% <0.00%> (-0.48%) |
:arrow_down: |
... and 3 more |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
Thank you @ppodolsky for this PR! We were a bit overwhelmed these days :) I don't have enough knowledge to say if such a feature can be added so I let @fulmicoton and @PSeitz answer on this.
But when looking a the code, I was not very fond of the enum, why not using a Trait for segment attributes?
Hi @fmassot ! Thank you for reply!
I can rewrite it to traits but I'd prefer to stabilize the approach verbally at first if the feature is still needed.
Further I'd like to confirm that putting segment attributes config in IndexMeta
and segment attributes in meta.json (not in separate file) is OK.
@fmassot
I have tried to implement it via traits and didn't like the way it looks. The main issue is that we need to store type. Enums are well suited for this purpose because they are serializable and naturally aligned with serialization format like
"<type_name>": <value>
.
Trait objects should be serialized in non-trivial way and we have to map and track trait and its serializable name for deserialization purposes
@ppodolsky I read again your PR and yes enum is fine per attribute... My thought about trait was on a higher level, I mean on the SegmentAttributes level. I don't have a genuine opinion on how to implement correctly this feature, and I did not want to add some noise here, sorry.
Concerning serialization/deserialization of traits, this can done with https://github.com/dtolnay/typetag, I believe it is just using an enum behind the scene.
Currently the behaviors are provided by tantivy. As an alternative, we could have a registry.
global_registry.register_attribute(name, &str, attribute: Attribute); // name has to be unique
trait Attribute{
fn merge(attrs: Vec<serde_json::Value>) -> serde_json::Value;
}
is_frozen attribute, ConjunctiveBool could be provided by tantivy.
{
"is_frozen": true
}
register_attribute("is_frozen", ConjunctiveBool)
Not sure if this too unwieldy.
@PSeitz Yes, I came to this design while trying to reimplement it with traits. I was slightly confused by using serde_json::Value as value for segment attribute, not sure if tying to specific serialization format is worth to do. And without serde_json::Value there will be a need to create a Value representation which would be enum again.
What do you think?
@PSeitz Yes, I came to this design while trying to reimplement it with traits. I was slightly confused by using serde_json::Value as value for segment attribute, not sure if tying to specific serialization format is worth to do. And without serde_json::Value there will be a need to create a Value representation which would be enum again.
What do you think?
The enum would look like serde_json::Value, so we can just use that. serde_json::Value is not tied to json de/serialization, it's implementing the De/Serialization traits.
Ok but still have questions. Is global registry ok? It will bind a set of available attributes to the compiled version of application. In the approach implemented here indices can manage a set of attributes by their own.
Also, trait objects are not very friendly with OnceCell or whatever would be used for the global registry. It will require synchronization too.
Finally, safe trait objects won't be able to incapsulate default values for attributes. Putting Value inside trait object will make trait non-safe, it would require manual implemenation of Serialize/Deserialize traits and so on. I honestly tried to implement it and might miss something but implementation looks rather messy
Ok but still have questions. Is global registry ok? It will bind a set of available attributes to the compiled version of application. In the approach implemented here indices can manage a set of attributes by their own.
True, the registry should probably have some attribute_name attribute association. The predefined attribute names would be reserved. So the current json format could stay the same, but conjunctive_bool
would be an attribute_name.
{"is_frozen:{conjunctive_bool:false}}
Finally, safe trait objects won't be able to incapsulate default values for attributes. Putting Value inside trait object will make trait non-safe, it would require manual implemenation of Serialize/Deserialize traits and so on. I honestly tried to implement it and might miss something but implementation looks rather messy
For de/serialization, you can just use the serde_json::Value.
Putting Value inside trait object will make trait non-safe
Why put Value inside the trait object, it's just merge behavior.
trait Attribute{
fn merge(attrs: Vec<serde_json::Value>) -> serde_json::Value; // There's no self
}
Finally, safe trait objects won't be able to incapsulate default values for attributes.
Why do you need default, only for merge right?
trait Attribute{
// handle default internally
fn merge(attrs: Vec<Option<serde_json::Value>>) -> serde_json::Value;
}
No global registry please.
Why do you need default, only for merge right?
Default is needed for any newly created segment, not just merge.
For example, this is_frozen
attribute should be set to false for new segments, manually turned to true during specific event (in my case it is index publication that freezes segments and should prevent them from further merging) and again, set to false after any merge (can still happen if run it manually). Here the default is false.
@ppodolsky The idea to have the serialized object embed the logic of how things should be merged is not interesting, but I don't think we want that here.
Here is a counter proposition...
What about you have a public trait.
pub trait SegmentAttribute: Default + Serializer + Deserializer {
fn merge(segment_attributes: &Self);
}
Internally to tantivy, you could then have a SegmentAttributeMergerImpl
.
struct SegmentAttributeMergerImpl<S: SegmentAttribute>;
And a boxable trait (also private to tantivy)
trait SegmentAttributeMerger {
fn merge_json(&self, segment_attributes_json: &[serde_json::Value]) -> serde_json::Value;
}
The index then would have public method
fn set_segment_attr::<S: SegmentAttribute>() {
let typed_segment_attr_merger: SegmentAttributeMergerImpl<S> = SegmentAttributeMergerImpl;
self.segment_attribute_merger = Some(Box::new(typed_segment_attr_merger));
}
Note that we are not making Index a generic here.
@fulmicoton Done :)
Friendly bump ⬆️ @fulmicoton
Bump:)
Regular bump @PSeitz @fulmicoton
@fulmicoton Done, sorry for being late.
Also, had to turn Box into Arc because you suggested to remove impl Clone for Box<...>
etc.