tantivy
Reverse reverse index
I want to look up the terms for a given document, that is, Document -> Field -> Terms: something like what term vectors provide in Lucene. (However, I see positions are already stored a little differently in Tantivy.)
The use cases are things like analysing the term distribution in a document (for text classification, summarization, highlighting query terms) and copying an individual indexed document to another index.
I'm thinking of this as a HashMap<DocId, Vec<Term>> where Term (somehow) is a reference to the term in the termdict. There would be one of these reverse reverse indexes per reverse index segment, so we (somehow) need to participate in the merge process. I notice that Lucene.NET has an interface 'InvertedDocConsumer', which is how term vectors (and something called 'freqprox') hook into the indexing chain, so maybe that's a place to draw inspiration.
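To make the idea a bit more concrete, here is a rough std-only sketch of what I have in mind. Nothing in it is Tantivy API: `ForwardIndexSegment`, `TermOrd` and `merge` are names I made up, the term dictionary is just a sorted `Vec<String>`, and the merge assumes the second segment's doc ids are simply shifted by an offset.

```rust
use std::collections::{BTreeMap, HashMap};

/// Segment-local document id.
type DocId = u32;
/// Hypothetical ordinal of a term inside one segment's term dictionary.
type TermOrd = u32;

/// Hypothetical per-segment forward index: doc -> sorted, deduplicated term ordinals.
#[derive(Default, Debug)]
struct ForwardIndexSegment {
    /// Sorted term dictionary; the position of a term is its ordinal.
    terms: Vec<String>,
    /// For each document, the ordinals of the terms it contains.
    doc_terms: HashMap<DocId, Vec<TermOrd>>,
}

impl ForwardIndexSegment {
    /// Builds a segment from already-tokenized documents.
    fn build(docs: &[(DocId, Vec<&str>)]) -> Self {
        // Collect the distinct terms in sorted order, then assign ordinals.
        let mut dict: BTreeMap<&str, TermOrd> = BTreeMap::new();
        for (_, tokens) in docs {
            for &tok in tokens {
                dict.insert(tok, 0);
            }
        }
        for (ord, slot) in dict.values_mut().enumerate() {
            *slot = ord as TermOrd;
        }
        let mut doc_terms: HashMap<DocId, Vec<TermOrd>> = HashMap::new();
        for (doc, tokens) in docs {
            let mut ords: Vec<TermOrd> = tokens.iter().map(|t| dict[t]).collect();
            ords.sort_unstable();
            ords.dedup();
            doc_terms.insert(*doc, ords);
        }
        ForwardIndexSegment {
            terms: dict.keys().map(|t| t.to_string()).collect(),
            doc_terms,
        }
    }

    /// Resolves the stored ordinals of `doc` back to term strings.
    fn terms_for(&self, doc: DocId) -> Vec<&str> {
        self.doc_terms
            .get(&doc)
            .map(|ords| ords.iter().map(|&o| self.terms[o as usize].as_str()).collect())
            .unwrap_or_default()
    }
}

/// Merges two segments: docs of `b` are shifted by `b_doc_offset`, and every
/// ordinal is remapped into the merged dictionary.
fn merge(a: &ForwardIndexSegment, b: &ForwardIndexSegment, b_doc_offset: DocId) -> ForwardIndexSegment {
    let mut terms: Vec<String> = a.terms.iter().chain(b.terms.iter()).cloned().collect();
    terms.sort();
    terms.dedup();
    // Old ordinal -> new ordinal, found by binary search in the merged dictionary.
    let remap = |seg: &ForwardIndexSegment| -> Vec<TermOrd> {
        seg.terms.iter().map(|t| terms.binary_search(t).unwrap() as TermOrd).collect()
    };
    let (map_a, map_b) = (remap(a), remap(b));
    let mut doc_terms: HashMap<DocId, Vec<TermOrd>> = HashMap::new();
    for (doc, ords) in &a.doc_terms {
        doc_terms.insert(*doc, ords.iter().map(|&o| map_a[o as usize]).collect());
    }
    for (doc, ords) in &b.doc_terms {
        doc_terms.insert(*doc + b_doc_offset, ords.iter().map(|&o| map_b[o as usize]).collect());
    }
    ForwardIndexSegment { terms, doc_terms }
}
```

The part I care about is the merge: because ordinals are segment-local, every per-document list has to be rewritten whenever segments are merged, which is why this structure would need to take part in the merge process.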
Edit: It looks like Recorder might be the right interface in Tantivy for writing to this new index. For example, TermFrequencyRecorder.
Can you share any initial thoughts on how you might approach this? Even the very first things that come to mind would likely greatly accelerate me if I try to extend Tantivy to support this kind of index.
Some notes as I poke around Tantivy.
Storing terms as a fastfield
I thought this might be implemented with fastfields (columnar storage) which "is designed for the fast random access of some document fields given a document id". ~~However, I can't see a way to actually read a fastfield for a given document id. Maybe I want something row-oriented instead?~~
Edit: Oh!
Perhaps as a minimal first pass, I could just mark my existing text field as a fast field and specify a tokenizer. Then, wrap a column reader like FacetReader does, to allow fetching by a document id.
https://github.com/quickwit-oss/tantivy/blob/2f5a269c70855afb983fa3fb5d299fa0001f713f/src/fastfield/writer.rs#L134-L146
This method would mean my text gets tokenized twice though, and I'd be storing the whole term(?) rather than just a term ordinal. https://github.com/quickwit-oss/tantivy/pull/1325, an implementation of fastfields for strings, might be relevant here. (However, it looks like the codebase has changed a lot since this PR. For example, the postings writer no longer seems to pass an 'unordered_term_id' to the fastfield module.)
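For my own reference, this is roughly the layout I picture when I say "store a term ordinal per document in a column". It is a std-only sketch, not Tantivy's actual columnar format: `TermOrdColumn` and its fields are hypothetical, and the doc id is assumed to be the document's position in the input.

```rust
/// Hypothetical multivalued column: every doc's term ordinals in one flat array.
struct TermOrdColumn {
    /// Sorted dictionary of distinct terms; index = ordinal.
    dictionary: Vec<String>,
    /// `offsets[doc]..offsets[doc + 1]` is the range of `ordinals` owned by `doc`.
    offsets: Vec<usize>,
    /// Term ordinals of all docs, concatenated in doc order.
    ordinals: Vec<u32>,
}

impl TermOrdColumn {
    /// Builds the column from tokenized docs; doc id = position in `docs`.
    fn from_docs(docs: &[Vec<&str>]) -> Self {
        let mut dictionary: Vec<String> =
            docs.iter().flatten().map(|t| t.to_string()).collect();
        dictionary.sort();
        dictionary.dedup();
        let mut offsets = vec![0];
        let mut ordinals = Vec::new();
        for tokens in docs {
            for &tok in tokens {
                // Every token is in the dictionary, so the search cannot fail.
                let ord = dictionary
                    .binary_search_by(|d| d.as_str().cmp(tok))
                    .unwrap();
                ordinals.push(ord as u32);
            }
            offsets.push(ordinals.len());
        }
        TermOrdColumn { dictionary, offsets, ordinals }
    }

    /// Random access by doc id: a slice of that doc's ordinals, in token order.
    fn term_ords(&self, doc: u32) -> &[u32] {
        let d = doc as usize;
        &self.ordinals[self.offsets[d]..self.offsets[d + 1]]
    }

    /// Resolves the ordinals of `doc` back to term strings.
    fn terms(&self, doc: u32) -> Vec<&str> {
        self.term_ords(doc)
            .iter()
            .map(|&o| self.dictionary[o as usize].as_str())
            .collect()
    }
}
```

Each distinct term is stored once in the dictionary and documents only hold small integers, so lookup by doc id is two reads into `offsets` plus a slice. Storing the whole term per document would repeat the strings instead.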
Getting the terms per document
The token stream for a document is processed into terms in PostingsWriter.index_text:
https://github.com/quickwit-oss/tantivy/blob/2f5a269c70855afb983fa3fb5d299fa0001f713f/src/postings/postings_writer.rs#L138-L155
PostingsWriters must implement a subscribe function to handle the doc and term.
https://github.com/quickwit-oss/tantivy/blob/2f5a269c70855afb983fa3fb5d299fa0001f713f/src/postings/postings_writer.rs#L201-L221
SpecializedPostingsWriter<>, for example, instantiates and calls a recorder for each term. However, a recorder does not (currently) have access to the specific term.
https://github.com/quickwit-oss/tantivy/blob/2f5a269c70855afb983fa3fb5d299fa0001f713f/src/postings/recorder.rs#L49-L85
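To convince myself that hooking in at the subscribe level would be enough, I sketched the shape of it with std only. This is a simplified stand-in, not Tantivy's real trait signatures: `TokenSubscriber`, `DocTermsCollector` and the toy whitespace tokenizer in `index_text` are all made up for illustration.

```rust
use std::collections::BTreeMap;

type DocId = u32;

/// Hypothetical, simplified stand-in for a postings-writer hook: called once per token,
/// with the doc, the token position and the term itself.
trait TokenSubscriber {
    fn subscribe(&mut self, doc: DocId, position: u32, term: &str);
}

/// Collects, for each document, its terms together with their positions.
#[derive(Default)]
struct DocTermsCollector {
    per_doc: BTreeMap<DocId, BTreeMap<String, Vec<u32>>>,
}

impl TokenSubscriber for DocTermsCollector {
    fn subscribe(&mut self, doc: DocId, position: u32, term: &str) {
        self.per_doc
            .entry(doc)
            .or_default()
            .entry(term.to_string())
            .or_default()
            .push(position);
    }
}

impl DocTermsCollector {
    /// Term frequencies for one document, in lexicographic term order.
    fn term_freqs(&self, doc: DocId) -> Vec<(&str, usize)> {
        self.per_doc
            .get(&doc)
            .map(|terms| terms.iter().map(|(t, p)| (t.as_str(), p.len())).collect())
            .unwrap_or_default()
    }
}

/// Feeds a text through the hook using a toy tokenizer (whitespace split + lowercase).
fn index_text(sub: &mut dyn TokenSubscriber, doc: DocId, text: &str) {
    for (pos, tok) in text.split_whitespace().enumerate() {
        sub.subscribe(doc, pos as u32, &tok.to_lowercase());
    }
}
```

For example, feeding "The quick fox the lazy fox" as doc 7 gives that document the frequencies `fox: 2, lazy: 1, quick: 1, the: 2`. Since the hook sees the term alongside the doc id, it can build the per-document view directly, which a per-term recorder (as far as I can tell) cannot.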
Exposing as an option
This could be exposed as another (or a different kind of) IndexRecordOption, which should work whether we implement a new recorder or need to implement a whole different PostingsWriter.
https://github.com/quickwit-oss/tantivy/blob/2f5a269c70855afb983fa3fb5d299fa0001f713f/src/postings/per_field_postings_writer.rs#L33-L46
Alternatively, it might be nice to enable this per document. For example, so I can just keep this kinda index for the latest ~20% of documents. In which case, maybe this could be implemented as a new field type.
You could load the document from the docstore and tokenize the text to get the terms
> You could load the document from the docstore and tokenize the text to get the terms
Ah, but I am not storing this (quite large) text field
The fast field (columnar) version should work too I think, but the dictionary is not shared between the inverted index and the columnar storage
> The fast field (columnar) version should work too I think, but the dictionary is not shared between the inverted index and the columnar storage
Did it used to be shared? I see your contribution here is passing an UnorderedTermId to a column writer.
Edit: Maybe that's what this is about https://github.com/quickwit-oss/tantivy/issues/1705#issuecomment-1334716250
It can't be shared anymore, since a different tokenizer can now be defined for the fast field than for the inverted index