tantivy Reverse reverse index

I want to look up the terms for a given document, that is, Document -> Field -> Terms, something like what term vectors provide in Lucene. (However, I see positions are already stored a little differently in Tantivy.)

The use cases are things like analysing the term distribution in a document (for text classification, summarization, highlighting query terms) and copying an individual indexed document to another index.

I'm thinking of this like a HashMap<DocId, Vec<Term>> where Term (somehow) is a reference to the term in the termdict. There would be one of these HashMap<,> reverse reverse indexes per reverse index segment, so we (somehow) need to participate in the merge process. I notice that Lucene.NET has an interface 'InvertedDocConsumer', which is how term vectors (and something called 'freqprox') hook into the indexing chain, so maybe that's a place to draw inspiration. Edit: It looks like Recorder might be the right interface in Tantivy for writing to this new index. For example, TermFrequencyRecorder.
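To make the idea concrete, here is a minimal std-only sketch of such a per-segment map and of what "participating in the merge" could involve. Nothing here is Tantivy API: `SegmentForwardIndex`, `TermOrd`, `build`, `terms_of` and `merge` are hypothetical names, and the sorted `Vec<String>` is only a stand-in for a segment's term dictionary. The assumption is that each segment numbers its own docs and terms, so a merge has to shift doc ids and remap term ordinals into the merged dictionary.

```rust
use std::collections::{BTreeMap, HashMap};

type DocId = u32;
type TermOrd = u32;

/// Hypothetical per-segment forward index:
/// doc id -> (term ordinal, term frequency) pairs, sorted by ordinal.
#[derive(Debug)]
struct SegmentForwardIndex {
    /// Stand-in for the segment's sorted term dictionary;
    /// a term's ordinal is its position in this vector.
    terms: Vec<String>,
    docs: HashMap<DocId, Vec<(TermOrd, u32)>>,
}

impl SegmentForwardIndex {
    /// Builds the index from already-tokenized documents.
    fn build(input: &[(DocId, Vec<&str>)]) -> Self {
        // Count term frequencies per document (BTreeMap keeps terms sorted).
        let mut per_doc: HashMap<DocId, BTreeMap<&str, u32>> = HashMap::new();
        for (doc, tokens) in input {
            let freqs = per_doc.entry(*doc).or_default();
            for token in tokens {
                *freqs.entry(*token).or_insert(0) += 1;
            }
        }
        // Sorted, deduplicated dictionary for the whole segment.
        let mut terms: Vec<String> = per_doc
            .values()
            .flat_map(|freqs| freqs.keys().map(|t| t.to_string()))
            .collect();
        terms.sort();
        terms.dedup();
        let ord_of = |t: &str| {
            terms.binary_search_by(|probe| probe.as_str().cmp(t)).unwrap() as TermOrd
        };
        let docs: HashMap<DocId, Vec<(TermOrd, u32)>> = per_doc
            .iter()
            .map(|(doc, freqs)| (*doc, freqs.iter().map(|(t, tf)| (ord_of(*t), *tf)).collect()))
            .collect();
        Self { terms, docs }
    }

    /// Resolves a document's ordinals back to (term, frequency) pairs.
    fn terms_of(&self, doc: DocId) -> Vec<(&str, u32)> {
        self.docs
            .get(&doc)
            .map(|v| v.iter().map(|(ord, tf)| (self.terms[*ord as usize].as_str(), *tf)).collect())
            .unwrap_or_default()
    }
}

/// Merges two segments: docs of `b` are shifted by `b_doc_offset`, and every
/// term ordinal is remapped into the merged dictionary.
fn merge(a: &SegmentForwardIndex, b: &SegmentForwardIndex, b_doc_offset: DocId) -> SegmentForwardIndex {
    let mut terms: Vec<String> = a.terms.iter().chain(b.terms.iter()).cloned().collect();
    terms.sort();
    terms.dedup();
    let remap = |seg: &SegmentForwardIndex, ord: TermOrd| {
        terms.binary_search(&seg.terms[ord as usize]).unwrap() as TermOrd
    };
    let mut docs = HashMap::new();
    for (seg, offset) in [(a, 0), (b, b_doc_offset)] {
        for (doc, entries) in &seg.docs {
            let remapped: Vec<(TermOrd, u32)> =
                entries.iter().map(|(ord, tf)| (remap(seg, *ord), *tf)).collect();
            docs.insert(doc + offset, remapped);
        }
    }
    SegmentForwardIndex { terms, docs }
}

fn main() {
    let seg_a = SegmentForwardIndex::build(&[(0, vec!["rust", "search", "rust"])]);
    let seg_b = SegmentForwardIndex::build(&[(0, vec!["index", "search"])]);
    let merged = merge(&seg_a, &seg_b, 1);
    println!("{:?}", merged.terms_of(0));
    println!("{:?}", merged.terms_of(1));
}
```

Storing ordinals rather than term bytes keeps each entry small, but it is also what forces the remapping step at merge time.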

Can you share any initial thoughts in how you might approach this? Even the very first things that come to your mind will likely greatly accelerate me if I am to try and extend Tantivy to support this kind of index.

Oct 04 '24 05:10 NickDarvey

Some notes as I poke around Tantivy.

Storing terms as a fastfield

I thought this might be implemented with fastfields (columnar storage) which "is designed for the fast random access of some document fields given a document id". ~~However, I can't see a way to actually read a fastfield for a given document id. Maybe I want something row-oriented instead?~~

Edit: Oh!

Perhaps as a minimal first pass, I could just mark my existing text field as a fast field and specify a tokenizer. Then, wrap a column reader like FacetReader does, to allow fetching by a document id.

https://github.com/quickwit-oss/tantivy/blob/2f5a269c70855afb983fa3fb5d299fa0001f713f/src/fastfield/writer.rs#L134-L146

This method would mean my text is getting tokenized twice though, and I'd be storing the whole term(?) rather than just a term ordinal. https://github.com/quickwit-oss/tantivy/pull/1325, implementation of fastfield for strings might be relevant here. (However, it looks like the codebase has changed a lot since this PR. For example, the postings writer no longer seems to pass an 'unordered_term_id' to the fastfield module.)
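For comparison, this is roughly what I mean by storing term ordinals rather than whole terms: a std-only sketch of a multi-valued, dictionary-encoded column with random access by doc id. It is not Tantivy's actual columnar layout or API; `TermOrdColumn`, `from_docs` and `terms` are hypothetical names, and the column carries its own dictionary here, separate from any inverted-index term dictionary.

```rust
/// Hypothetical multi-valued, dictionary-encoded column.
/// The ordinals of document `d` live in `ords[offsets[d]..offsets[d + 1]]`.
struct TermOrdColumn {
    /// Sorted unique terms; the column's own dictionary.
    dictionary: Vec<String>,
    offsets: Vec<usize>,
    ords: Vec<u32>,
}

impl TermOrdColumn {
    /// Builds the column from already-tokenized documents, where the index
    /// of each inner vector is the doc id. Token order is preserved.
    fn from_docs(docs: &[Vec<&str>]) -> Self {
        let mut dictionary: Vec<String> =
            docs.iter().flatten().map(|t| t.to_string()).collect();
        dictionary.sort();
        dictionary.dedup();

        let mut offsets = vec![0];
        let mut ords = Vec::new();
        for tokens in docs {
            for token in tokens {
                let ord = dictionary
                    .binary_search_by(|probe| probe.as_str().cmp(*token))
                    .unwrap();
                ords.push(ord as u32);
            }
            // Close this document's range.
            offsets.push(ords.len());
        }
        Self { dictionary, offsets, ords }
    }

    /// Random access by doc id: resolves the stored ordinals back to terms.
    fn terms(&self, doc: usize) -> impl Iterator<Item = &str> + '_ {
        self.ords[self.offsets[doc]..self.offsets[doc + 1]]
            .iter()
            .map(move |&ord| self.dictionary[ord as usize].as_str())
    }
}

fn main() {
    let column = TermOrdColumn::from_docs(&[vec!["quick", "brown", "quick"], vec!["fox"]]);
    for doc in 0..2 {
        let terms: Vec<&str> = column.terms(doc).collect();
        println!("doc {doc}: {terms:?}");
    }
}
```

Each occurrence costs one `u32` instead of the term's bytes, at the price of a dictionary lookup when reading the terms back.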

Getting the terms per document

The token stream for a document is processed into terms in PostingsWriter.index_text https://github.com/quickwit-oss/tantivy/blob/2f5a269c70855afb983fa3fb5d299fa0001f713f/src/postings/postings_writer.rs#L138-L155

PostingsWriters must implement a subscribe function to handle the doc and term. https://github.com/quickwit-oss/tantivy/blob/2f5a269c70855afb983fa3fb5d299fa0001f713f/src/postings/postings_writer.rs#L201-L221

SpecializedPostingsWriter<>, for example, instantiates and calls a recorder for each term. However, a recorder does not (currently) have access to the specific term. https://github.com/quickwit-oss/tantivy/blob/2f5a269c70855afb983fa3fb5d299fa0001f713f/src/postings/recorder.rs#L49-L85
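Since the term is visible where subscribe is called but not inside the recorder, one option would be to collect the per-document terms at the subscribe level. Below is a std-only sketch of that shape. The trait `TermSubscriber`, the struct `ForwardIndexWriter` and the free function `index_text` are hypothetical stand-ins that only mirror the doc/position/term flow described above; Tantivy's real signatures differ (they also take an indexing context, for instance), and the whitespace/lowercase tokenization is just a placeholder for a real tokenizer.

```rust
use std::collections::BTreeMap;

type DocId = u32;

/// Hypothetical hook that, unlike a recorder, is handed the term itself.
trait TermSubscriber {
    fn subscribe(&mut self, doc: DocId, position: u32, term: &str);
}

/// Collects term frequencies per document as terms stream past.
#[derive(Default)]
struct ForwardIndexWriter {
    per_doc: BTreeMap<DocId, BTreeMap<String, u32>>,
}

impl TermSubscriber for ForwardIndexWriter {
    fn subscribe(&mut self, doc: DocId, _position: u32, term: &str) {
        *self
            .per_doc
            .entry(doc)
            .or_default()
            .entry(term.to_string())
            .or_insert(0) += 1;
    }
}

/// Stand-in for the indexing loop: splits on whitespace, lowercases, and
/// forwards each token to the subscriber.
fn index_text(doc: DocId, text: &str, subscriber: &mut dyn TermSubscriber) {
    for (position, token) in text.split_whitespace().enumerate() {
        subscriber.subscribe(doc, position as u32, &token.to_lowercase());
    }
}

fn main() {
    let mut writer = ForwardIndexWriter::default();
    index_text(0, "Rust search rust", &mut writer);
    index_text(1, "inverted index", &mut writer);
    for (doc, freqs) in &writer.per_doc {
        println!("doc {doc}: {freqs:?}");
    }
}
```

A real implementation would presumably store term ids or ordinals rather than owned strings, and would have to serialize and merge this structure per segment.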

Exposing as an option

This could be exposed as another (or a different kind of) IndexRecordOption, which should work whether we implement a new recorder or need to implement a whole different PostingsWriter.

https://github.com/quickwit-oss/tantivy/blob/2f5a269c70855afb983fa3fb5d299fa0001f713f/src/postings/per_field_postings_writer.rs#L33-L46

Alternatively, it might be nice to enable this per document, for example so I can keep this kind of index for just the latest ~20% of documents. In that case, maybe this could be implemented as a new field type.

Oct 07 '24 04:10 NickDarvey

You could load the document from the docstore and tokenize the text to get the terms

Oct 07 '24 05:10 PSeitz

> You could load the document from the docstore and tokenize the text to get the terms

Ah, but I am not storing this (quite large) text field

Oct 07 '24 05:10 NickDarvey

The fast field (columnar) version should work too I think, but the dictionary is not shared between the inverted index and the columnar storage

Oct 07 '24 05:10 PSeitz

> The fast field (columnar) version should work too I think, but the dictionary is not shared between the inverted index and the columnar storage

Did it used to be shared? I see your contribution here is passing an UnorderedTermId to a column writer.

Edit: Maybe that's what this is about https://github.com/quickwit-oss/tantivy/issues/1705#issuecomment-1334716250

Oct 07 '24 05:10 NickDarvey

It can't be shared anymore since a different tokenizer can be defined now

Oct 07 '24 07:10 PSeitz
