Feature Request: Access to sparse vectors (TF-IDF or BM25) as generated by Vespa indexing

Open tsaltena opened this issue 3 years ago • 3 comments

Is your feature request related to a problem? Please describe. We opted for Vespa because of its ability to cope with sparse and dense vectors at the same time. In some of our scenarios we want to compare a group of documents to the rest of the data. To do that, we combine a number of stored documents into a composite vector. Therefore, we would like to be able to access the sparse vectors generated during Vespa indexing and do some computation on them before feeding them back into a closeness query.

Describe the solution you'd like We'd need two ways to interact:

  1. Get the raw vectors as part of the document summary in a query
  2. Use these raw vectors in the closeness scoring

Describe alternatives you've considered As a hack, we could use a dedicated tensor field to store generated sparse embeddings (or perhaps copy them over from the index?), but this feels like a waste of resources.
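
To make the intended computation concrete, here is a minimal Python sketch of combining a group of documents into a composite vector and comparing it to another document. It assumes each document's sparse vector is already available client-side as a plain term -> weight mapping. The function names `composite_vector` and `cosine` and the sample data are hypothetical and not part of Vespa's API; averaging is only one possible way to combine the vectors.

```python
from collections import Counter
from math import sqrt


def composite_vector(vectors):
    """Average a group of sparse term -> weight vectors into one vector."""
    if not vectors:
        return {}
    total = Counter()
    for vec in vectors:
        total.update(vec)  # adds the weights term by term
    n = len(vectors)
    return {term: weight / n for term, weight in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sparse vectors for a group of two documents.
group = [
    {"vespa": 2.0, "tensor": 1.0},
    {"vespa": 1.0, "rank": 3.0},
]
composite = composite_vector(group)

# Compare the composite against another (hypothetical) document vector.
score = cosine(composite, {"vespa": 1.0, "rank": 1.0})
```

In the request above, the composite would then be fed back into a closeness query rather than scored client-side as it is here.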

tsaltena avatar Jan 04 '22 13:01 tsaltena

I don't understand what you mean by "sparse vectors generated during Vespa indexing" - could you explain some more?

bratseth avatar Jan 04 '22 17:01 bratseth

By these sparse vectors I mean the actual document-term matrix rows per document. My assumption was that they would be stored somewhere as a basis for the BM25 ranking if a bm25 index is enabled?

tsaltena avatar Jan 05 '22 08:01 tsaltena

Right, so you'd like access to a sparse vector of term -> frequency for a field in a document. Yes, that's doable, although not something that's directly available - Vespa uses posting lists with one entry per occurrence to enable positional ranking.
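
As a rough illustration of the relationship described above, the following Python sketch collapses a per-occurrence list into a term -> frequency vector. The `(term, position)` tuple representation and the function name `term_frequencies` are assumptions made for this example only; they do not reflect Vespa's internal posting list format or any Vespa API.

```python
from collections import Counter


def term_frequencies(occurrences):
    """Collapse (term, position) occurrences into a term -> frequency map."""
    return dict(Counter(term for term, _position in occurrences))


# Hypothetical occurrences for one field of one document.
occurrences = [("vespa", 0), ("ranks", 1), ("vespa", 2)]
frequencies = term_frequencies(occurrences)
```

The positions are discarded in this step, which is why a term -> frequency vector carries less information than the positional data it would be derived from.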

Related, see the textSimilarity features in https://docs.vespa.ai/en/reference/rank-features.html, which give you a measure of document similarity that also uses positional information.

bratseth avatar Jan 05 '22 10:01 bratseth