
Ability to index (HNSW) multiple vector points per document

Open jobergum opened this issue 3 years ago • 12 comments

With the growing interest in ANN with Vespa there has also been increased demand for the ability to index multiple vectors per document. If any of the vectors assigned to a document is among the top-k closest neighbors, the document should be retrieved and exposed to the first-phase ranking function. This is similar to the behaviour of position type fields (array) in geo-search.

Current support

field vector type tensor<float>(x[128]) {
  indexing: attribute | index
}

Future support:

field vector type tensor<float>(s{}, x[128]) {
  indexing: attribute | index
}

Ranking features like closeness etc. return the value for the closest vector that caused the document to be retrieved. The document is then exposed to the first-phase ranking function, where users can do more advanced computations over all vectors if they want to.

Query syntax is unchanged, but we might also consider allowing N tensors to be passed with the query, where for each v in N we do ANN(document_, v) with the above semantics if the document has multiple vectors. This would simplify expressing end-to-end retrieval using ColBERT (https://arxiv.org/abs/2004.12832).
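As an illustration of the unchanged query syntax, here is a rough sketch of such a query from Python, assuming the existing nearestNeighbor operator; the field name, rank profile and endpoint are hypothetical, and the annotation syntax has varied between Vespa versions:

import requests

# Hypothetical 128-dimensional query vector; in practice this comes from an embedding model.
query_vector = [0.1] * 128

body = {
    # The document is retrieved if any of its indexed vectors is among the
    # (approximate) targetHits closest neighbors of the query vector.
    # Note: the annotation name/syntax has varied across Vespa versions.
    "yql": "select * from sources * where {targetHits: 100}nearestNeighbor(vector, q)",
    "ranking.features.query(q)": str(query_vector),  # dense tensor(x[128]) in short form
    "ranking.profile": "ann",                        # hypothetical profile using closeness(field, vector)
    "hits": 10,
}

# Assumes a Vespa container running locally on the default port.
response = requests.post("http://localhost:8080/search/", json=body)
print(response.json())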

jobergum avatar Dec 21 '20 08:12 jobergum

Excited to see this making its way to Vespa! (First author of the ColBERT paper.)

Happy to help with ideas/etc. about doing this. FYI our implementation is here: https://github.com/stanford-futuredata/ColBERT/tree/v0.2

okhat avatar Dec 21 '20 21:12 okhat

Thanks for reaching out @okhat and excellent work on the paper and also making the source code for training available. I've used the main branch for now and I'm able to train and reproduce your re-ranking metrics using cosine similarity. We reduced dimensionality to 32 to save space and memory bandwidth during evaluation (Vespa is currently cpu only).

Well done, it's not every paper and released code where it's easy to reproduce the reported results!

Metrics on top1000.dev file (as provided by MS Marco):

MRR@10 = 0.343441408559603
Recall@50 = 0.7450095510983765
Recall@200 = 0.7996657115568291
Recall@1000 = 0.8140401146131805
[Dec 21, 15:41:39] #> checkpoint['batch'] = 400000 

Given the rather bad Recall@1000 in the original top1000, the MRR level is very impressive.

We are able to represent your MaxSim operator using Vespa tensor functions (*1) and use it as a re-ranker. Adding support for indexing multiple document term representations for the same passage/document will make it easier to also use ColBERT for retrieval, though I think it can also be distilled into a single dense tensor representation with OK recall@k.

We use the following document schema for the MS Marco Passage ranking experiments (so far). The end result will be published as a sample application and probably as an entry on the MS Marco Passage Ranking leaderboard, where we will use multiple features in the re-ranking stage.

document passage {

    field text type string {
      indexing: summary | index
      index: enable-bm25
    }

    field id type long {
      indexing: summary | attribute
    }

    field dt type tensor<float>(dt{}, x[32]) {
      indexing: summary | attribute
    }

    field doc_t5_query type weightedset<string> {
      indexing: summary | index
      index: enable-bm25
    }

  }

Ranking profile:

rank-profile colbert {
    first-phase {
        expression: ...
    }
    second-phase {
        rerank-count: 1000
        expression {
            sum(
                reduce(
                    sum(
                        query(qt) * attribute(dt), x
                    ),
                    max, dt
                ),
                qt
            )
        }
    }
}

Here query(qt) is the run-time tensor holding the query term to embedding representation, defined as

<field name="ranking.features.query(qt)" type="tensor(qt{},x[32])"/>
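For reference, a rough NumPy equivalent of the MaxSim score the tensor expression above computes (an illustration only, not code Vespa runs):

import numpy as np

def max_sim(query_terms: np.ndarray, doc_terms: np.ndarray) -> float:
    """MaxSim as in the expression above: for each query term embedding (query(qt)),
    take the maximum dot product over all document term embeddings (attribute(dt)),
    then sum over the query terms."""
    # query_terms: (num_query_terms, 32), doc_terms: (num_doc_terms, 32)
    similarities = query_terms @ doc_terms.T      # sum over x -> (num_query_terms, num_doc_terms)
    return float(similarities.max(axis=1).sum())  # reduce max over dt, then sum over qt

# Toy example with random embeddings: 32 query terms, 80 document terms
rng = np.random.default_rng(0)
print(max_sim(rng.standard_normal((32, 32)), rng.standard_normal((80, 32))))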

The query tensor is computed by the exported BERT query model (ONNX format for serving). From a latency perspective the encoder is the overall bottleneck, as the rest can be scaled easily while the single non-batched pass through BERT cannot, so we might try a smaller BERT version (e.g. bert-medium) combined with quantisation.
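As an aside, exporting a BERT-style query encoder to ONNX could look roughly like the sketch below. This is only an illustration: the model name, the 32-dimensional projection layer and the fixed query length of 32 are assumptions mirroring the setup described here, not the exact code used.

import torch
from transformers import AutoModel, AutoTokenizer

class QueryEncoder(torch.nn.Module):
    """BERT encoder followed by a linear projection to 32 dimensions,
    giving one 32-dim embedding per query token (ColBERT-style)."""
    def __init__(self, name: str = "bert-base-uncased", dim: int = 32):
        super().__init__()
        self.bert = AutoModel.from_pretrained(name)
        self.proj = torch.nn.Linear(self.bert.config.hidden_size, dim, bias=False)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return torch.nn.functional.normalize(self.proj(hidden), dim=-1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = QueryEncoder().eval()
example = tokenizer("what is colbert", return_tensors="pt",
                    padding="max_length", max_length=32, truncation=True)

torch.onnx.export(
    encoder,
    (example["input_ids"], example["attention_mask"]),
    "query_encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["query_embeddings"],  # shape (batch, 32 terms, 32 dims)
    dynamic_axes={"input_ids": {0: "batch"}, "attention_mask": {0: "batch"}},
    opset_version=14,
)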

It's a very nice use case for us, as it allows us to demonstrate our mixed tensor support with advanced tensor functions and how to represent BERT models in Vespa.

(*1) https://docs.vespa.ai/documentation/reference/ranking-expressions.html#tensor-functions

jobergum avatar Dec 22 '20 09:12 jobergum

Very interesting---thanks @jobergum!

I see that you're using d=32. Another way to go about this is to use a larger dimension (say 128) but quantize the representations. This may perform better in terms of accuracy.

It's also interesting that you think (on CPU) the main bottleneck would be the query encoder. I think it could be pretty fast (tens of milliseconds), even with a modest few-core CPU.

There are some ONNX results in milliseconds here from this article. For bsize=1 and maxlen=32, they're doing inference in 59.84 ms on a commodity CPU.

okhat avatar Dec 22 '20 16:12 okhat

@okhat Yes, agree on dimensionality, but currently our tensor cell values are only float or double; we plan to add more tensor cell types with lower resolution (e.g. int8). Generally it seems the number of parameters/dimensions matters more than weight resolution, so 32 is a reasonable tradeoff given our current tensor type limitations.

> There are some ONNX results in milliseconds here from this article. For bsize=1 and maxlen=32, they're doing inference in 59.84 ms on a commodity CPU.

Yes, we will fully explore this (we did a similar accuracy-versus-speed analysis for the DPR model: https://blog.vespa.ai/from-research-to-production-scaling-a-state-of-the-art-machine-learning-system/).

Below 100 ms overall on a single node of modern hardware, including ranking and query encoding, is our goal, and that is fully reachable without much accuracy loss. I'll update this ticket when we have more on this, but it's surely a very exciting model you have proposed, and it fits perfectly with Vespa tensor functions and storage.

jobergum avatar Dec 22 '20 16:12 jobergum

With a bert-medium-uncased model (https://huggingface.co/google/bert_uncased_L-8_H-512_A-8), batch size 300, dim 32, and training for 120K iterations, the accuracy is pretty good (0.354 MRR@10 on MS Marco dev). Using just WAND(1000) BM25 as the retriever over text and re-ranking the top 1K using ColBERT:

MS Marco Passage Ranking Dev:

  • MRR@10 0.3540
  • Recall@10 0.6141
  • Recall@50 0.7781
  • Recall@100 0.8151
  • Recall@1000 0.8562

TREC DL 2019 Eval (43 queries) with deep graded judgements

  • NDCG@10 0.7039
  • MAP 0.4494

[Sample response screenshot]

jobergum avatar Jan 05 '21 18:01 jobergum

That's really awesome! The results with medium BERT and a small dimension are better than I expected. Also the latency is pretty great. Is this on CPU?

I'm wondering how things would look with end-to-end retrieval, especially with the deeper relevance judgments!

okhat avatar Jan 05 '21 19:01 okhat

Yes, this runs on a single node in our private Vespa cloud with 2 x Xeon Gold 2.60GHz and 512 GB memory. Pretty standard stuff. We'll explore the performance further, including the number of threads per search, and see if we can optimize it more. We can also cut down the latency of the query encoder by using a quantized version (int8), which brings down the cost of encoding the query but with an accuracy loss (from 0.35ish to 0.34ish).
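On the quantized query encoder: with ONNX Runtime, this kind of int8 weight quantization can be done with dynamic quantization. A minimal sketch (file names are placeholders, not the actual models used here):

from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize the exported query encoder weights to int8; activations stay float
# and are quantized dynamically at inference time.
quantize_dynamic(
    model_input="query_encoder.onnx",
    model_output="query_encoder-int8.onnx",
    weight_type=QuantType.QInt8,
)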

My conclusion is that it's perfectly practical to serve this model in production using Vespa, at a fraction of the cost (and latency) of a full interaction-based model, with almost the same accuracy.

jobergum avatar Jan 05 '21 19:01 jobergum

Not directly related to this ticket, but an update on the ColBERT realization on Vespa: https://github.com/vespa-engine/vespa/pull/16236/ adds an optimization to the tensor expression which implements the ColBERT MaxSim operator, and we now reach 2.2K QPS end to end with low latency on a single node.

End-to-end latency includes:

  • HTTPS end to end, until the result is fully consumed by the HTTP benchmarking client
  • BERT query tokenization to BERT token_ids
  • Encoding of the query through the BERT query encoder (ONNX and ONNX-RT)
  • Re-write of the dense d0[1],d1[32],d2[32] tensor to the sparse qt tensor qt{},x[32] which is used in the ranking expression (see the sketch after this list)
  • Sparse text retrieval using WAND over 9M passages and re-ranking using the ColBERT MaxSim operator
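The rewrite step in the list above, turning the dense encoder output into a mapped query tensor, happens inside a Vespa searcher; purely as an illustration of the data shapes involved, here is a Python sketch producing the JSON cells form of a tensor(qt{},x[32]):

import numpy as np

def to_mapped_tensor(query_embeddings: np.ndarray) -> dict:
    """Turn dense encoder output of shape (1, num_terms, 32) into the JSON
    'cells' form of a Vespa tensor(qt{},x[32]) query feature.
    (Illustration only; the actual rewrite runs inside a Vespa searcher.)"""
    terms = query_embeddings[0]  # drop the d0[1] batch dimension
    cells = [
        {"address": {"qt": str(t), "x": str(x)}, "value": float(v)}
        for t, row in enumerate(terms)
        for x, v in enumerate(row)
    ]
    return {"cells": cells}

# Toy example: 32 query terms x 32 dimensions -> 1024 cells
print(len(to_mapped_tensor(np.zeros((1, 32, 32)))["cells"]))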

Results:

Vespa version    Average latency (ms)    99p latency (ms)    QPS
7.347.28         92.89                   119.74              686
7.349.32         28.21                   43.90               2267

This is also using only 1 thread per search, so for use cases that are less throughput-sensitive one could reduce latency further at the cost of throughput.

We also did a submission to MS Marco Passage Ranking with it. [Leaderboard screenshot]

So thanks again @okhat for sharing the code to reproduce the training routine and the excellent paper. We will write more about this on our blog real soon.

jobergum avatar Jan 28 '21 12:01 jobergum

Thank you @jobergum and thanks to the team! This is actually much faster than I thought.

I'm wondering how fast you guys could make the end-to-end retrieval version of ColBERT too if you gave that one a shot! ;-) Especially with iterated training (gathering negatives on the training set based on the trained ColBERT model), this should have substantially higher accuracy (~37.5% MRR@10 on dev).

okhat avatar Jan 28 '21 14:01 okhat

So by iterated you mean creating a new triplet set over the train queries using the positives and a sample of the top negatives from the initial ColBERT model trained on the original bm25 triplet file?

We will publish the weights of this model on Hugging Face models and also the pre-computed passage tensors so people can reproduce. We will leave it to those interested to improve it and iterate over it; after all, we are not scientists, we just want to demonstrate how one can efficiently represent the excellent ColBERT model and have close to state-of-the-art ranking with great serving performance on CPU architectures. In this context we are rather happy with an MRR score of 0.35ish on dev and eval, as it's 100% better than plain BM25, and similarly on TREC DL 2019.

Being able to represent ColBERT on Vespa has IMHO many benefits, especially CPU friendliness, and Vespa also allows the final inference to be a combination of multiple ranking features. Later this year we will also add support for other tensor cell types so we can reduce the memory footprint.

jobergum avatar Feb 01 '21 12:02 jobergum

Discussed this with @jobergum. We concluded this is less important. Setting to milestone 'later'.

geirst avatar Sep 01 '21 11:09 geirst

In the context of the ColBERT multi-representation model this is less important, as there are alternatives that use a single dense representation for the retriever and ColBERT as the re-ranking step.

Now, with paged on-disk tensors plus binarization and unpacking, the memory and storage footprint allows storing 100M passages with a fixed length of 256 tokens on 16-64-600 nodes.
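As an illustration of the binarization/unpacking idea (plain NumPy, not Vespa code): each 32-dimensional float vector can be thresholded into bits and packed into 4 int8 values, then unpacked back to a 0/1 vector when needed for ranking.

import numpy as np

def binarize(vector: np.ndarray) -> np.ndarray:
    """Pack a float vector into int8 by thresholding at 0: 32 floats -> 4 bytes."""
    return np.packbits(vector > 0).astype(np.int8)

def unpack(packed: np.ndarray) -> np.ndarray:
    """Recover the 0/1 vector from the packed int8 representation."""
    return np.unpackbits(packed.astype(np.uint8)).astype(np.float32)

v = np.random.default_rng(0).standard_normal(32)
packed = binarize(v)       # 4 int8 values instead of 32 floats
restored = unpack(packed)  # 32 values in {0.0, 1.0}
assert restored.shape == (32,)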

jobergum avatar Sep 01 '21 12:09 jobergum

@geirst I think this has been completed now.

baldersheim avatar Sep 13 '23 20:09 baldersheim

Yes, this is part of Vespa 8.144.19. See blog post for details: https://blog.vespa.ai/semantic-search-with-multi-vector-indexing/

geirst avatar Sep 14 '23 08:09 geirst