perf: implement XTR for retrieving multivector

Open BubbleCal opened this issue 9 months ago • 3 comments

This PR introduces XTR, which scores documents without reading the original multivectors, so searching on a multivector column needs no extra IO.

When a single query vector's search misses a document, XTR uses the minimum similarity retrieved for that query vector as the document's estimated similarity.
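A minimal sketch of that scoring rule, for illustration only. It is not the code in this PR: the function name `xtr_scores`, the `(doc_id, similarity)` input layout, and the choice to sum (rather than average) over query vectors are all assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical helper, not the PR's implementation.
/// `hits[i]` holds the (doc_id, similarity) pairs retrieved for query vector `i`.
/// Returns an estimated score per candidate document, computed only from the
/// retrieved similarities (no access to the stored multivectors).
fn xtr_scores(hits: &[Vec<(u64, f32)>]) -> HashMap<u64, f32> {
    // For each query vector, keep the best retrieved similarity per document.
    let per_query: Vec<HashMap<u64, f32>> = hits
        .iter()
        .map(|query_hits| {
            let mut best: HashMap<u64, f32> = HashMap::new();
            for &(doc, sim) in query_hits {
                let entry = best.entry(doc).or_insert(f32::NEG_INFINITY);
                if sim > *entry {
                    *entry = sim;
                }
            }
            best
        })
        .collect();

    // Every document retrieved by at least one query vector is a candidate.
    let mut scores: HashMap<u64, f32> = HashMap::new();
    for best in &per_query {
        for &doc in best.keys() {
            scores.entry(doc).or_insert(0.0);
        }
    }

    for best in &per_query {
        // Minimum similarity retrieved for this query vector: the estimate
        // used for documents this query vector missed.
        let min_sim = best.values().copied().fold(f32::INFINITY, f32::min);
        // Assumption: a query vector with no hits at all contributes 0.
        let missed = if min_sim.is_finite() { min_sim } else { 0.0 };
        for (doc, score) in scores.iter_mut() {
            *score += best.get(doc).copied().unwrap_or(missed);
        }
    }
    scores
}
```

For example, with hits `[(1, 0.9), (2, 0.5)]` for the first query vector and `[(1, 0.8), (3, 0.4)]` for the second, document 2 is missed by the second query vector and is credited 0.4 there, giving 0.5 + 0.4; document 3 is missed by the first and is credited 0.5.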

BubbleCal avatar Feb 08 '25 06:02 BubbleCal

Codecov Report

Attention: Patch coverage is 85.31746% with 37 lines in your changes missing coverage. Please review.

Project coverage is 78.48%. Comparing base (33ae43b) to head (8d5a835).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/io/exec/knn.rs | 85.44% | 24 Missing and 7 partials :warning: |
| rust/lance/src/dataset/scanner.rs | 79.31% | 1 Missing and 5 partials :warning: |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3437      +/-   ##
==========================================
- Coverage   78.48%   78.48%   -0.01%     
==========================================
  Files         252      252              
  Lines       94011    94220     +209     
  Branches    94011    94220     +209     
==========================================
+ Hits        73783    73947     +164     
- Misses      17232    17279      +47     
+ Partials     2996     2994       -2     

| Flag | Coverage Δ |
|---|---|
| unittests | 78.48% <85.31%> (-0.01%) :arrow_down: |


codecov-commenter avatar Feb 08 '25 08:02 codecov-commenter

> Something seems off in the algorithm, with how missed_similarities is handled. Could you address my comment, and also maybe write a unit test that shows we get correct results out of this?

we have tests here https://github.com/lancedb/lance/pull/3437/files#diff-6de816b72e7c722316243c57df4f809ad34dc8581367c72335154dada48c40edL993

BubbleCal avatar Feb 11 '25 04:02 BubbleCal

> > Something seems off in the algorithm, with how missed_similarities is handled. Could you address my comment, and also maybe write a unit test that shows we get correct results out of this?
>
> we have tests here https://github.com/lancedb/lance/pull/3437/files#diff-6de816b72e7c722316243c57df4f809ad34dc8581367c72335154dada48c40edL993

I meant more to test that the XTR algorithm itself is working as expected. Part of why I'm having a hard time understanding this PR is that there are no tests showing the expected behavior of the algorithm.

wjones127 avatar Feb 18 '25 16:02 wjones127