De-dup raw vectors?
Description
Closes #14758
Demonstrating the proposal to de-duplicate raw vectors in Lucene! Note: Right now this is very crude, and only for demonstration purposes.
File Layout
Today, the .vec file is partitioned per-field, and looks like:
# field 1 begin:
(vector for field 1, document d1) # position x0
(vector for field 1, document d2)
(vector for field 1, document d3)
# field 1 end, field 2 begin:
(vector for field 2, document d1) # position x1
(vector for field 2, document d3)
# field 2 end, field 3 begin:
(vector for field 3, document d1) # position x2
(vector for field 3, document d2)
# field 3 end, and so on...
The .vem file contains per-field tuples to denote (position, length) of the corresponding vector "block":
# (field number, offset of vector "block", length of vector "block", ...)
# "..." represents other metadata, including dimension, ord -> doc mapping, etc.
(1, x0, x1 - x0, ...)
(2, x1, x2 - x1, ...)
# and so on...
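For concreteness, a minimal sketch of how a vector's file position is derived under this layout (the class and method names are illustrative, not Lucene's actual API):

```java
// Sketch (not actual Lucene code): under the current per-field layout,
// a vector's position is pure arithmetic within the field's block.
final class PerFieldLayout {
  // Returns the byte offset of the vector at ordinal `ord` for a field
  // whose vector "block" starts at `fieldBlockOffset` in the .vec file.
  static long vectorOffset(long fieldBlockOffset, int ord, int vectorByteSize) {
    return fieldBlockOffset + (long) ord * vectorByteSize;
  }
}
```

For example, with 768d float vectors (3072 bytes each), ordinal 2 of the field starting at offset 0 lives at byte 6144.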
Proposing to change the .vec file to be partitioned per-document instead, something like:
# document d1 begin:
(vector for field 1, document d1) # position x0
(vector for field 2, document d1) # position x1
(vector for field 3, document d1) # position x2
# document d1 end, document d2 begin:
(vector for field 1, document d2) # position x3
(vector for field 3, document d2) # position x4
# document d2 end, document d3 begin:
(vector for field 1, document d3) # position x5
(vector for field 2, document d3) # position x6
# document d3 end, and so on...
Correspondingly, the .vem file will contain per-field mappings of ord -> position of vector in the raw file:
# (field number, ord -> position mapping as array, ...)
# "..." represents other metadata, including dimension, ord -> doc mapping, etc. which is unchanged
(1, [x0, x3, x5], ...) # {ord 0 -> position x0, ord 1 -> position x3, ord 2 -> position x5}
(2, [x1, x4], ...) # {ord 0 -> position x1, ord 1 -> position x4}
(3, [x2, x6], ...) # {ord 0 -> position x2, ord 1 -> position x6}
# and so on...
In case of duplicate vectors within a document, we can simply "point" to a pre-existing vector, without writing another copy on disk!
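The de-dup itself can be sketched like this (a toy illustration, not the PR's actual writer -- `DedupVectorWriter` and its members are made-up names): keep a map from vector bytes to the position of the first copy, and only append vectors that have not been seen before:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of de-duplicating raw vectors at write time. Each distinct
// vector is appended once; duplicates just record the position of the
// pre-existing copy instead of writing another copy on disk.
final class DedupVectorWriter {
  private final Map<ByteBuffer, Long> seen = new HashMap<>(); // vector bytes -> file position
  private final List<byte[]> file = new ArrayList<>();        // stands in for the .vec file
  private long nextPosition = 0;

  // Returns the position of `vector` in the (simulated) raw file,
  // writing it only if it has not been seen before.
  long writeOrPoint(byte[] vector) {
    ByteBuffer key = ByteBuffer.wrap(vector); // equals/hashCode compare contents
    Long existing = seen.get(key);
    if (existing != null) {
      return existing; // duplicate: "point" at the pre-existing vector
    }
    long position = nextPosition;
    file.add(vector);
    nextPosition += vector.length;
    seen.put(key, position);
    return position;
  }

  long bytesWritten() {
    return nextPosition;
  }
}
```

Writing the same 4-byte vector twice returns position 0 both times, and only 8 bytes total are "written" for two distinct vectors.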
Earlier, the offset of the vector at ordinal `ord` in field `f` was calculated by seeking to `ord * vectorByteSize` inside the vector "block" of field `f`.
Now, we store an additional `ord -> position of vector` mapping that "points" to the vector in the raw vector file, and this mapping is also used during search.
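As a toy illustration of the new lookup, using the field numbers from the .vem example above (the class is a made-up name, and the byte positions assume 768d float vectors of 3072 bytes packed with no duplicates):

```java
import java.util.Map;

// Sketch (illustrative, not Lucene's actual API): under the new layout,
// a vector's position is read from the per-field ord -> position array
// loaded from the .vem file, instead of being derived arithmetically.
final class OrdToPositionLookup {
  // Mirrors the .vem example above: field 1 -> [x0, x3, x5],
  // field 2 -> [x1, x4], field 3 -> [x2, x6], with each vector 3072 bytes.
  static final Map<Integer, long[]> ORD_TO_POSITION = Map.of(
      1, new long[] {0L, 3 * 3072L, 5 * 3072L},
      2, new long[] {1 * 3072L, 4 * 3072L},
      3, new long[] {2 * 3072L, 6 * 3072L});

  // One extra array read per vector access, compared to the old arithmetic.
  static long position(int field, int ord) {
    return ORD_TO_POSITION.get(field)[ord];
  }
}
```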
Notes
- Right now this is a crude implementation, rough and inefficient, only for demonstration purposes!
- Basically a copy of `Lucene99FlatVectors*`, except that raw vectors are de-duped and written according to the new layout during `flush`
- Additionally, an `ord -> position of vector` mapping is stored and used during searching
- Does not support an index sort yet
- Does not support merging yet
  - This is mainly an API challenge, because vector merging is expected to be field-by-field -- but seems doable with a new `finishMerge` API that does the equivalent of `flush`?
Benchmark
In order to index everything in a single segment, I had to:
- Set number of indexing threads to 1
- Increase the writer's RAM buffer so it was large enough to hold all vectors
Made use of the `filterStrategy` option added in https://github.com/mikemccand/luceneutil/pull/468 -- which, with `index-time-filter`, creates and searches a separate KNN field containing a subset of documents
Cohere vectors, 768d, MAXIMUM_INNER_PRODUCT similarity
main
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) filterStrategy filterSelectivity vec_disk(MB) vec_RAM(MB) indexType
0.899 3.505 3.497 0.998 100000 100 50 32 250 no 6995 157.61 634.49 0.01 1 299.24 query-time-pre-filter 0.50 292.969 292.969 HNSW
0.915 3.266 3.258 0.998 100000 100 50 32 250 no 5742 0.00 Infinity 0.11 1 299.24 query-time-pre-filter 0.20 292.969 292.969 HNSW
0.903 2.351 2.343 0.997 100000 100 50 32 250 no 3657 0.00 Infinity 0.10 1 299.24 query-time-pre-filter 0.10 292.969 292.969 HNSW
1.000 0.357 0.349 0.978 100000 100 50 32 250 no 1039 0.00 Infinity 0.10 1 299.24 query-time-pre-filter 0.01 292.969 292.969 HNSW
0.498 1.185 1.178 0.994 100000 100 50 32 250 no 3986 0.00 Infinity 0.10 1 299.24 query-time-post-filter 0.50 292.969 292.969 HNSW
0.202 1.165 1.157 0.993 100000 100 50 32 250 no 3986 0.00 Infinity 0.10 1 299.24 query-time-post-filter 0.20 292.969 292.969 HNSW
0.100 1.230 1.222 0.993 100000 100 50 32 250 no 3986 0.00 Infinity 0.11 1 299.24 query-time-post-filter 0.10 292.969 292.969 HNSW
0.010 1.181 1.173 0.993 100000 100 50 32 250 no 3986 0.00 Infinity 0.10 1 299.24 query-time-post-filter 0.01 292.969 292.969 HNSW
0.940 1.065 1.057 0.992 100000 100 50 32 250 no 3939 258.24 387.23 0.01 1 449.10 index-time-filter 0.50 292.969 292.969 HNSW
0.961 0.913 0.906 0.992 100000 100 50 32 250 no 3568 196.23 509.62 0.01 1 359.42 index-time-filter 0.20 292.969 292.969 HNSW
0.976 0.679 0.671 0.988 100000 100 50 32 250 no 3172 167.67 596.42 0.01 1 329.38 index-time-filter 0.10 292.969 292.969 HNSW
1.000 0.160 0.152 0.950 100000 100 50 32 250 no 984 155.23 644.19 0.01 1 302.16 index-time-filter 0.01 292.969 292.969 HNSW
This PR
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) filterStrategy filterSelectivity vec_disk(MB) vec_RAM(MB) indexType
0.904 3.705 3.697 0.998 100000 100 50 32 250 no 6982 159.17 628.28 0.01 1 300.00 query-time-pre-filter 0.50 292.969 292.969 HNSW
0.917 3.455 3.448 0.998 100000 100 50 32 250 no 5786 0.00 Infinity 0.10 1 300.00 query-time-pre-filter 0.20 292.969 292.969 HNSW
0.900 2.437 2.430 0.997 100000 100 50 32 250 no 3553 0.00 Infinity 0.10 1 300.00 query-time-pre-filter 0.10 292.969 292.969 HNSW
1.000 0.366 0.358 0.978 100000 100 50 32 250 no 1023 0.00 Infinity 0.10 1 300.00 query-time-pre-filter 0.01 292.969 292.969 HNSW
0.506 1.263 1.255 0.994 100000 100 50 32 250 no 3986 0.00 Infinity 0.10 1 300.00 query-time-post-filter 0.50 292.969 292.969 HNSW
0.206 1.255 1.247 0.994 100000 100 50 32 250 no 3986 0.00 Infinity 0.10 1 300.00 query-time-post-filter 0.20 292.969 292.969 HNSW
0.100 1.257 1.249 0.994 100000 100 50 32 250 no 3986 0.00 Infinity 0.10 1 300.00 query-time-post-filter 0.10 292.969 292.969 HNSW
0.010 1.287 1.279 0.994 100000 100 50 32 250 no 3986 0.00 Infinity 0.10 1 300.00 query-time-post-filter 0.01 292.969 292.969 HNSW
0.940 1.138 1.130 0.993 100000 100 50 32 250 no 3927 249.90 400.17 0.01 1 303.57 index-time-filter 0.50 292.969 292.969 HNSW
0.963 1.001 0.993 0.992 100000 100 50 32 250 no 3598 188.64 530.11 0.01 1 301.41 index-time-filter 0.20 292.969 292.969 HNSW
0.977 0.791 0.783 0.990 100000 100 50 32 250 no 3159 168.33 594.09 0.01 1 300.65 index-time-filter 0.10 292.969 292.969 HNSW
1.000 0.209 0.201 0.962 100000 100 50 32 250 no 1023 155.47 643.22 0.01 1 300.05 index-time-filter 0.01 292.969 292.969 HNSW
Note the reduction in index_size(MB) (when index-time-filter is used) due to re-use of raw vectors!
There is a slight increase in latency with this PR, presumably because of the extra lookup of the vector position.