De-dup raw vectors?
Description
Closes #14758
Demonstrating the proposal to de-duplicate raw vectors in Lucene! Note: Right now this is very crude, and only for demonstration purposes.
File Layout
Today, the .vec file is partitioned per-field, and looks like:
# field 1 begin:
(vector for field 1, document d1) # position x0
(vector for field 1, document d2)
(vector for field 1, document d3)
# field 1 end, field 2 begin:
(vector for field 2, document d1) # position x1
(vector for field 2, document d3)
# field 2 end, field 3 begin:
(vector for field 3, document d1) # position x2
(vector for field 3, document d2)
# field 3 end, and so on...
The .vem file contains per-field tuples to denote (position, length) of the corresponding vector "block":
# (field number, offset of vector "block", length of vector "block", ...)
# "..." represents other metadata, including dimension, ord -> doc mapping, etc.
(1, x0, x1 - x0, ...)
(2, x1, x2 - x1, ...)
# and so on...
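For concreteness, a minimal sketch of how a vector's file position is derived under this layout (the class and method names are illustrative, not Lucene's actual API):

```java
// Sketch (not actual Lucene code): under the current per-field layout,
// a vector's position is pure arithmetic within the field's block.
final class PerFieldLayout {
  // Returns the byte offset of the vector at ordinal `ord` for a field
  // whose vector "block" starts at `fieldBlockOffset` in the .vec file.
  static long vectorOffset(long fieldBlockOffset, int ord, int vectorByteSize) {
    return fieldBlockOffset + (long) ord * vectorByteSize;
  }
}
```

For example, with 768d float vectors (3072 bytes each), ordinal 2 of the field starting at offset 0 lives at byte 6144.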
Proposing to change the .vec file to be partitioned per-document instead, something like:
# document d1 begin:
(vector for field 1, document d1) # position x0
(vector for field 2, document d1) # position x1
(vector for field 3, document d1) # position x2
# document d1 end, document d2 begin:
(vector for field 1, document d2) # position x3
(vector for field 3, document d2) # position x4
# document d2 end, document d3 begin:
(vector for field 1, document d3) # position x5
(vector for field 2, document d3) # position x6
# document d3 end, and so on...
Correspondingly, the .vem file will contain per-field mappings of ord -> position of vector in the raw file:
# (field number, ord -> position mapping as array, ...)
# "..." represents other metadata, including dimension, ord -> doc mapping, etc. which is unchanged
(1, [x0, x3, x5], ...) # {ord 0 -> position x0, ord 1 -> position x3, ord 2 -> position x5}
(2, [x1, x4], ...) # {ord 0 -> position x1, ord 1 -> position x4}
(3, [x2, x6], ...) # {ord 0 -> position x2, ord 1 -> position x6}
# and so on...
In case of duplicate vectors within a document, we can simply "point" to a pre-existing vector, without writing another copy on disk!
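The de-dup itself can be sketched like this (a toy illustration, not the PR's actual writer -- `DedupVectorWriter` and its members are made-up names): keep a map from vector bytes to the position of the first copy, and only append vectors that have not been seen before:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of de-duplicating raw vectors at write time. Each distinct
// vector is appended once; duplicates just record the position of the
// pre-existing copy instead of writing another copy on disk.
final class DedupVectorWriter {
  private final Map<ByteBuffer, Long> seen = new HashMap<>(); // vector bytes -> file position
  private final List<byte[]> file = new ArrayList<>();        // stands in for the .vec file
  private long nextPosition = 0;

  // Returns the position of `vector` in the (simulated) raw file,
  // writing it only if it has not been seen before.
  long writeOrPoint(byte[] vector) {
    ByteBuffer key = ByteBuffer.wrap(vector); // equals/hashCode compare contents
    Long existing = seen.get(key);
    if (existing != null) {
      return existing; // duplicate: "point" at the pre-existing vector
    }
    long position = nextPosition;
    file.add(vector);
    nextPosition += vector.length;
    seen.put(key, position);
    return position;
  }

  long bytesWritten() {
    return nextPosition;
  }
}
```

Writing the same 4-byte vector twice returns position 0 both times, and only 8 bytes total are "written" for two distinct vectors.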
Earlier, the offset of the vector at ordinal `ord` in field `f` was calculated by seeking to `ord * vectorByteSize` inside the vector "block" of field `f`.
Now, we store an additional `ord -> position of vector` mapping that "points" to the vector in the raw vector file, and this mapping is also used during search.
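As a toy illustration of the new lookup, using the field numbers from the .vem example above (the class is a made-up name, and the byte positions assume 768d float vectors of 3072 bytes packed with no duplicates):

```java
import java.util.Map;

// Sketch (illustrative, not Lucene's actual API): under the new layout,
// a vector's position is read from the per-field ord -> position array
// loaded from the .vem file, instead of being derived arithmetically.
final class OrdToPositionLookup {
  // Mirrors the .vem example above: field 1 -> [x0, x3, x5],
  // field 2 -> [x1, x4], field 3 -> [x2, x6], with each vector 3072 bytes.
  static final Map<Integer, long[]> ORD_TO_POSITION = Map.of(
      1, new long[] {0L, 3 * 3072L, 5 * 3072L},
      2, new long[] {1 * 3072L, 4 * 3072L},
      3, new long[] {2 * 3072L, 6 * 3072L});

  // One extra array read per vector access, compared to the old arithmetic.
  static long position(int field, int ord) {
    return ORD_TO_POSITION.get(field)[ord];
  }
}
```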
Notes
- Right now this is a crude implementation, rough and inefficient, only for demonstration purposes!
- Basically a copy of `Lucene99FlatVectors*`, except that raw vectors are de-duped and written according to the new layout during `flush`
- Additionally, an `ord -> position of vector` mapping is stored and used during searching
- Does not support an index sort yet
- Does not support merging yet
  - This is mainly an API challenge, because vector merging is expected to be field-by-field -- but seems doable with a new `finishMerge` API that does the equivalent of `flush`?
Benchmark
In order to index everything in a single segment, I had to:
- Set number of indexing threads to 1
- Increase the writer's RAM buffer so it was large enough to hold all vectors
Made use of the `filterStrategy` option added in https://github.com/mikemccand/luceneutil/pull/468 -- which, with `index-time-filter`, creates and searches a separate KNN field containing a subset of documents
Cohere vectors, 768d, MAXIMUM_INNER_PRODUCT similarity
main
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) filterStrategy filterSelectivity vec_disk(MB) vec_RAM(MB) indexType
0.899 3.505 3.497 0.998 100000 100 50 32 250 no 6995 157.61 634.49 0.01 1 299.24 query-time-pre-filter 0.50 292.969 292.969 HNSW
0.915 3.266 3.258 0.998 100000 100 50 32 250 no 5742 0.00 Infinity 0.11 1 299.24 query-time-pre-filter 0.20 292.969 292.969 HNSW
0.903 2.351 2.343 0.997 100000 100 50 32 250 no 3657 0.00 Infinity 0.10 1 299.24 query-time-pre-filter 0.10 292.969 292.969 HNSW
1.000 0.357 0.349 0.978 100000 100 50 32 250 no 1039 0.00 Infinity 0.10 1 299.24 query-time-pre-filter 0.01 292.969 292.969 HNSW
0.498 1.185 1.178 0.994 100000 100 50 32 250 no 3986 0.00 Infinity 0.10 1 299.24 query-time-post-filter 0.50 292.969 292.969 HNSW
0.202 1.165 1.157 0.993 100000 100 50 32 250 no 3986 0.00 Infinity 0.10 1 299.24 query-time-post-filter 0.20 292.969 292.969 HNSW
0.100 1.230 1.222 0.993 100000 100 50 32 250 no 3986 0.00 Infinity 0.11 1 299.24 query-time-post-filter 0.10 292.969 292.969 HNSW
0.010 1.181 1.173 0.993 100000 100 50 32 250 no 3986 0.00 Infinity 0.10 1 299.24 query-time-post-filter 0.01 292.969 292.969 HNSW
0.940 1.065 1.057 0.992 100000 100 50 32 250 no 3939 258.24 387.23 0.01 1 449.10 index-time-filter 0.50 292.969 292.969 HNSW
0.961 0.913 0.906 0.992 100000 100 50 32 250 no 3568 196.23 509.62 0.01 1 359.42 index-time-filter 0.20 292.969 292.969 HNSW
0.976 0.679 0.671 0.988 100000 100 50 32 250 no 3172 167.67 596.42 0.01 1 329.38 index-time-filter 0.10 292.969 292.969 HNSW
1.000 0.160 0.152 0.950 100000 100 50 32 250 no 984 155.23 644.19 0.01 1 302.16 index-time-filter 0.01 292.969 292.969 HNSW
This PR
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) filterStrategy filterSelectivity vec_disk(MB) vec_RAM(MB) indexType
0.904 3.705 3.697 0.998 100000 100 50 32 250 no 6982 159.17 628.28 0.01 1 300.00 query-time-pre-filter 0.50 292.969 292.969 HNSW
0.917 3.455 3.448 0.998 100000 100 50 32 250 no 5786 0.00 Infinity 0.10 1 300.00 query-time-pre-filter 0.20 292.969 292.969 HNSW
0.900 2.437 2.430 0.997 100000 100 50 32 250 no 3553 0.00 Infinity 0.10 1 300.00 query-time-pre-filter 0.10 292.969 292.969 HNSW
1.000 0.366 0.358 0.978 100000 100 50 32 250 no 1023 0.00 Infinity 0.10 1 300.00 query-time-pre-filter 0.01 292.969 292.969 HNSW
0.506 1.263 1.255 0.994 100000 100 50 32 250 no 3986 0.00 Infinity 0.10 1 300.00 query-time-post-filter 0.50 292.969 292.969 HNSW
0.206 1.255 1.247 0.994 100000 100 50 32 250 no 3986 0.00 Infinity 0.10 1 300.00 query-time-post-filter 0.20 292.969 292.969 HNSW
0.100 1.257 1.249 0.994 100000 100 50 32 250 no 3986 0.00 Infinity 0.10 1 300.00 query-time-post-filter 0.10 292.969 292.969 HNSW
0.010 1.287 1.279 0.994 100000 100 50 32 250 no 3986 0.00 Infinity 0.10 1 300.00 query-time-post-filter 0.01 292.969 292.969 HNSW
0.940 1.138 1.130 0.993 100000 100 50 32 250 no 3927 249.90 400.17 0.01 1 303.57 index-time-filter 0.50 292.969 292.969 HNSW
0.963 1.001 0.993 0.992 100000 100 50 32 250 no 3598 188.64 530.11 0.01 1 301.41 index-time-filter 0.20 292.969 292.969 HNSW
0.977 0.791 0.783 0.990 100000 100 50 32 250 no 3159 168.33 594.09 0.01 1 300.65 index-time-filter 0.10 292.969 292.969 HNSW
1.000 0.209 0.201 0.962 100000 100 50 32 250 no 1023 155.47 643.22 0.01 1 300.05 index-time-filter 0.01 292.969 292.969 HNSW
Note the reduction in index_size(MB) (when index-time-filter is used) due to re-use of raw vectors!
There is a slight increase in latency with this PR, presumably because of the extra lookup of the vector position.