Create a read-only index that drops index files not needed for searching

msokolov opened this issue 1 year ago · 7 comments

Description

Now that we have vector quantization, we face the possibility of writing an index that is 5 times bigger than is needed for searching. If the index is primarily vectors and they are quantized, we still store the full-precision vectors even though they may not be required at all for searching. In an architecture where indexes are written on one set of hosts and replicated to another set of hosts for searching, it is wasteful to copy all of the full-precision vectors to the searcher nodes. But Lucene doesn't have any way of distinguishing data needed only for indexing from data needed for searching. I wonder if we could create a "write read-only index" operation that would effectively clone the existing index, dropping any data required only for indexing, and mark the index as read-only so it could never be opened for writing. This might be useful in some way for version upgrades as well?

msokolov avatar Mar 05 '24 18:03 msokolov

This is a cool idea @msokolov. It is wasteful to lug around those float32 precision vectors out to the searchers in an NRT segment replication architecture. In practice, they would consume disk space on the searchers, and waste time copying them out, but since the OS would never load them at search time, their bytes would remain cold on disk and not put much pressure on OS free RAM? The OS would only cache the disk pages in the index that are actually needed at search time. It would be nice not to copy all that deadweight around ...

Probably the solution would have to be something like segment to segment? I.e. for each segment in the index, we would make a corresponding "read only" segment (stripped of the float32 vectors). This way, as the normal index changes (gets new flushed/merged segments), we could also incrementally/NRT maintain the shadow read-only index.

I wonder if there are other things in a Lucene index today that are needed only during indexing?

mikemccand avatar Mar 21 '24 12:03 mikemccand

Don't "deletes" require "writes"? Meaning, if enough docs get deleted in a segment, it will ultimately require to be merged, which then is a "write"?

For a quicker win in scalar quantization, it could be cheaper or easier to have a configured threshold beyond which we throw away the floating-point vectors, since we know the distribution of the values won't change significantly from that point on. Then on merges, if adjustments are required, we assume the cost of de-quantizing and re-quantizing. This can help relevancy more than you would expect, but obviously not as much as having access to the raw values.

benwtrent avatar Mar 21 '24 13:03 benwtrent

Don't "deletes" require "writes"? Meaning, if enough docs get deleted in a segment, it will ultimately require to be merged, which then is a "write"?

The goal of segment-replication is to completely separate searching from writing, so in that world, no merging is done by searchers -- it happens upstream in a writer/indexer, or perhaps in a dedicated merger (we have both setups going on).

msokolov avatar Mar 21 '24 16:03 msokolov

@msokolov AH, yes, for segment-replication, once the segments are built, I could see certain things being removed. I better understand the idea now.

benwtrent avatar Mar 21 '24 17:03 benwtrent

Coming back to this issue:

Summary

We tried to implement this idea in our closed-source implementation and got very good results: the vector index size dropped to one-fifth of the original (the 80% reduction mentioned in this issue).

Background:

At Amazon, we have a decoupled architecture where Lucene writers and searchers run on separate machines. Writers create the index and upload it to an S3 bucket, and searchers download the index from the same S3 bucket and use it.

Since full-precision float vectors are needed by writers for HNSW graph merging, we didn't modify anything there. However, once the index is read by searchers, the full-precision vectors just sit idle on disk, because only the quantized vectors take part in vector scoring for HNSW searches. So, in our closed-source implementation, we tried dropping the full-precision vectors while downloading the checkpoint from S3 (explained in detail below).

Implementation:

In our first, naive attempt, we simply tried to remove the full-precision vector files directly (the .vec and .vemf files), but this caused the codec to throw a CorruptIndexException.
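For context on why plain deletion fails: the per-segment metadata still lists those file names, so the reader tries to open (and checksum-verify) them at open time. A minimal sketch, assuming a local FSDirectory and a placeholder path, that prints each segment's file inventory (the .vec/.vemf entries show up here):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ListSegmentFiles {
  public static void main(String[] args) throws Exception {
    // placeholder path; point at a real index to see the per-segment file lists
    try (Directory dir = FSDirectory.open(Paths.get("/path/to/index"))) {
      SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
      for (SegmentCommitInfo sci : infos) {
        // every file listed here is expected to exist when the segment is opened
        System.out.println(sci.info.name + " -> " + sci.files());
      }
    }
  }
}
```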

Instead, here's what we did: While writing the index to the S3 bucket, Lucene writers uploaded additional empty full-precision vector files to S3.

Normally, this is what these files look like:

  • `.vec`:
    * HEADER (Codec Magic, Codec Name, Version, Segment ID, Segment-Suffix Length, Segment Suffix)
    * Offset: to align the start of the vector data (multiples of the float size, i.e. 4 bytes)
    * Data:
      * Vectors: *the actual float vectors*
      * -1 *(to mark the end of the data)*
    * FOOTER (Footer Magic, Checksum)
  • `.vemf`:
    * HEADER (Codec Magic, Codec Name, Version, Segment ID, Segment-Suffix Length, Segment Suffix)
    * Data:
      * Field Number
      * Vector Encoding ordinal (Byte/Float)
      * Similarity Function ordinal (Cosine, Dot-Product, etc.)
      * Start position of the vectors in the `.vec` file
      * Total length of the vector data
      * Vector Dimension
      * Total Vector count
      * Ord-to-Doc information
      * -1 to mark the end of the field infos
    * FOOTER (Footer Magic, Checksum)

For our optimization, we created empty files like these (a small sketch of writing such a header-and-footer-only file follows the list):

  • `.vec`:
    * HEADER (Codec Magic, Codec Name, Version, Segment ID, Segment-Suffix Length, Segment Suffix)
    * Offset: to align the start of the vector data (multiples of the float size, i.e. 4 bytes)
    * Data: <<No vector data>>
      * -1 *(to mark the end of the data)*
    * FOOTER (Footer Magic, Checksum)
  • `.vemf`:
    * HEADER (Codec Magic, Codec Name, Version, Segment ID, Segment-Suffix Length, Segment Suffix)
    * Data:
      * Field Number
      * Vector Encoding ordinal (Byte/Float) <<Same as original>>
      * Similarity Function ordinal (Cosine, Dot-Product, etc.) <<Same as original>>
      * Start position of the vectors in the `.vec` file <<Same as original>>
      * Total length of the vector data <<Zero in this case>>
      * Vector Dimension <<Same as original>>
      * Total Vector count <<Zero in this case>>
      * Ord-to-Doc information <<Information for 0 documents>>
      * -1 to mark the end of the field infos
    * FOOTER (Footer Magic, Checksum)
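For illustration, here is a rough sketch of how such a header-and-footer-only file could be produced with Lucene's CodecUtil. The file name, codec name, version, segment id, and suffix below are placeholders, not the real Lucene99 constants; in a real implementation the header must match exactly what the reader expects for that segment:

```java
import java.nio.file.Paths;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.util.StringHelper;

public class WriteEmptyVecFile {
  public static void main(String[] args) throws Exception {
    byte[] segmentId = StringHelper.randomId(); // in reality must match the segment's id
    String segmentSuffix = "";                  // in reality must match the format's suffix
    try (Directory dir = FSDirectory.open(Paths.get("/tmp/empty-vec-demo"));
        IndexOutput out = dir.createOutput("_0_Example_0.vec", IOContext.DEFAULT)) {
      // same header shape as the real file
      CodecUtil.writeIndexHeader(out, "ExampleVectorData", 0, segmentId, segmentSuffix);
      // no vector data written in between
      CodecUtil.writeFooter(out); // footer magic + checksum
    }
  }
}
```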

On Lucene searchers, while downloading the index from S3, we skipped downloading the original full-precision vector files and instead downloaded only the empty full-precision vector files. This saved us 80% of storage space on searchers and also reduced the downloading time from S3.

Next Steps:

Based on the above work, I wanted to know what the community thinks about this and whether we should implement it in the open-source Lucene repo as well. For example, we could add support for writing empty vector files directly from the codec and give users the flexibility to choose whether or not to keep the full-precision files.
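To make that a bit more concrete, here is one purely hypothetical shape such a codec option could take. None of these class or parameter names exist in Lucene today, the empty-file writing is only indicated by a comment, and a real design would also need SPI registration and handling of the extra files during merges:

```java
import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

// Hypothetical delegating format: when writeEmptyRawVectorFiles is set, it would
// additionally emit header/footer-only copies of the raw-vector files, so a
// downloader can later substitute them for the full-precision ones.
public class StrippableKnnVectorsFormat extends KnnVectorsFormat {
  private final KnnVectorsFormat delegate;
  private final boolean writeEmptyRawVectorFiles;

  public StrippableKnnVectorsFormat(KnnVectorsFormat delegate, boolean writeEmptyRawVectorFiles) {
    super("StrippableKnnVectorsFormat");
    this.delegate = delegate;
    this.writeEmptyRawVectorFiles = writeEmptyRawVectorFiles;
  }

  @Override
  public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
    // a real implementation would wrap the delegate's writer and, on close, also
    // write the empty .vec/.vemf-style files when writeEmptyRawVectorFiles is true
    return delegate.fieldsWriter(state);
  }

  @Override
  public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
    return delegate.fieldsReader(state);
  }
}
```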

Pulkitg64 avatar Oct 21 '25 20:10 Pulkitg64

A few questions:

  • what would the API look like for stripping full-precision vectors and keeping only quantized ones? I guess it could be a codec option?
  • what would happen if updates are made to an index configured with such a Codec? I think we'd ideally want to prevent that or somehow signal to the user that merges will fail. Ideally IndexWriter.add/updateDocument(s) would catch it, but I could also imagine a check in Directory (a read-only Directory that cannot write any data and would throw an exception -- a rough sketch is below), or the failure could even be deferred until the vector format attempts to merge and finds zero full-precision vectors and nonzero quantized vectors? Otherwise it will silently delete all vector data...
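Here is a minimal sketch of that read-only Directory idea, assuming a simple FilterDirectory wrapper (the class name is made up; which operations to block, and how the reader learns the index is read-only, are still open questions):

```java
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexOutput;

// Rejects all mutating operations so an accidental merge or write fails fast
// instead of silently producing segments that lost their full-precision vectors.
public class ReadOnlyDirectory extends FilterDirectory {
  public ReadOnlyDirectory(Directory in) {
    super(in);
  }

  @Override
  public IndexOutput createOutput(String name, IOContext context) throws IOException {
    throw new UnsupportedOperationException("read-only index: cannot create " + name);
  }

  @Override
  public IndexOutput createTempOutput(String prefix, String suffix, IOContext context)
      throws IOException {
    throw new UnsupportedOperationException("read-only index: cannot create temp output");
  }

  @Override
  public void deleteFile(String name) throws IOException {
    throw new UnsupportedOperationException("read-only index: cannot delete " + name);
  }

  @Override
  public void rename(String source, String dest) throws IOException {
    throw new UnsupportedOperationException("read-only index: cannot rename " + source);
  }
}
```

With a wrapper like this, DirectoryReader.open still works for searching, while anything that tries to write new files hits the exception.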

msokolov avatar Oct 24 '25 15:10 msokolov

This saved us 80% of storage space on searchers and also reduced the downloading time from S3.

Just to clarify -- this is 80% smaller storage for just the vectors portion of the index. We (Amazon customer-facing product search team -- I work with @Pulkitg64 and @msokolov) still have lots of other things in the Lucene index! The overall top-line reduction I think was ~10%, but that equates to petabytes of savings each day across the whole fleet! And as more and more vectors, with higher and higher dimensionality, are added to our indices, the vector portion becomes a larger share of the index, so these savings grow over time.

We can only do this because we fully rely on scalar quantized vectors for searching ... e.g. we never do 2nd phase reranking with full precision vectors. If a query wants to retrieve a vector as a return field, we re-hydrate the quantized form back to (lossy due to quantization round trip) full precision. And also because we have physical isolation of indexing and searching, using NRT segment replication (via S3 so we also get full, incremental backups on every commit point) to copy new segments on each commit from indexers to searchers.
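For readers unfamiliar with that re-hydration step, it is conceptually just the inverse of scalar quantization: map each stored byte back through the quantization parameters. A simplified sketch follows; the parameter names and exact formula are stand-ins, not Lucene's ScalarQuantizer internals:

```java
// Illustrative only: recover approximate floats from scalar-quantized bytes.
// "lowerQuantile" and "scale" stand in for whatever parameters the quantizer stored;
// the round trip is lossy, so the original full-precision values are only approximated.
static float[] rehydrate(byte[] quantized, float lowerQuantile, float scale) {
  float[] restored = new float[quantized.length];
  for (int i = 0; i < quantized.length; i++) {
    restored[i] = lowerQuantile + scale * quantized[i];
  }
  return restored;
}
```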

As @Pulkitg64 described, our current solution is kinda hackity/messy because the Codec (and therefore IndexWriter, SegmentInfos, etc.) doesn't know about these files.

For example, we could add support to write empty vector files directly from our codec and give users the flexibility to choose whether they want to use full-precision files or not.

+1, I like this approach.

Indexing would always write two sets of files (one with all the full-precision vectors indexed, and another with no vectors, which would be tiny files -- just header and footer). The Codec would own the inventory of these empty full-precision files (adding them to .files()). And then they would be deleted at the right time, since IndexFileDeleter would include them in ref counting (when that segment is merged away). And nothing would open them for reading, by default... they just "lurk"!

A few questions:

Yeah this is the tricky part!

The one thing that needs to know it will open only for reading is the KnnVectorsReader, but we have no clean way to pass index-open-time parameters to Codecs I think? But you're right, we would also need the index to somehow store that it is a read-only index because something along the way dropped the full precision files? Not sure how to do that ... tricky part! It makes the empty-file-writing part seem easy lol.

mikemccand avatar Dec 13 '25 02:12 mikemccand