Add V5VectorPostingsWriter and V5OnDiskOrdinalsMap that can map one-to-many postings without DiskBinarySearch
The Cohere wikipedia dataset contains duplicate vectors and I think we will see that in many other datsaets. Because we need to compact aggressively to reduce segment count, effectively none of our reads can benefit from the one-to-one optimization.
This PR adds a similar but more complex optimization for the one-to-many case, i.e. when multiple rows share the same vector but all rows have at least one vector. (The general case, when not all rows have a vector, remains unoptimized.)
The approach is:
- All ordinals that are only referenced by a single row are mapped to that row id
- Ordinals that are referenced by multiple rows are mapped to the lowest associated row id
- "Extra" rows that are not the lowest are read into memory; if we don't find an entry associated with the row we are looking up, then we can safely assume that the ordinal is the same as the row id
Implementation details:
- V5 is introduced to avoid adding more layers of difficult-to-reason-about backwards compatibility. Common code is in the OnDiskOrdinalsMap interface; V2 is otherwise unchanged
- V5 drops support for marked-deleted-but-not-actually-deleted vectors since we are using JVector's built-in deletes now that get removed from the index at flush time
-
V5VectorPostingsWriter.StructureandRemappedPostingsis introduced to represent the different types of mapping, andV5VectorPostingsWriter.remapPostingsencapsulates the remapping logic that was spread across multiple places - V5 supports different on-disk representations for different Structure so we don't have to try to reverse-engineer it from clues in a one-size-fits-all disk format
- Did some cleanup of
computeDeletedOrdinals(renamed topreFlushand changed semantics to match the caller's needs; added check of preFlush return value to the compaction path) - V5OnDiskOrdinalsMapTest is significantly more robust than its V2 counterpart and found a couple minor bugs in the existing code, e.g. the inclusivity of endRowId in the one-to-one path
PR is ready for review but marked draft until we have a public JVector version that CI can reference.
NB: The feature flag is V5OnDiskFormat.WRITE_V5_VECTOR_POSTINGS and requires the Version.DC. Both are left enabled on this branch but we should change that before merging.
// VSTODO is it worth optimizing this for one-to-many case as well?
private class GenericRowIdsView implements RowIdsView
testing says the answer is yes, this is one of the bottlenecks on the brute force path now
PR is ready for review but marked draft until we have a public JVector version that CI can reference.
marking as ready for review so we can get a CI run in
I think this is ready to commit.
Failure in VectorTypeTest.tracingTest was legit, fixed.
Failures in VectorSiftSmallTest, TinySegmentQueryWriteLifecycleTest, QueryWriteLifecycleTest do not reproduce locally.
The other failing tests are either definitely broken in main (cqlsh tests) or look flaky in the Butler results and do not have any clear connection to the changes here.
@michaeljmarshall can you take a final look at aeb84ce ?
Spoke too soon, investigating V5OnDiskOrdinalsMapTest failure in latest run.
Quality Gate passed
Issues
14 New issues
0 Accepted issues
Measures
0 Security Hotspots
85.2% Coverage on New Code
2.3% Duplication on New Code