cassandra icon indicating copy to clipboard operation
cassandra copied to clipboard

Add V5VectorPostingsWriter and V5OnDiskOrdinalsMap that can map one-to-many postings without DiskBinarySearch

Open jbellis opened this issue 1 year ago • 1 comments

The Cohere wikipedia dataset contains duplicate vectors and I think we will see that in many other datsaets. Because we need to compact aggressively to reduce segment count, effectively none of our reads can benefit from the one-to-one optimization.

This PR adds a similar but more complex optimization for the one-to-many case, i.e. when multiple rows share the same vector but all rows have at least one vector. (The general case, when not all rows have a vector, remains unoptimized.)

The approach is:

  1. All ordinals that are only referenced by a single row are mapped to that row id
  2. Ordinals that are referenced by multiple rows are mapped to the lowest associated row id
  3. "Extra" rows that are not the lowest are read into memory; if we don't find an entry associated with the row we are looking up, then we can safely assume that the ordinal is the same as the row id

Implementation details:

  • V5 is introduced to avoid adding more layers of difficult-to-reason-about backwards compatibility. Common code is in the OnDiskOrdinalsMap interface; V2 is otherwise unchanged
  • V5 drops support for marked-deleted-but-not-actually-deleted vectors since we are using JVector's built-in deletes now that get removed from the index at flush time
  • V5VectorPostingsWriter.Structure and RemappedPostings is introduced to represent the different types of mapping, and V5VectorPostingsWriter.remapPostings encapsulates the remapping logic that was spread across multiple places
  • V5 supports different on-disk representations for different Structure so we don't have to try to reverse-engineer it from clues in a one-size-fits-all disk format
  • Did some cleanup of computeDeletedOrdinals (renamed to preFlush and changed semantics to match the caller's needs; added check of preFlush return value to the compaction path)
  • V5OnDiskOrdinalsMapTest is significantly more robust than its V2 counterpart and found a couple minor bugs in the existing code, e.g. the inclusivity of endRowId in the one-to-one path

PR is ready for review but marked draft until we have a public JVector version that CI can reference.

NB: The feature flag is V5OnDiskFormat.WRITE_V5_VECTOR_POSTINGS and requires the Version.DC. Both are left enabled on this branch but we should change that before merging.

jbellis avatar Jul 31 '24 16:07 jbellis

    // VSTODO is it worth optimizing this for one-to-many case as well?
    private class GenericRowIdsView implements RowIdsView

testing says the answer is yes, this is one of the bottlenecks on the brute force path now image

jbellis avatar Jul 31 '24 23:07 jbellis

PR is ready for review but marked draft until we have a public JVector version that CI can reference.

marking as ready for review so we can get a CI run in

michaeljmarshall avatar Aug 12 '24 19:08 michaeljmarshall

I think this is ready to commit.

Failure in VectorTypeTest.tracingTest was legit, fixed.

Failures in VectorSiftSmallTest, TinySegmentQueryWriteLifecycleTest, QueryWriteLifecycleTest do not reproduce locally.

The other failing tests are either definitely broken in main (cqlsh tests) or look flaky in the Butler results and do not have any clear connection to the changes here.

@michaeljmarshall can you take a final look at aeb84ce ?

jbellis avatar Aug 16 '24 15:08 jbellis

Spoke too soon, investigating V5OnDiskOrdinalsMapTest failure in latest run.

jbellis avatar Aug 16 '24 16:08 jbellis