
[MEDI] Ensure that we batch chunks in VectorStoreWriter.WriteAsync

Open roji opened this issue 2 months ago • 1 comment

The current implementation of VectorStoreWriter.WriteAsync accepts a parameter of type IAsyncEnumerable<IngestionChunk<T>>; it then loops over the chunks and calls UpsertAsync separately for each one. This means we perform a database roundtrip per chunk, which is quite inefficient: almost all databases support some form of efficient batching that allows bulk ingestion of multiple records in one go.

One problem here is that WriteAsync accepts an IAsyncEnumerable, while the underlying VectorStoreCollection.UpsertAsync accepts a regular IEnumerable; since we shouldn't do sync-over-async, the IAsyncEnumerable must be materialized to an array/list at some point, and that array/list handed down to MEVD.
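For reference, an illustrative sketch of the current per-chunk shape (not the actual source; ToRecord and _collection are placeholder names):

```csharp
// Illustrative sketch of the current per-chunk behavior; ToRecord and
// _collection are placeholders, not the actual implementation.
public async Task WriteAsync(IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken cancellationToken = default)
{
    await foreach (var chunk in chunks.WithCancellation(cancellationToken))
    {
        // One database roundtrip per chunk.
        await _collection.UpsertAsync(ToRecord(chunk), cancellationToken);
    }
}
```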

Some thoughts:

  • #6970 discusses whether a single WriteAsync invocation accepts the chunks of a single document only. If we decide that it does, and documents are assumed to not be huge (not sure if we can assume the latter), then we can simply materialize the entire IAsyncEnumerable.
    • However, if we can make these assumptions, then we could consider changing the signature of WriteAsync itself (and probably of other upstream stages in the pipeline) to pass along a materialized IReadOnlyList or similar, rather than an IAsyncEnumerable. This would mean a bit more latency at upstream stages as all chunks of a document are collected before being passed along, but it would codify/simplify things.
    • Note that latency in data ingestion is probably not super high-priority: these ingestion pipelines typically work "offline" in any case, and a bit more delay probably wouldn't be crucial.
  • Otherwise, we could implement some sort of partial batching inside VectorStoreWriter, e.g. materialize up to X chunks, upsert, materialize the next X chunks, etc. (X could have a default and be configurable); see the sketch after this list.
    • This could in theory be problematic if we hang waiting for the Nth chunk, holding up the batch. In practice, I don't think this is a big problem (again, I don't see latency as being paramount to this particular pipeline).
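To make the partial-batching option concrete, here's a minimal sketch assuming the IEnumerable overload of VectorStoreCollection.UpsertAsync mentioned above; batchSize, TRecord, ToRecord and _collection are placeholder names:

```csharp
// Minimal partial-batching sketch; names are illustrative, and the
// IEnumerable overload of UpsertAsync is assumed to do a single roundtrip.
public async Task WriteAsync(IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken cancellationToken = default)
{
    const int batchSize = 100; // could have a default and be configurable
    var buffer = new List<TRecord>(batchSize);

    await foreach (var chunk in chunks.WithCancellation(cancellationToken))
    {
        buffer.Add(ToRecord(chunk));

        if (buffer.Count == batchSize)
        {
            // One roundtrip for the whole batch.
            await _collection.UpsertAsync(buffer, cancellationToken);
            buffer.Clear();
        }
    }

    // Flush any remaining chunks.
    if (buffer.Count > 0)
    {
        await _collection.UpsertAsync(buffer, cancellationToken);
    }
}
```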

This is not really blocking for GA (we can always improve later), but it would be leaving a lot of perf on the table to not do it.

roji avatar Oct 27 '25 17:10 roji

On top of that, we should perform a single batch embedding generation:

https://github.com/microsoft/semantic-kernel/blob/96c470185d9cfe9a07320fd7bfeb3fe511e5821a/dotnet/src/VectorData/SqlServer/SqlServerCollection.cs#L435-L436

Sample: https://github.com/adamsitnik/dataingestion/blob/5e46702aac10c2d535989e88e10986880150fe8e/src/Microsoft.Extensions.DataIngestion/Chunkers/SemanticSimilarityChunker.cs#L79
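For the embedding side, a rough sketch of what a single batched call could look like, assuming an IEmbeddingGenerator<string, Embedding<float>> from Microsoft.Extensions.AI (the method and parameter names are illustrative):

```csharp
using Microsoft.Extensions.AI;

// Illustrative: generate embeddings for all chunk texts in a single call
// rather than one call per chunk; generator and chunkTexts are placeholders.
async Task<GeneratedEmbeddings<Embedding<float>>> EmbedAllAsync(
    IEmbeddingGenerator<string, Embedding<float>> generator,
    IReadOnlyList<string> chunkTexts,
    CancellationToken cancellationToken)
{
    return await generator.GenerateAsync(chunkTexts, cancellationToken: cancellationToken);
}
```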

adamsitnik avatar Oct 28 '25 16:10 adamsitnik