
[MEDI] Clarify the IngestionChunkWriter.WriteAsync contract around documents

Open • roji opened this issue 2 months ago • 1 comment

VectorStoreWriter.WriteAsync currently assumes that all chunks in a single WriteAsync invocation belong to the same document (preExistingKeys is initialized only once, for the first chunk). This assumption should either be encoded more strongly in the documentation and API, or possibly revisited.

If we want to keep this behavior, where IngestionChunkWriter.WriteAsync is called once for all the chunks of a single document, we should probably:

  • Document it as such
  • Consider validating it (remember the document ID of the first chunk, throw if any subsequent chunk in the loop has a different one)
  • Consider renaming the API from WriteAsync to something like WriteDocumentAsync, or WriteDocumentChunksAsync
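The validation idea in the second bullet can be sketched in a language-neutral way. This is only an illustration in Python, not the MEDI implementation: `Chunk`, `SingleDocumentWriter`, `write_async`, and `as_stream` are hypothetical names, and the list stands in for a real store.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterable, Iterable, List, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for an ingestion chunk (not the real MEDI type)."""
    document_id: str
    content: str


class SingleDocumentWriter:
    """Hypothetical writer that accepts the chunks of exactly one document per call."""

    def __init__(self) -> None:
        self.written: List[Chunk] = []  # stand-in for the real store

    async def write_async(self, chunks: AsyncIterable[Chunk]) -> None:
        first_document_id: Optional[str] = None
        async for chunk in chunks:
            if first_document_id is None:
                # Remember the document ID of the first chunk.
                first_document_id = chunk.document_id
            elif chunk.document_id != first_document_id:
                # Any later chunk from another document violates the contract.
                raise ValueError(
                    f"Expected chunks of document '{first_document_id}', "
                    f"got a chunk of document '{chunk.document_id}'."
                )
            self.written.append(chunk)


async def as_stream(chunks: Iterable[Chunk]) -> AsyncIterable[Chunk]:
    """Helper that turns a plain iterable into an async stream of chunks."""
    for chunk in chunks:
        yield chunk
```

Note that in this sketch the chunks that precede the offending one have already been written by the time the error is raised; whether that is acceptable, or whether the check should run before any write, is a separate design decision.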

Another option is to relax this, and allow having chunks from multiple documents in the same WriteAsync invocation; whether this makes sense depends on the larger architecture of an MEDI pipeline. Allowing this would mean revisiting how (and possibly when) we delete the previous chunks of a document that's being newly ingested (overwritten).
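One possible shape for the relaxed contract, purely as an illustration: track the pre-existing keys per document ID instead of once per call, and delete them after the new chunks are written. The Python sketch below uses an in-memory dictionary as the store; `Chunk`, `MultiDocumentWriter`, `write_async`, and `as_stream` are hypothetical names, and the choice to delete at the end of the call is an assumption, not the current VectorStoreWriter behavior.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterable, Dict, Iterable, Set


@dataclass
class Chunk:
    """Hypothetical stand-in for an ingestion chunk (not the real MEDI type)."""
    document_id: str
    content: str


class MultiDocumentWriter:
    """Hypothetical writer that accepts chunks of several documents in one call."""

    def __init__(self) -> None:
        self.store: Dict[int, Chunk] = {}  # in-memory stand-in: key -> chunk
        self._next_key = 0

    def _keys_for(self, document_id: str) -> Set[int]:
        return {k for k, c in self.store.items() if c.document_id == document_id}

    async def write_async(self, chunks: AsyncIterable[Chunk]) -> None:
        # Pre-existing keys, captured the first time each document is seen.
        stale_keys: Dict[str, Set[int]] = {}
        async for chunk in chunks:
            if chunk.document_id not in stale_keys:
                stale_keys[chunk.document_id] = self._keys_for(chunk.document_id)
            self.store[self._next_key] = chunk
            self._next_key += 1
        # Delete the previous chunks only after all new ones are written.
        for keys in stale_keys.values():
            for key in keys:
                del self.store[key]


async def as_stream(chunks: Iterable[Chunk]) -> AsyncIterable[Chunk]:
    """Helper that turns a plain iterable into an async stream of chunks."""
    for chunk in chunks:
        yield chunk
```

With this approach the chunks of different documents may be interleaved in the stream, at the cost of holding one key set per document in memory for the duration of the call.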

I think we should figure this out before GA'ing, as a change here would be breaking.

roji, Oct 27 '25 16:10

The writer is consumed by a pipeline, which processes each document individually.

I am in favor of documenting it and enforcing it via runtime checks.

adamsitnik, Oct 28 '25 16:10