
[MEDI] Figure out the metadata story for VectorStoreWriter

Open roji opened this issue 2 months ago • 3 comments

The current implementation of VectorStoreWriter lazily constructs the metadata based on the first chunk it's given; this has the advantage of making VectorStoreWriter very easy to use - unlike the underlying VectorStoreCollection, the user doesn't need to explicitly pre-configure the record schema, etc. Note that the chunk metadata is only one part of the data getting saved (the variable/non-predefined part) - there's also the static/fixed part of the schema (key, content, embedding, context, document id).

However, while this approach is indeed very usable, it breaks down if the metadata varies across chunks: the first chunk determines the VectorStoreCollection's schema, and a subsequent chunk with a different schema would cause a failure. Note that the concern here isn't necessarily wildly different metadata; it's more about cases where e.g. the first chunk omits a metadata field because its value is null (but subsequent chunks don't).

Some options:

  1. Require users to provide the metadata in advance. This could be done simply by having them provide a fully-baked VectorStoreCollection (see #6967), or some other API which would accept the metadata schema in some form (e.g. Dictionary<string, Type>).
  2. Somehow define the metadata schema more strongly upstream in the MEDI pipeline, flowing it from there to VectorStoreWriter. This would mainly make sense if we can think of other places where knowing the metadata schema ahead of time could be useful.
  3. Simply maintain the current story, and document as part of the contract that all metadata dictionaries must always contain the same keys and value types. We can also consider validating this when ingesting chunks (so that if a later chunk arrives with different metadata, we throw).

I think it's reasonable to go with (3) (but I'd consider validating - it seems to be very little work and add very little perf overhead). We can always optionally allow (1) later by implementing #6967 if there's a need.
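
To make option (1) concrete, it could look something like the sketch below. Note that this is purely illustrative - the constructor overload and the metadataSchema parameter shown here don't exist today; the point is just that the user declares every metadata field (and its type) up front, so a first chunk with a null field no longer shapes the schema.

```csharp
// Hypothetical API shape for option (1): the metadata schema is provided in
// advance as a Dictionary<string, Type>, rather than inferred from the first chunk.
var metadataSchema = new Dictionary<string, Type>
{
    ["pageNumber"] = typeof(int),
    ["sourceUri"] = typeof(string),
    ["author"] = typeof(string), // declared even if it happens to be null in the first chunk
};

// Illustrative overload - not an existing VectorStoreWriter constructor.
var writer = new VectorStoreWriter(vectorStore, collectionName, metadataSchema);
```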

roji avatar Oct 27 '25 16:10 roji

I've been thinking about this some more today, and a conversation with @westey-m also made me realize some things.

With MEDI we're obviously concentrating on the write side. But we need to keep a holistic wider look: for people using MEDI to ingest data, what is the story for querying that data?

If we expect them to create an MEVD VectorStoreCollection directly, then that VectorStoreCollection would have to be explicitly configured with the complete metadata, corresponding to what has been ingested. At that point, the frictionless, quick & easy approach we have in MEDI - where metadata is automatically inferred from the first chunk - becomes much less valuable, since the user has to set up a VectorStoreCollection in any case for the read side. So I'd expect users to have code that creates their VectorStoreCollection (complete with model/schema) and then to use that same collection for both the write side (passing it to VectorStoreWriter, #6967) and for the read side.
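
A rough sketch of that shared-collection setup, assuming the #6967 writer overload exists; the MEVD type and method names below follow recent previews of Microsoft.Extensions.VectorData and may differ across versions:

```csharp
// One collection definition carries the full schema: the fixed fields plus
// every metadata field the ingestion pipeline will produce.
var definition = new VectorStoreCollectionDefinition
{
    Properties =
    [
        new VectorStoreKeyProperty("key", typeof(string)),
        new VectorStoreDataProperty("content", typeof(string)),
        new VectorStoreDataProperty("documentid", typeof(string)),
        new VectorStoreDataProperty("pageNumber", typeof(int)), // metadata field, known up front
        new VectorStoreVectorProperty("embedding", typeof(ReadOnlyMemory<float>), dimensions: 1536),
    ]
};

var collection = vectorStore.GetDynamicCollection("chunks", definition);

// Write side: hand the pre-configured collection to the writer (the #6967 shape).
// Read side: a separate app builds the same collection and queries it directly,
// without referencing MEDI at all.
```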

We could also consider a higher-level abstraction (really a bit of sugar) over VectorStoreCollection for the read side - that would e.g. help with the static schema that VectorStoreWriter imposes (@westey-m is working/thinking about something like this). Here also, I think there's no way to infer the metadata - it seems that the user would have to specify it in advance.

To summarize... While the idea of inferring metadata schema is nice and works well within the write-only context, I'm not sure it works when looking at things more holistically...

roji avatar Oct 28 '25 15:10 roji

The schema consists of:

  • the static/fixed part: content, embedding, context, document id and key.
  • the variable part, which depends on the enrichers defined by the user (I don't expect the majority of users to always define some).

Currently the enrichers and the writer do not know about each other; they are just part of the pipeline.

We could extend the ChunkProcessor abstract class with a property that provides a collection of (metadata field name, type) pairs, and then provide this information to the WriteAsync method.
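
Something along these lines - the MetadataSchema property and the enricher below are hypothetical, just to show the shape (ChunkProcessor itself exists in MEDI):

```csharp
// Hypothetical extension: each processor declares the metadata fields it adds.
public abstract class ChunkProcessor
{
    // Empty by default; enrichers override to announce their fields and types.
    public virtual IReadOnlyDictionary<string, Type> MetadataSchema { get; } =
        new Dictionary<string, Type>();
}

// Illustrative enricher that stamps each chunk with its page number.
public sealed class PageNumberEnricher : ChunkProcessor
{
    public override IReadOnlyDictionary<string, Type> MetadataSchema { get; } =
        new Dictionary<string, Type> { ["pageNumber"] = typeof(int) };
}

// The pipeline could then union the MetadataSchema of all registered enrichers
// and pass the result to the writer, instead of inferring it from the first chunk.
```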

But as you have noticed, this would solve only part of the problem: ingestion.

When it comes to reading, the writer is currently exposing the VectorStoreCollection:

https://github.com/dotnet/extensions/blob/7e48a612613d7e249962cdfdefdfea696c46ffd0/src/Libraries/Microsoft.Extensions.DataIngestion/Writers/VectorStoreWriter.cs#L54-L63

But in order to use it, the user needs to know the names of the fields. Example:

Dictionary<string, object?> record = await writer.VectorStoreCollection
    .GetAsync(filter: record => (string)record["documentid"]! == documentId, top: 1)
    .SingleAsync();

Which is far from perfect. I am open to any suggestions.

adamsitnik avatar Oct 28 '25 17:10 adamsitnik

> When it comes to reading, the writer is currently exposing the VectorStoreCollection:

Yeah, I'm not sure that's a great story... Even setting aside the field names you mention, it means that on the read side - which is typically a completely separate application (e.g. the website) - the program would instantiate a VectorStoreWriter only in order to extract the VectorStoreCollection from it, to be able to query. I'd expect the read side to not even reference MEDI.

Ultimately, the only way to really make this work in a good, bullet-proof way is to accept a baked VectorStoreCollection that closes over the schema; this way, the user can use the same VectorStoreCollection on both the read and write sides. I'd love to help the user to build the schema (record definition), but once again I'm not sure we want to force the user to take a dependency on MEDI on the read side, so we may be limited here.

Let's think about this a bit more, we'll come up with something.

roji avatar Oct 28 '25 17:10 roji