
Allow adding metadata for chunks of files

Open drale2k opened this issue 2 years ago • 19 comments

Currently only add_texts takes a 'metadata' argument but add_data does not. Since add_data takes an array of files, it would be clunky to extend it to accept metadata directly. Adding metadata needs to happen at the chunk level.

The use case I have for this is adding the page number a chunk was found on and referencing it as the source of the information.

To work around it, I am currently reading and chunking files manually and then calling add_texts to supply the metadata. It's not too difficult, but it would be nice if this were easier.
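
For illustration, here's roughly what that workaround looks like. The Pinecone constructor arguments, the PDF library, the shape of baran's chunks, and the exact add_texts keyword are assumptions from memory, not guaranteed to match the current API:

```ruby
# Rough sketch of the manual workaround: read the file, split it per page,
# and pass page-level metadata through add_texts yourself.
require "langchain"
require "pdf-reader" # assumed PDF library for this example
require "baran"

client = Langchain::Vectorsearch::Pinecone.new(
  environment: ENV["PINECONE_ENVIRONMENT"],
  api_key: ENV["PINECONE_API_KEY"],
  index_name: "documents",
  llm: Langchain::LLM::OpenAI.new(api_key: ENV["OPENAI_API_KEY"])
)

splitter = Baran::RecursiveCharacterTextSplitter.new(chunk_size: 1000, chunk_overlap: 100)

PDF::Reader.new("manual.pdf").pages.each_with_index do |page, index|
  splitter.chunks(page.text).each do |chunk|
    # Each chunk is stored with the page it came from, so search results
    # can cite their source page.
    client.add_texts(
      texts: [chunk[:text]],
      metadata: { source: "manual.pdf", page: index + 1 } # assumed keyword
    )
  end
end
```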

drale2k avatar Sep 18 '23 00:09 drale2k

@drale2k Do you think this functionality should go into https://github.com/moekidev/baran, which is the gem we're using for chunking?

andreibondarev avatar Sep 18 '23 15:09 andreibondarev

Good question. Baran is a text splitter built specifically for LLMs, but even if baran were to accept metadata, langchainrb would still need to take it as input. You still want people to interact with the langchainrb APIs and not baran directly, right?

drale2k avatar Sep 20 '23 15:09 drale2k

@drale2k Correct, I'm just saying that those changes would need to happen in the baran gem itself first and then Langchain.rb would make the corresponding changes to accept metadata. I think instead of returning the plain chunks array, baran should return a different data structure that would hold all that metadata as well. Do you want to suggest those changes to @moekidev?
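
Just to make the idea concrete, one possible shape for that richer return value could be something like this (purely illustrative, not baran's actual API):

```ruby
# Illustrative only: chunks carrying their own metadata instead of bare strings.
Chunk = Struct.new(:text, :cursor, :metadata, keyword_init: true)

chunks = [
  Chunk.new(text: "First section...",  cursor: 0,   metadata: { page: 1 }),
  Chunk.new(text: "Second section...", cursor: 512, metadata: { page: 2 })
]

# add_texts could then forward chunk.metadata to the vector DB alongside
# the embedding of chunk.text.
```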

andreibondarev avatar Sep 20 '23 15:09 andreibondarev

https://github.com/moekidev/baran/issues/7#issuecomment-1734999638

moeki0 avatar Sep 26 '23 07:09 moeki0

We've released it! https://github.com/moekidev/baran/releases/tag/v0.1.9

moeki0 avatar Sep 26 '23 23:09 moeki0

@drale2k Which vectorsearch DB are you using btw? And what kind of files are you looking to upload?

andreibondarev avatar Sep 28 '23 18:09 andreibondarev

Currently mostly Pinecone, but I have been looking into open-source ones as well. Mostly PDFs and MS Office docx and ppt files. Starting to look into audio transcriptions as well, using https://github.com/guillaumekln/faster-whisper

drale2k avatar Sep 28 '23 20:09 drale2k

This would help support the ability to add metadata such as source document names or source URLs for the text.

I can see this being useful in add_data by optionally being able to pass an array of objects instead of just string paths, checking the class of the "path" object before passing it to the chunker, so that an object like

{ path: 'string/path/to/file', metadata: { url: "https://some.location.com/some-page-name", path: 'string/path/to/file', style: "blues" } }

could be sent to the chunker. Then, when we ask the vectorsearch database for similarities, we should also get the metadata back to use for source links.
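
Something along these lines, as a hypothetical sketch (the loader call, the chunk objects responding to text, and the metadata keyword are all assumptions; today only some adapters accept metadata at all):

```ruby
# Hypothetical: add_data accepting either plain paths or { path:, metadata: } hashes.
def add_data(paths:, options: {})
  Array(paths).each do |entry|
    path, metadata = entry.is_a?(Hash) ? [entry[:path], entry[:metadata]] : [entry, {}]

    # Load and chunk the file as before; every chunk from this file inherits
    # the file-level metadata, so similarity results can link back to the source.
    chunks = Langchain::Loader.new(path, options)&.load&.chunks

    add_texts(
      texts: chunks.map { |chunk| chunk.text },
      metadata: metadata # assumed keyword argument
    )
  end
end
```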

jjimenez avatar Mar 22 '24 22:03 jjimenez

I'm still reading the code... It looks like Langchain::Loader will actually take a URL! That is nice. I'll have to give that a try. It would be nice if the URL were passed into the vectorsearch database as metadata directly.

I'm wishing out loud and should definitely consider making a pull request.

Thanks for making this a lot easier!

jjimenez avatar Mar 22 '24 22:03 jjimenez

@jjimenez Take a look at this draft branch I'm working on: https://github.com/andreibondarev/langchainrb/pull/538/files. The rest of the vectorsearch DBs need to be fixed to accept the metadatas: param.

andreibondarev avatar Mar 23 '24 00:03 andreibondarev

Any news on this feature?

pedroresende avatar Apr 22 '24 09:04 pedroresende

I'm interested in this feature, particularly for pgvector. I'm noticing that the different vectorsearch DBs don't all currently have a schema that supports storing this new metadata. Would this be something you would like help with?

sean-dickinson avatar May 17 '24 19:05 sean-dickinson

> I'm interested in this feature, particularly for pgvector. I'm noticing that the different vectorsearch DBs don't all currently have a schema that supports storing this new metadata. Would this be something you would like help with?

@sean-dickinson Yes! Any help here would be extremely appreciated! Do you have a DSL in mind we'd implement?

andreibondarev avatar May 20 '24 14:05 andreibondarev

> I'm interested in this feature, particularly for pgvector. I'm noticing that the different vectorsearch DBs don't all currently have a schema that supports storing this new metadata. Would this be something you would like help with?

> @sean-dickinson Yes! Any help here would be extremely appreciated! Do you have a DSL in mind we'd implement?

@andreibondarev I'm not sure it's so much a DSL as a standardized schema for storing the data. That being said, I'm not very knowledgeable about the different vector databases here, but I'm assuming that in an ideal schema we would be able to store an object that represents the original data source (with a unique identifier) and then a collection of objects with the actual text splits that reference the original source.

The metadata field could live on either the parent or the chunks (or both, I suppose, if you wanted), but the parent probably makes the most sense. Then when you do a search, you are searching the chunks, and you can also grab the parent record that the chunks reference to get the metadata.

I'm thinking this schema allows for the easiest updates if the sources you are using change (for instance, if your source is a URL and the content updates, the URL stays the same but you want to clear out the old chunks and add new ones).

Note I'm taking these ideas from the LangChain python pgvector implementation.

In terms of a DSL, maybe it makes sense to name the concept something like DataSource to help conform to a more structured schema?
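
For the sake of discussion, a rough sketch of that layout (table and column names are made up for illustration; this is not an existing langchainrb schema):

```ruby
# Conceptual schema only: a parent row per source document plus child chunk rows.
PARENT_CHILD_SCHEMA = <<~SQL
  CREATE TABLE data_sources (
    id       BIGSERIAL PRIMARY KEY,
    uri      TEXT NOT NULL UNIQUE,   -- file path or URL of the original document
    metadata JSONB DEFAULT '{}'      -- source-level metadata (title, author, ...)
  );

  CREATE TABLE chunks (
    id             BIGSERIAL PRIMARY KEY,
    data_source_id BIGINT REFERENCES data_sources(id) ON DELETE CASCADE,
    content        TEXT NOT NULL,
    vectors        VECTOR(1536)      -- embedding size depends on the model
  );
SQL

# Re-ingesting a changed URL then means deleting that data_source's chunks
# and inserting fresh ones, while the metadata stays in one place.
```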

sean-dickinson avatar May 20 '24 14:05 sean-dickinson

@sean-dickinson I think I would be more in favor of an iterative approach here: enhancing and standardizing the current schema across all the different vectorsearch DBs as opposed to overhauling it. We can slowly iterate towards the ideal state.

andreibondarev avatar May 20 '24 15:05 andreibondarev

> @sean-dickinson I think I would be more in favor of an iterative approach here: enhancing and standardizing the current schema across all the different vectorsearch DBs as opposed to overhauling it. We can slowly iterate towards the ideal state.

@andreibondarev Totally fair. I think we can essentially achieve the same functionality by just adding a metadata field for each of the vector DBs, like you said; then it could always be improved upon in the future should the need arise.

Regardless, what's the state of your branch referenced here where you added the metadata as part of the parsing process? Do you want to build on that and update all the vector db adapters to expect this new field on that branch? Or do you want a separate PR to update all the vector db schemas?

sean-dickinson avatar May 20 '24 16:05 sean-dickinson

@sean-dickinson I think maybe we first standardize the metadata: {} param across all of the different vectorsearch providers.

We could probably do Pgvector first. I think you'd need to add some sort of metadata JSON column and change this method: https://github.com/patterns-ai-core/langchainrb/blob/main/lib/langchain/vectorsearch/pgvector.rb#L73-L83
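
Roughly what that could look like, assuming the Sequel-based adapter and simplifying heavily from memory (not a drop-in patch for the linked method; the metadata keyword and helper names are assumptions):

```ruby
# One-off migration: a JSONB column alongside the embeddings.
db.run "ALTER TABLE #{table_name} ADD COLUMN IF NOT EXISTS metadata JSONB DEFAULT '{}'"

# Simplified upsert accepting per-chunk metadata (requires Sequel's pg_json extension).
def upsert_texts(texts:, ids:, metadata: [])
  data = texts.zip(ids, metadata).map do |text, id, meta|
    {
      id: id,
      content: text,
      vectors: llm.embed(text: text).embedding.to_s,
      metadata: Sequel.pg_jsonb(meta || {}) # new column
    }
  end

  db[table_name.to_sym]
    .insert_conflict(
      target: :id,
      update: {
        content: Sequel[:excluded][:content],
        vectors: Sequel[:excluded][:vectors],
        metadata: Sequel[:excluded][:metadata]
      }
    )
    .multi_insert(data)
end
```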

What're your thoughts?

andreibondarev avatar May 20 '24 18:05 andreibondarev

I took a stab at this for pgvector, following the guidance in the comment above: https://github.com/patterns-ai-core/langchainrb/pull/859

aellispierce avatar Oct 28 '24 11:10 aellispierce