
Is there a way we can pass in a custom source into vector store?

Open rounakdatta opened this issue 1 year ago • 3 comments

For example, let's say I have a big .txt file (a WhatsApp chat export). When I store it as embeddings in the vector store, I think the source_document is set to <name_of_file>.txt, which is fine. But what I want is to attribute a finer-grained source: say, the person(s) who said a particular keyword, the datetime, and so on.

Is this currently supported in LangChain?

rounakdatta, Apr 15 '23 14:04

I assume that you'd need to first parse the chat export and split it into individual messages. There's a document loader for that: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/whatsapp_chat.html

shtratos, Apr 16 '23 11:04

I'm already using the WhatsAppChatLoader abstraction. But when I ask a query with RetrievalQAWithSourcesChain, it returns the source as (say) /files/whatsapp_chat_export.txt, not the exact message.
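
As far as I understand, RetrievalQAWithSourcesChain reports whatever string each retrieved document carries under its `source` metadata key, so a finer attribution would have to live in that value. A minimal sketch, assuming a hypothetical helper `message_source` and a made-up `path#sender@date` label format (neither is part of LangChain, and the sender and date are illustrative values):

```python
def message_source(path: str, sender: str, date: str) -> str:
    """Build a per-message source label (hypothetical format, not a LangChain convention)."""
    return f"{path}#{sender}@{date}"


# Metadata for one message. "sender" and "date" are extra keys kept alongside
# the label so they can still be inspected or filtered on separately.
metadata = {
    "source": message_source("/files/whatsapp_chat_export.txt", "Alice", "4/15/23, 14:04"),
    "sender": "Alice",
    "date": "4/15/23, 14:04",
}
print(metadata["source"])
```

With one document per message and a label like this, the reported source could point at a single message rather than the whole file.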

rounakdatta, Apr 16 '23 17:04

Well, you can adjust it to produce each message as a separate document (or use a sliding window over several messages).

E.g. something like this (disclaimer: totally untested):


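    import re

    from langchain.docstore.document import Document
    # Assumed location of the helper that WhatsAppChatLoader itself uses.
    from langchain.document_loaders.whatsapp_chat import concatenate_rows


    # Hypothetical wrapper; in practice this would replace the body of the loader's load().
    def load_messages(p, lines, message_line_regex):
        """Return one Document per chat message instead of one per file."""
        docs = []
        for line in lines:
            result = re.match(message_line_regex, line.strip())
            if result:
                date, sender, text = result.groups()
                text_content = concatenate_rows(date, sender, text)
                # Per-message metadata: file path plus who said it and when.
                metadata = {"source": str(p), "sender": sender, "date": date}
                docs.append(Document(page_content=text_content, metadata=metadata))
        return docs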
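
The sliding-window variant mentioned above could look roughly like the following. This is only a sketch under stated assumptions: `Doc` is a stand-in dataclass so the example runs without LangChain, `MESSAGE_RE` is a simplified hypothetical pattern (real exports vary by locale and the loader's own regex is more thorough), and `windowed_docs` and its metadata keys are invented names.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for LangChain's Document so the sketch runs on its own."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Hypothetical, simplified line format: "<date> - <sender>: <text>".
MESSAGE_RE = re.compile(r"^(.+?) - ([^:]+): (.*)$")


def windowed_docs(lines, source, window=3, step=1):
    """Group consecutive messages into overlapping windows, one Doc per window."""
    messages = []
    for line in lines:
        match = MESSAGE_RE.match(line.strip())
        if match:
            messages.append(match.groups())  # (date, sender, text)

    docs = []
    # If there are fewer messages than `window`, emit a single shorter window.
    for start in range(0, max(len(messages) - window + 1, 1), step):
        chunk = messages[start:start + window]
        if not chunk:
            break
        content = "\n".join(f"{sender} on {date}: {text}" for date, sender, text in chunk)
        metadata = {
            "source": source,
            # Joined into one string, since some vector stores may only accept scalar values.
            "senders": ", ".join(sorted({sender for _, sender, _ in chunk})),
            "first_date": chunk[0][0],
            "last_date": chunk[-1][0],
        }
        docs.append(Doc(page_content=content, metadata=metadata))
    return docs
```

A larger `window` gives each embedding more conversational context at the cost of coarser attribution; `step` controls how much neighbouring windows overlap.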

shtratos, Apr 17 '23 21:04

Hi, @rounakdatta! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue is about whether it is possible to pass a custom source into the vector store in LangChain. Shtratos suggests using the WhatsAppChatLoader abstraction and adjusting it to produce each message as a separate document. They even provided an example code snippet for reference.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

dosubot[bot], Sep 03 '23 16:09