langchain
Is there a way to pass a custom source into the vector store?
For example, let's say I have a big txt file (a WhatsApp chat export). When I store it as embeddings in the vector store, I think the source_document is set to <name_of_file>.txt, which is fine. But what I want is to attribute a finer-grained source: say, the person(s) who said a particular keyword, the datetime, and so on.
Is this currently supported in LangChain?
I assume that you'd first need to parse the chat export and split it into individual messages. There's a document loader for that: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/whatsapp_chat.html
I'm already using the WhatsAppChatLoader abstraction. But when I ask a query with RetrievalQAWithSourcesChain, it returns the source as (say) /files/whatsapp_chat_export.txt, not the exact message.
Well, you can adjust it to produce each message as a separate document (or use a sliding window over several messages), e.g. something like this (disclaimer: totally untested):
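    # Sketch only (untested): emit one Document per message instead of one per file.
    # lines, message_line_regex and concatenate_rows are assumed to be the ones
    # already used by WhatsAppChatLoader (pass the same regex flags it uses, if any);
    # p is the path of the export file.
    import re
    from langchain.docstore.document import Document  # import path may differ by version

    def load_messages(lines, p):  # hypothetical helper name
        docs = []
        for line in lines:
            result = re.match(message_line_regex, line.strip())
            if result:
                date, sender, text = result.groups()
                text_content = concatenate_rows(date, sender, text)
                # Per-message metadata lets each hit be traced to a sender and date.
                metadata = {"source": str(p), "sender": sender, "date": date}
                docs.append(Document(page_content=text_content, metadata=metadata))
        return docs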
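For anyone landing here later, here is a minimal, self-contained sketch of both ideas above (one document per message, and a sliding window over several messages). It deliberately avoids importing LangChain: Doc is a hypothetical stand-in for Document, LINE_RE is an assumed pattern for one common export layout (real exports vary by locale and platform), and parse_messages and sliding_windows are made-up helper names. Note that the snippet above still sets source to the file path. If, as I understand it, RetrievalQAWithSourcesChain reports whatever is stored under the source metadata key, then putting a message-level string there is what would make the finer attribution show up in the answer.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document so the sketch runs alone."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Assumed line format: "1/23/23, 3:19 PM - Alice: hello there".
# This is an illustration, not the regex used by WhatsAppChatLoader.
LINE_RE = re.compile(
    r"^(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?: [AP]M)?) - ([^:]+): (.*)$"
)


def parse_messages(lines, path):
    """Return one Doc per message, with a message-level source string."""
    docs = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # system notices and continuation lines are skipped
        date, sender, text = match.groups()
        docs.append(Doc(
            page_content=f"{sender} on {date}: {text}",
            metadata={
                "source": f"{path} | {sender} | {date}",
                "file": str(path),
                "sender": sender,
                "date": date,
            },
        ))
    return docs


def sliding_windows(docs, size=3, step=1):
    """Group consecutive messages into overlapping windows of `size` messages."""
    windows = []
    for start in range(0, max(len(docs) - size, 0) + 1, step):
        chunk = docs[start:start + size]
        if not chunk:
            break
        first, last = chunk[0].metadata, chunk[-1].metadata
        windows.append(Doc(
            page_content="\n".join(d.page_content for d in chunk),
            metadata={
                "source": f"{first['file']} | {first['date']} to {last['date']}",
                "senders": sorted({d.metadata["sender"] for d in chunk}),
            },
        ))
    return windows


if __name__ == "__main__":
    sample = [
        "1/23/23, 3:19 PM - Alice: hello there",
        "1/23/23, 3:21 PM - Bob: hi Alice",
    ]
    for doc in parse_messages(sample, "chat.txt"):
        print(doc.metadata["source"])
```

Per-message documents give the most precise attribution but very short texts to embed. Windows trade some precision (a date range and a set of senders) for more context per embedding. Either list could then be turned into real Document objects and passed to the vector store as usual.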
Hi, @rounakdatta! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue is about whether it is possible to pass a custom source into the vector store in LangChain. Shtratos suggests using the WhatsAppChatLoader abstraction and adjusting it to produce each message as a separate document. They even provided an example code snippet for reference.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository!