genai-stack

Trying to modify the PDF reader with "Sources" information

Open MaxSychevskiy opened this issue 2 years ago • 4 comments

Hi,

I wanted to modify the PDF bot slightly by removing the automatic clean-up of previously loaded information, so that I can load several PDFs and run questions across all of them. That works in simple terms, but I'm struggling to work out how to add "Source" information to the Neo4j graph so it can be used as part of the answer. The source could be as simple as the name of the file.

Any help from anyone?

MaxSychevskiy, Dec 12 '23 01:12

There's a RetrievalQAWithSourcesChain mentioned here https://python.langchain.com/docs/integrations/vectorstores/neo4jvector

I've tried swapping it in for RetrievalQA in pdf_bot.py but haven't managed to get it to work yet.

MikePos1581, Dec 18 '23 15:12

@tomasonjo, correct me if I'm wrong, but isn't the main thing to provide a {metadata: {source: source-link}} to the qa_chain_with_sources?

from langchain.chains.qa_with_sources import load_qa_with_sources_chain

https://python.langchain.com/docs/use_cases/question_answering/sources

jexp, Jan 24 '24 17:01
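
A rough sketch of why the source key matters: a sources-aware chain can only cite a chunk if that chunk's metadata carries a source entry it can show next to the content. The plain Python below is not LangChain's code. `format_chunk` and `build_context` are hypothetical helpers, and the "Content / Source" layout is an assumption made purely for illustration.

```python
def format_chunk(page_content: str, metadata: dict) -> str:
    """Render one retrieved chunk together with its source (hypothetical helper)."""
    # Fail early: a chunk without a source cannot be cited in the answer.
    if "source" not in metadata:
        raise KeyError("chunk metadata has no 'source' key")
    return f"Content: {page_content}\nSource: {metadata['source']}"


def build_context(chunks) -> str:
    """Join (text, metadata) pairs into one context string (hypothetical helper)."""
    return "\n\n".join(format_chunk(text, meta) for text, meta in chunks)


# Illustrative data only: two chunks that came from two different PDFs.
chunks = [
    ("First passage.", {"source": "report_a.pdf"}),
    ("Second passage.", {"source": "report_b.pdf"}),
]
context = build_context(chunks)
print(context)
```

If a chunk was indexed without a source, the sketch raises instead of silently producing an answer that cannot be attributed.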

Feel free to send a PR

jexp, Jan 24 '24 17:01

To store source information in Neo4j, you need to populate the vector index with the from_documents method instead of from_texts. You could do this with something like:

from langchain.schema import Document
# Wrap each text chunk in a Document and record the file it came from
documents = [Document(page_content=text, metadata={"source": "PDF file name"}) for text in texts]

Any key-value pair in metadata is stored as additional node properties.

tomasonjo, Jan 24 '24 17:01
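
For the multi-PDF case from the original question, the same idea can be sketched as follows, assuming the chunks are already grouped by file name. `build_documents` and the sample file names are hypothetical. The fallback `Document` class is only a stand-in so the sketch runs without LangChain installed. The final `Neo4jVector.from_documents` call is left as a comment because it needs a live Neo4j instance and an embeddings object.

```python
from dataclasses import dataclass, field

try:
    # Real class, used when LangChain is installed.
    from langchain.schema import Document
except ImportError:
    @dataclass
    class Document:
        """Minimal stand-in for LangChain's Document (assumption for this sketch)."""
        page_content: str
        metadata: dict = field(default_factory=dict)


def build_documents(chunks_by_file):
    """Turn {file name: [chunk, ...]} into Documents tagged with their source file."""
    return [
        Document(page_content=chunk, metadata={"source": file_name})
        for file_name, chunks in chunks_by_file.items()
        for chunk in chunks
    ]


# Illustrative data only: text chunks already extracted from two PDFs.
chunks_by_file = {
    "report_a.pdf": ["First chunk of A.", "Second chunk of A."],
    "report_b.pdf": ["Only chunk of B."],
}
documents = build_documents(chunks_by_file)

# With a running database, the list would then be indexed, for example:
# Neo4jVector.from_documents(documents, embeddings, url=url,
#                            username=username, password=password)
```

Each chunk then carries its own file name, so questions asked across several PDFs can report which file an answer came from.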