
RetrievalQAWithSourcesChain sometimes does not return sources under sources key

Open AliasSCM opened this issue 1 year ago • 6 comments

I am using RetrievalQAWithSourcesChain to get answers on documents that I previously embedded with Pinecone. I notice that sometimes the sources are not populated under the sources key when I run the chain.

I am using Pinecone to embed the PDF documents like so:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=['\n\n', '\n', ' ', '']
)
split_documents = text_splitter.split_documents(documents=documents)
Pinecone.from_documents(
    split_documents,
    OpenAIEmbeddings(),
    index_name='test_index',
    namespace='test_namespace'
)

I am using RetrievalQAWithSourcesChain to ask queries like so:

llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)
vectorstore: Pinecone = Pinecone.from_existing_index(
    index_name='test_index',
    embedding=OpenAIEmbeddings(),
    namespace='test_namespace'
)

qa_chain = load_qa_with_sources_chain(llm=llm, chain_type="stuff")
qa = RetrievalQAWithSourcesChain(
    combine_documents_chain=qa_chain,
    retriever=vectorstore.as_retriever(),
    reduce_k_below_max_tokens=True,
)

answer_response = qa({"question": question}, return_only_outputs=True)

Expected response

{'answer': 'some answer', 'sources': 'the_file_name.pdf'}

Actual response

{'answer': 'some answer', 'sources': ''}

This behaviour is not consistent: sometimes the sources appear inside the answer text itself rather than under the sources key, and at other times they appear under the 'sources' key and not in the answer. I want the sources to ALWAYS come under the sources key and never in the answer text.

I'm using langchain==0.0.149.

Am I missing something in the way I'm embedding or retrieving my documents? Or is this an issue with langchain?

Edit: Additional information on how to reproduce this issue

While trying to reproduce the exact issue for @jpdus, I noticed that this happens consistently when I request the answer in a table format: the source then comes back inside the answer instead of under the sources key. I am attaching a test document and some examples here:

Source : UN Doc.pdf

Query 1 (with table): what are the goals for sustainability 2030, provide your answer in a table format?

Response :

{'answer': 'Goals for Sustainability 2030:\n\nGoal 1. End poverty in all its forms everywhere\nGoal 2. End hunger, achieve food security and improved nutrition and promote sustainable agriculture\nGoal 3. Ensure healthy lives and promote well-being for all at all ages\nGoal 4. Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all\nGoal 5. Achieve gender equality and empower all women and girls\nGoal 6. Ensure availability and sustainable management of water and sanitation for all\nGoal 7. Ensure access to affordable, reliable, sustainable and modern energy for all\nGoal 8. Promote sustained, inclusive and sustainable economic growth, full and productive employment and decent work for all\nGoal 9. Build resilient infrastructure, promote inclusive and sustainable industrialization and foster innovation\nGoal 10. Reduce inequality within and among countries\nGoal 11. Make cities and human settlements inclusive, safe, resilient and sustainable\nGoal 12. Ensure sustainable consumption and production patterns\nGoal 13. Take urgent action to combat climate change and its impacts\nGoal 14. Conserve and sustainably use the oceans, seas and marine resources for sustainable development\nGoal 15. Protect, restore and promote sustainable use of terrestrial ecosystems, sustainably manage forests, combat desertification, and halt and reverse land degradation and halt biodiversity loss\nSource: docs/UN Doc.pdf', 'sources': ''}

Query 2 (without table) : what are the goals for sustainability 2030?

Response:

{'answer': "The goals for sustainability 2030 include expanding international cooperation and capacity-building support to developing countries in water and sanitation-related activities and programs, ensuring access to affordable, reliable, sustainable and modern energy for all, promoting sustained, inclusive and sustainable economic growth, full and productive employment and decent work for all, taking urgent action to combat climate change and its impacts, strengthening efforts to protect and safeguard the world's cultural and natural heritage, providing universal access to safe, inclusive and accessible green and public spaces, ensuring sustainable consumption and production patterns, significantly increasing access to information and communications technology and striving to provide universal and affordable access to the Internet in least developed countries by 2020, and reducing inequality within and among countries. \n", 'sources': 'docs/UN Doc.pdf'}
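Until the prompt behaves consistently, one pragmatic workaround I've been considering is post-processing the chain output: if the sources key is empty, pull a trailing "Source:"/"SOURCES:" line out of the answer text. A minimal sketch in plain Python (no LangChain dependency, so it works regardless of chain version):

```python
import re

# Matches a trailing "Source:" / "Sources:" / "SOURCES:" line at the end of the answer.
_SOURCE_RE = re.compile(r"\n?\s*SOURCES?\s*:\s*(?P<src>.+)\s*$", re.IGNORECASE)

def normalize_sources(response: dict) -> dict:
    """If 'sources' is empty but the answer ends with a Source line,
    move that line into the 'sources' key."""
    answer = response.get("answer", "")
    sources = response.get("sources", "")
    if not sources:
        match = _SOURCE_RE.search(answer)
        if match:
            sources = match.group("src").strip()
            answer = answer[: match.start()].rstrip()
    return {"answer": answer, "sources": sources}

result = normalize_sources({
    "answer": "Goal 1. End poverty in all its forms everywhere\nSource: docs/UN Doc.pdf",
    "sources": "",
})
# result["sources"] is now "docs/UN Doc.pdf" and the Source line is stripped from the answer
```

This doesn't fix the underlying prompt issue, but it does guarantee the sources always end up under the sources key.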

AliasSCM avatar Apr 26 '23 13:04 AliasSCM

You may use this. What kind of documents are you using?

texts = loader.load()

docs = RecursiveCharacterTextSplitter(
    separators=["\n\n"], chunk_size=1000, chunk_overlap=0
).transform_documents(texts)

Pinecone.from_documents(
    docs,
    OpenAIEmbeddings(),
    index_name='test_index',
    namespace='test_namespace'
)

pedrobuenoxs avatar Apr 26 '23 17:04 pedrobuenoxs


I'm trying to embed PDF files with text content. I don't see a transform_documents() function on RecursiveCharacterTextSplitter. I am using PyPDFLoader to load the documents from my local path:

loader = PyPDFLoader(file_path=file_path)

AliasSCM avatar Apr 26 '23 18:04 AliasSCM

It came from the TextSplitter base class, which RecursiveCharacterTextSplitter inherits from.

[screenshot: transform_documents defined on TextSplitter]

pedrobuenoxs avatar Apr 26 '23 18:04 pedrobuenoxs

@AliasSCM can you show a concrete example?

The default prompt includes a few-shot example without sources, but the model should only output no sources if there is no answer in the provided documents.

Did you try the same query with vectorstore.as_retriever().get_relevant_documents(...), and are there actual sources available?

Are the documents/queries in a language other than English? In that case it may make sense to modify the QA prompts (which yielded much better results for me in German) - see e.g. here: https://github.com/hwchase17/langchain/issues/3523#issuecomment-1523355163
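As a quick sanity check on the retrieval side, you can verify that every retrieved document actually carries a non-empty `source` in its metadata; if any document lacks it, the chain has nothing to put under the sources key. A minimal sketch (the `Doc` dataclass is a stand-in for LangChain's Document, used here only so the example is self-contained; with the real library you would pass the output of get_relevant_documents(...) directly):

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for langchain.schema.Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def docs_missing_source(docs):
    """Return the documents whose metadata lacks a non-empty 'source'."""
    return [d for d in docs if not d.metadata.get("source")]

retrieved = [
    Doc("Goal 1. End poverty...", {"source": "docs/UN Doc.pdf", "page": 3}),
    Doc("Goal 2. End hunger...", {}),  # no source metadata: would yield an empty sources key
]
missing = docs_missing_source(retrieved)
# 'missing' contains only the second document
```

If this check flags documents, the problem is in how they were embedded, not in the QA prompt.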

jphme avatar Apr 26 '23 19:04 jphme


Please check my edit on the original question. I have added the documents and outputs, plus additional information on how to reproduce.

AliasSCM avatar Apr 27 '23 06:04 AliasSCM

@AliasSCM interesting, I also experienced some problems with a QA chain and answers in a table format. Can you try with a custom few shot prompt including a "table question" with sources (see my previous answer for an example on how to customize the prompt)?

Do you have access to GPT-4 and can you try with that? I'd guess that gpt-3.5-turbo is probably "concentrating" too much on the table format and "forgetting" the additional formatting instructions in the context.
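For reference, the custom few-shot example could look roughly like this: a template snippet that demonstrates a table-formatted answer still ending in a separate SOURCES: line, so the model learns the source never belongs inside the table. Only the template text is sketched here; wiring it into the chain via PromptTemplate and load_qa_with_sources_chain(prompt=...) is an assumption based on the langchain 0.0.x stuff-chain API, so check the linked issue for the full customization:

```python
# A few-shot example for a QA-with-sources prompt showing a table answer
# followed by a separate SOURCES: line. The document/source names are
# placeholders; passing this into a PromptTemplate is left to the caller.
TABLE_EXAMPLE = """\
QUESTION: List the items, provide your answer in a table format.
=========
Content: Item A costs 1. Item B costs 2.
Source: docs/example.pdf
=========
FINAL ANSWER:
| Item | Cost |
|------|------|
| A    | 1    |
| B    | 2    |
SOURCES: docs/example.pdf
"""
```

The key property is that the answer body and the SOURCES: line stay separate, which is what the chain's output parser needs in order to populate the 'sources' key.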

jphme avatar Apr 27 '23 09:04 jphme

Hi, @AliasSCM! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you reported is that RetrievalQAWithSourcesChain does not consistently populate the sources under the sources key when running the chain, and that this behavior occurs when requesting the answer in a table format. In the comments, there have been suggestions to use a different method for loading documents, modify the QA prompts, try a custom few-shot prompt with sources, and use GPT-4.

Before we proceed, we would like to confirm if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project. We appreciate your support!

Best regards, Dosu

dosubot[bot] avatar Sep 17 '23 17:09 dosubot[bot]