Customizing Testset Generation: list index out of range

Open GolfHotelSierra opened this issue 1 year ago • 7 comments

Your Question

  • While customising testset generation, I used InMemoryDocumentStore and LCDocument from langchain.
  • Calling the generate_with_langchain_docs function raised an error. The full output captured while debugging is as follows:

Connected to pydev debugger (build 221.5080.212)
Created a chunk of size 593, which is longer than the specified 400
Created a chunk of size 1090, which is longer than the specified 400
Created a chunk of size 711, which is longer than the specified 400
Created a chunk of size 597, which is longer than the specified 400
Created a chunk of size 478, which is longer than the specified 400
Traceback (most recent call last):
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/ragas/executor.py", line 75, in run
    results = self.loop.run_until_complete(self._aresults())
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/ragas/executor.py", line 63, in _aresults
    raise e
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/ragas/executor.py", line 58, in _aresults
    r = await future
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/ragas/executor.py", line 91, in wrapped_callable_async
    return counter, await callable(*args, **kwargs)
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/ragas/testset/extractor.py", line 49, in extract
    results = await self.llm.generate(prompt=prompt, is_async=is_async)
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/ragas/llms/base.py", line 92, in generate
    return await agenerate_text_with_retry(
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/tenacity/_asyncio.py", line 88, in async_wrapped
    return await fn(*args, **kwargs)
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/tenacity/_asyncio.py", line 47, in __call__
    do = self.iter(retry_state=retry_state)
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/tenacity/__init__.py", line 325, in iter
    raise retry_exc.reraise()
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/tenacity/__init__.py", line 158, in reraise
    raise self.last_attempt.result()
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/tenacity/_asyncio.py", line 50, in __call__
    result = await fn(*args, **kwargs)
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/ragas/llms/base.py", line 177, in agenerate_text
    result = await self.langchain_llm.agenerate_prompt(
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/langchain_core/language_models/llms.py", line 578, in agenerate_prompt
    return await self.agenerate(
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/langchain_core/language_models/llms.py", line 853, in agenerate
    isinstance(callbacks[0], (list, BaseCallbackManager))
IndexError: list index out of range
python-BaseException

Process finished with exit code -1

  • I don't know much about asyncio, so I could only pinpoint that the add_nodes function of InMemoryDocumentStore triggers the error.
  • The code at line 853 of "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/langchain_core/language_models/llms.py" is as follows:

# Create callback managers
if isinstance(callbacks, list) and (
    isinstance(callbacks[0], (list, BaseCallbackManager))
    or callbacks[0] is None
):

While debugging I found that callbacks=[] at this point, so evaluating callbacks[0] raises the IndexError: list index out of range.
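To make the failure mode concrete, here is a minimal sketch that mirrors the quoted condition outside of langchain. The BaseCallbackManager class below is a hypothetical stand-in, and both helper functions are illustrative names, not part of langchain or ragas. It only shows that an empty list passes the isinstance(callbacks, list) test and then fails on callbacks[0], and that an extra truthiness check would avoid the index.

```python
from typing import Any, List, Optional


class BaseCallbackManager:
    """Hypothetical stand-in for langchain_core's BaseCallbackManager."""


def is_per_prompt_callbacks(callbacks: Optional[List[Any]]) -> bool:
    """Mirror of the quoted condition: indexes callbacks[0] unconditionally."""
    return isinstance(callbacks, list) and (
        isinstance(callbacks[0], (list, BaseCallbackManager))
        or callbacks[0] is None
    )


def is_per_prompt_callbacks_guarded(callbacks: Optional[List[Any]]) -> bool:
    """Same check, but an empty list short-circuits before callbacks[0]."""
    return bool(
        isinstance(callbacks, list)
        and callbacks  # an empty list is falsy, so indexing is skipped
        and (
            isinstance(callbacks[0], (list, BaseCallbackManager))
            or callbacks[0] is None
        )
    )


def raises_index_error(callbacks: Optional[List[Any]]) -> bool:
    """Report whether the unguarded check raises IndexError for this input."""
    try:
        is_per_prompt_callbacks(callbacks)
    except IndexError:
        return True
    return False
```

With these definitions, raises_index_error([]) is true while raises_index_error(None) is false, which matches the observation that the error only appears when callbacks arrives as an empty list rather than as None.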

Code Examples

from langchain_community.document_loaders import TextLoader
from ragas.testset.generator import TestsetGenerator
from ragas.llms.base import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_community.llms.xinference import Xinference
from langchain_community.embeddings import XinferenceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from ragas.testset.docstore import InMemoryDocumentStore
from ragas.testset.extractor import KeyphraseExtractor
from ragas.testset.evolutions import simple, reasoning, multi_context

if __name__ == '__main__':
    loader = TextLoader(file_path='chap01.tex', encoding='utf-8')
    document = loader.load()
    for d in document:
        d.metadata['file_name'] = d.metadata['source']
    llm = Xinference(server_url="http://0.0.0.0:9997", model_uid='qwen-chat')
    langchain_llm = LangchainLLMWrapper(langchain_llm=llm)
    embeddings = XinferenceEmbeddings(server_url="http://0.0.0.0:9997", model_uid='bge-large-zh-v1.5')
    langchain_embeddings = LangchainEmbeddingsWrapper(embeddings=embeddings)
    generator = TestsetGenerator(generator_llm=langchain_llm, critic_llm=langchain_llm, embeddings=langchain_embeddings,
                                 docstore=InMemoryDocumentStore(
                                     splitter=CharacterTextSplitter(chunk_size=400, chunk_overlap=0),
                                     extractor=KeyphraseExtractor(llm=langchain_llm), embeddings=langchain_embeddings))
    test_dataset = generator.generate_with_langchain_docs(documents=document, test_size=10,
                                                          distributions={simple: 0.5, reasoning: 0.25,
                                                                         multi_context: 0.25}, is_async=False)
    df = test_dataset.to_pandas()  # named df to avoid shadowing the pandas module name
    print(df)

Additional context

  • I found that in the package ragas.llms.base, functions such as agenerate_text and generate all use callbacks: Callbacks = [] as the default. I can't say for sure whether this default causes the problem, since I stopped debugging when the traceback jumped into the async part.
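As a sketch of one possible workaround (not a confirmed fix, and not the ragas API), an empty callbacks list could be mapped to None before it is forwarded, since the quoted condition only indexes callbacks[0] when callbacks is a list. The names normalize_callbacks and generate_text below are hypothetical and exist only for this illustration.

```python
from typing import Any, List, Optional


def normalize_callbacks(callbacks: Optional[List[Any]]) -> Optional[List[Any]]:
    """Map an empty callbacks list to None; leave everything else untouched."""
    if isinstance(callbacks, list) and not callbacks:
        return None
    return callbacks


def generate_text(prompt: str, callbacks: Optional[List[Any]] = None) -> dict:
    """Hypothetical wrapper: the default is None rather than a shared mutable []."""
    forwarded = normalize_callbacks(callbacks)
    # A real wrapper would pass `forwarded` on to the underlying LLM call here.
    return {"prompt": prompt, "callbacks": forwarded}
```

Using None as the default also avoids sharing one mutable list object across calls, which is a general Python concern with list defaults and independent of this particular error.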

GolfHotelSierra avatar Feb 28 '24 17:02 GolfHotelSierra

  • These lines raise the error
test_dataset = generator.generate_with_langchain_docs(documents=document, test_size=10,
                                                      distributions={simple: 0.5, reasoning: 0.25,
                                                                     multi_context: 0.25}, is_async=False)
  • In function generate_with_langchain_docs,
    def generate_with_langchain_docs(
        self,
        documents: t.Sequence[LCDocument],
        test_size: int,
        distributions: Distributions = {},
        with_debugging_logs=False,
        is_async: bool = True,
        raise_exceptions: bool = True,
        run_config: t.Optional[RunConfig] = None
    ):
        # chunk documents and add to docstore
        self.docstore.add_documents(
            [Document.from_langchain_document(doc) for doc in documents]
        )

        return self.generate(
            test_size=test_size,
            distributions=distributions,
            with_debugging_logs=with_debugging_logs,
            is_async=is_async,
            raise_exceptions=raise_exceptions,
            run_config=run_config,
        )

the following call raises the error:

self.docstore.add_documents(
    [Document.from_langchain_document(doc) for doc in documents]
)

The failure happens inside the add_documents function.

GolfHotelSierra avatar Feb 29 '24 05:02 GolfHotelSierra

Interesting, can you take a look at this @jjmachan ?

shahules786 avatar Feb 29 '24 05:02 shahules786

@jjmachan, @shahules786 I encountered the same issue when I used Google LLMs to generate the test set. It also returned an IndexError: list index out of range. This bug resulted in doubling the number of nodes. For instance, in the following script:

test_dataset = generator.generate_with_langchain_docs(documents=document, test_size=10,
                                distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25}) 

Assuming my document size is 100, I would expect the step above to add 100 nodes. When I ran it with Google LLMs, however, it appeared to add 200.

When I ran it with ragas' .with_openai models, it ran smoothly and added 100 nodes.

chizhang-lbg avatar Mar 03 '24 00:03 chizhang-lbg

Hello @chizhang-lbg, it also ran smoothly for me when using OpenAI models, like this:

llm = ChatOpenAI()
langchain_llm = LangchainLLMWrapper(langchain_llm=llm)

but it got stuck and threw an error when I added openai_api_base to it, like this:

llm = ChatOpenAI(openai_api_base='https://api.chatanywhere.tech')

It mostly stops at 47% or 48% during the "embedding nodes" step.

Did you also hit the list index out of range error at the callbacks check here?

# Create callback managers
if isinstance(callbacks, list) and (
    isinstance(callbacks[0], (list, BaseCallbackManager))
    or callbacks[0] is None
):

GolfHotelSierra avatar Mar 05 '24 11:03 GolfHotelSierra

Got the same error as @GolfHotelSierra. I am using Langchain LLM, embedding model and an InMemoryDocumentStore.

Error:

Traceback (most recent call last):
  File "...\lib\threading.py", line 932, in _bootstrap_inner
    self.run()
  File "...\Python38\site-packages\ragas\executor.py", line 93, in run
    results = self.loop.run_until_complete(self._aresults())
  File "...\lib\asyncio\base_events.py", line 616, in run_until_complete
    return future.result()
  File "...\Python38\site-packages\ragas\executor.py", line 81, in _aresults
    raise e
  File "...\Python38\site-packages\ragas\executor.py", line 76, in _aresults
    r = await future
  File "...\lib\asyncio\tasks.py", line 619, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "...\Python38\site-packages\ragas\executor.py", line 36, in sema_coro
    return await coro
  File "...\Python38\site-packages\ragas\executor.py", line 109, in wrapped_callable_async
    return counter, await callable(*args, **kwargs)
  File "...\Python38\site-packages\ragas\testset\extractor.py", line 49, in extract
    results = await self.llm.generate(prompt=prompt, is_async=is_async)
  File "...\Python38\site-packages\ragas\llms\base.py", line 92, in generate
    return await agenerate_text_with_retry(
  File "...\site-packages\tenacity\_asyncio.py", line 88, in async_wrapped
    return await fn(*args, **kwargs)
  File "...\lib\site-packages\tenacity\_asyncio.py", line 47, in __call__
    do = self.iter(retry_state=retry_state)
  File "...\lib\site-packages\tenacity\__init__.py", line 325, in iter
    raise retry_exc.reraise()
  File "...\lib\site-packages\tenacity\__init__.py", line 158, in reraise
    raise self.last_attempt.result()
  File "...\lib\concurrent\futures\_base.py", line 437, in result
    return self.__get_result()
  File "...\lib\concurrent\futures\_base.py", line 389, in __get_result
    raise self._exception
  File "...\lib\site-packages\tenacity\_asyncio.py", line 50, in __call__
    result = await fn(*args, **kwargs)
  File "...\Python38\site-packages\ragas\llms\base.py", line 179, in agenerate_text
    result = await self.langchain_llm.agenerate_prompt(
  File "...\Python38\site-packages\langchain_core\language_models\llms.py", line 578, in agenerate_prompt
    return await self.agenerate(
  File "...\Python38\site-packages\langchain_core\language_models\llms.py", line 853, in agenerate
    isinstance(callbacks[0], (list, BaseCallbackManager))
IndexError: list index out of range

Traceback (most recent call last):
  File "test.py", line 103, in <module>
    test_set = generator.generate_with_langchain_docs(
  File "...\Python38\site-packages\ragas\testset\generator.py", line 152, in generate_with_langchain_docs
    self.docstore.add_documents(
  File "...\Python38\site-packages\ragas\testset\docstore.py", line 215, in add_documents
    self.add_nodes(nodes, show_progress=show_progress)
  File "...\Python38\site-packages\ragas\testset\docstore.py", line 254, in add_nodes
    raise ExceptionInRunner()

The code I ran:

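# Imports were not shown in the original post; these are the likely ones.
from langchain.text_splitter import TokenTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Ollama
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms.base import LangchainLLMWrapper
from ragas.testset.docstore import InMemoryDocumentStore
from ragas.testset.extractor import KeyphraseExtractor

# embedding_model_path, model_kwargs, encode_kwargs, generator, flat_documents
# and distributions are defined elsewhere in the original script (not shown).

# Wrap the Ollama LLM for ragas
ollama_llm = Ollama(model="llama2:7b-chat-q4_0")
langchain_llm_ollama = LangchainLLMWrapper(ollama_llm)

# Wrap the HuggingFace embedding model for ragas
embedding_model = HuggingFaceEmbeddings(
    model_name=embedding_model_path,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
langchain_embeddings = LangchainEmbeddingsWrapper(embedding_model)

splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=40)
keyphrase_extractor = KeyphraseExtractor(llm=langchain_llm_ollama)

docstore = InMemoryDocumentStore(
    splitter=splitter,
    embeddings=langchain_embeddings,
    extractor=keyphrase_extractor,
)

if __name__ == "__main__":
    test_set = generator.generate_with_langchain_docs(
        flat_documents,
        test_size=10,
        distributions=distributions)
    pandas_test_set = test_set.to_pandas()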

Zvapo avatar Mar 05 '24 12:03 Zvapo

Hey guys @GolfHotelSierra @chizhang-lbg @Zvapo We are taking issues one by one and will certainly get to this. As a small team, we would really appreciate any help from you guys to improve our project :)

shahules786 avatar Mar 05 '24 18:03 shahules786

The error I encountered was triggered in the .from_langchain_document step, specifically in the add_nodes method. What fixed the issue for me was reinstalling the langchain packages:

pip uninstall langchain
pip install langchain

This updated langchain to:

  • langchain-core 0.1.33
  • langchain-community 0.0.29
  • langchain 0.1.13

.from_langchain_document now works, but I could not fully test the .generate_with_langchain_docs method since it takes a lot of time and I am running on a CPU. Give it a go; it might work for you.

Also, might it be worth updating the requirements file to include these langchain versions when RAGAS is installed?
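For anyone who wants to check their environment against the versions listed above, here is a small sketch using the standard library's importlib.metadata. The helper names are hypothetical, and the version parsing assumes plain numeric "X.Y.Z" strings, so it would need adjusting for pre-release or post-release suffixes.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Tuple

# Versions reported in this thread as working.
REPORTED_WORKING = {
    "langchain-core": "0.1.33",
    "langchain-community": "0.0.29",
    "langchain": "0.1.13",
}


def as_tuple(v: str) -> Tuple[int, ...]:
    """Parse a plain numeric 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in v.split("."))


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def older_than_reported(installed: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Return {package: installed_version} for packages older than reported."""
    return {
        name: ver
        for name, ver in installed.items()
        if ver is not None and as_tuple(ver) < as_tuple(REPORTED_WORKING[name])
    }
```

Calling older_than_reported({name: installed_version(name) for name in REPORTED_WORKING}) returns the packages that are installed at an older version than the ones listed here; packages that are not installed are skipped.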

Zvapo avatar Mar 26 '24 10:03 Zvapo