Customizing Testset Generation: list index out of range
Your Question
- When customising testset generation, I used InMemoryDocumentStore and LCDocument from langchain.
- An error occurred when calling the generate_with_langchain_docs function. While debugging, I got the following detailed error message:
Connected to pydev debugger (build 221.5080.212)
Created a chunk of size 593, which is longer than the specified 400
Created a chunk of size 1090, which is longer than the specified 400
Created a chunk of size 711, which is longer than the specified 400
Created a chunk of size 597, which is longer than the specified 400
Created a chunk of size 478, which is longer than the specified 400
Traceback (most recent call last):
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/ragas/executor.py", line 75, in run
    results = self.loop.run_until_complete(self._aresults())
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/ragas/executor.py", line 63, in _aresults
    raise e
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/ragas/executor.py", line 58, in _aresults
    r = await future
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/ragas/executor.py", line 91, in wrapped_callable_async
    return counter, await callable(*args, **kwargs)
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/ragas/testset/extractor.py", line 49, in extract
    results = await self.llm.generate(prompt=prompt, is_async=is_async)
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/ragas/llms/base.py", line 92, in generate
    return await agenerate_text_with_retry(
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/tenacity/_asyncio.py", line 88, in async_wrapped
    return await fn(*args, **kwargs)
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/tenacity/_asyncio.py", line 47, in __call__
    do = self.iter(retry_state=retry_state)
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/tenacity/__init__.py", line 325, in iter
    raise retry_exc.reraise()
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/tenacity/__init__.py", line 158, in reraise
    raise self.last_attempt.result()
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/tenacity/_asyncio.py", line 50, in __call__
    result = await fn(*args, **kwargs)
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/ragas/llms/base.py", line 177, in agenerate_text
    result = await self.langchain_llm.agenerate_prompt(
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/langchain_core/language_models/llms.py", line 578, in agenerate_prompt
    return await self.agenerate(
  File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/langchain_core/language_models/llms.py", line 853, in agenerate
    isinstance(callbacks[0], (list, BaseCallbackManager))
IndexError: list index out of range
Process finished with exit code -1
- I don't know much about async, so I could only pinpoint that the add_nodes function of InMemoryDocumentStore triggers the error.
- In File "/opt/conda/envs/evaluation-ljx/lib/python3.10/site-packages/langchain_core/language_models/llms.py", line 853, the corresponding code is as follows:
# Create callback managers
if isinstance(callbacks, list) and (
    isinstance(callbacks[0], (list, BaseCallbackManager))
    or callbacks[0] is None
):
I found that callbacks=[] at this point. Since an empty list still passes isinstance(callbacks, list), callbacks[0] is evaluated and raises the index out of range error.
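The failure can be reproduced without langchain at all. The sketch below is only an illustration: BaseCallbackManager is a stand-in class, and both function names are hypothetical, not part of langchain or ragas. It mirrors the shape of the condition quoted above and shows one possible guarded variant.

```python
class BaseCallbackManager:
    """Stand-in for langchain_core's class (assumption: only the type check matters)."""


def is_per_prompt_callbacks(callbacks):
    # Same shape as the quoted condition: the list is indexed
    # without first checking that it is non-empty.
    return isinstance(callbacks, list) and (
        isinstance(callbacks[0], (list, BaseCallbackManager))
        or callbacks[0] is None
    )


def is_per_prompt_callbacks_guarded(callbacks):
    # Hypothetical guarded variant: an empty list is rejected before indexing.
    return (
        isinstance(callbacks, list)
        and len(callbacks) > 0
        and (
            isinstance(callbacks[0], (list, BaseCallbackManager))
            or callbacks[0] is None
        )
    )


def raises_index_error(func, arg):
    # Small helper that reports whether calling func(arg) raises IndexError.
    try:
        func(arg)
    except IndexError:
        return True
    return False


if __name__ == "__main__":
    print(raises_index_error(is_per_prompt_callbacks, []))
    print(raises_index_error(is_per_prompt_callbacks_guarded, []))
```

Passing None instead of [] never reaches the indexing step in either version, because the isinstance check fails first.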
Code Examples
from langchain_community.document_loaders import TextLoader
from ragas.testset.generator import TestsetGenerator
from ragas.llms.base import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_community.llms.xinference import Xinference
from langchain_community.embeddings import XinferenceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from ragas.testset.docstore import InMemoryDocumentStore
from ragas.testset.extractor import KeyphraseExtractor
from ragas.testset.evolutions import simple, reasoning, multi_context
if __name__ == '__main__':
    loader = TextLoader(file_path='chap01.tex', encoding='utf-8')
    document = loader.load()
    for d in document:
        d.metadata['file_name'] = d.metadata['source']

    llm = Xinference(server_url="http://0.0.0.0:9997", model_uid='qwen-chat')
    langchain_llm = LangchainLLMWrapper(langchain_llm=llm)
    embeddings = XinferenceEmbeddings(server_url="http://0.0.0.0:9997", model_uid='bge-large-zh-v1.5')
    langchain_embeddings = LangchainEmbeddingsWrapper(embeddings=embeddings)

    generator = TestsetGenerator(
        generator_llm=langchain_llm,
        critic_llm=langchain_llm,
        embeddings=langchain_embeddings,
        docstore=InMemoryDocumentStore(
            splitter=CharacterTextSplitter(chunk_size=400, chunk_overlap=0),
            extractor=KeyphraseExtractor(llm=langchain_llm),
            embeddings=langchain_embeddings,
        ),
    )
    test_dataset = generator.generate_with_langchain_docs(
        documents=document,
        test_size=10,
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
        is_async=False,
    )
    pandas = test_dataset.to_pandas()
    print(pandas)
Additional context
- I found that in the package ragas.llms.base, functions like agenerate_text and generate all use callbacks: Callbacks = [] as the default. I can't say for sure whether this default causes the problem, since I stopped debugging when the traceback jumped into the async part.
- These lines raise the error:
test_dataset = generator.generate_with_langchain_docs(
    documents=document,
    test_size=10,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    is_async=False,
)
- In the function generate_with_langchain_docs,
def generate_with_langchain_docs(
    self,
    documents: t.Sequence[LCDocument],
    test_size: int,
    distributions: Distributions = {},
    with_debugging_logs=False,
    is_async: bool = True,
    raise_exceptions: bool = True,
    run_config: t.Optional[RunConfig] = None,
):
    # chunk documents and add to docstore
    self.docstore.add_documents(
        [Document.from_langchain_document(doc) for doc in documents]
    )
    return self.generate(
        test_size=test_size,
        distributions=distributions,
        with_debugging_logs=with_debugging_logs,
        is_async=is_async,
        raise_exceptions=raise_exceptions,
        run_config=run_config,
    )
the following part causes the error,
self.docstore.add_documents(
    [Document.from_langchain_document(doc) for doc in documents]
)
because of the add_documents function.
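Regarding the callbacks: Callbacks = [] default mentioned above, one possible caller-side workaround is to turn an empty list into None before it reaches the LLM call, since None never hits the callbacks[0] indexing. This is only a sketch under that assumption: normalize_callbacks and call_llm are hypothetical names for illustration, not ragas or langchain functions, and I have not confirmed this is how the maintainers would fix it.

```python
from typing import Any, List, Optional


def normalize_callbacks(callbacks: Optional[List[Any]] = None) -> Optional[List[Any]]:
    """Return None for a missing or empty callback list, otherwise a copy of it."""
    if not callbacks:
        return None
    return list(callbacks)


def call_llm(prompt: str, callbacks: Optional[List[Any]] = None) -> dict:
    # Hypothetical wrapper: a None default replaces the shared mutable [] default,
    # and the callbacks are normalized before being forwarded anywhere.
    return {"prompt": prompt, "callbacks": normalize_callbacks(callbacks)}


if __name__ == "__main__":
    print(call_llm("hello"))
    print(call_llm("hello", callbacks=[]))
```

Using None as the default also avoids sharing one mutable list object across calls, which is a general Python pitfall independent of this bug.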
Interesting, can you take a look at this @jjmachan ?
@jjmachan, @shahules786 I encountered the same issue when I used Google LLMs to generate the test set. It also returned an IndexError: list index out of range. This bug resulted in doubling the number of nodes. For instance, in the following script:
test_dataset = generator.generate_with_langchain_docs(documents=document, test_size=10,
distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})
Assuming my document size is 100, I would expect it to add 100 nodes in the above step. However, when I ran it with Google LLMs, it appeared to add 200.
However, when I ran it with ragas' .with_openai models, it ran smoothly and had 100 nodes in the process.
Hello @chizhang-lbg, it also ran smoothly for me when using OpenAI models, like
llm = ChatOpenAI()
langchain_llm = LangchainLLMWrapper(langchain_llm=llm)
but it got stuck and threw an error when I added openai_api_base to it, like this
llm = ChatOpenAI(openai_api_base='https://api.chatanywhere.tech')
It mostly stops at 47% or 48% during the embedding step (embedding nodes: 47%).
And did you encounter the list index out of range error with the callbacks here?
# Create callback managers
if isinstance(callbacks, list) and (
    isinstance(callbacks[0], (list, BaseCallbackManager))
    or callbacks[0] is None
):
Got the same error as @GolfHotelSierra. I am using Langchain LLM, embedding model and an InMemoryDocumentStore.
Error:
Traceback (most recent call last):
  File "...\lib\threading.py", line 932, in _bootstrap_inner
    self.run()
  File "...\Python38\site-packages\ragas\executor.py", line 93, in run
    results = self.loop.run_until_complete(self._aresults())
  File "...\lib\asyncio\base_events.py", line 616, in run_until_complete
    return future.result()
  File "...\Python38\site-packages\ragas\executor.py", line 81, in _aresults
    raise e
  File "...\Python38\site-packages\ragas\executor.py", line 76, in _aresults
    r = await future
  File "...\lib\asyncio\tasks.py", line 619, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "...\Python38\site-packages\ragas\executor.py", line 36, in sema_coro
    return await coro
  File "...\Python38\site-packages\ragas\executor.py", line 109, in wrapped_callable_async
    return counter, await callable(*args, **kwargs)
  File "...\Python38\site-packages\ragas\testset\extractor.py", line 49, in extract
    results = await self.llm.generate(prompt=prompt, is_async=is_async)
  File "...\Python38\site-packages\ragas\llms\base.py", line 92, in generate
    return await agenerate_text_with_retry(
  File "...\site-packages\tenacity\_asyncio.py", line 88, in async_wrapped
    return await fn(*args, **kwargs)
  File "...\lib\site-packages\tenacity\_asyncio.py", line 47, in __call__
    do = self.iter(retry_state=retry_state)
  File "...\lib\site-packages\tenacity\__init__.py", line 325, in iter
    raise retry_exc.reraise()
  File "...\lib\site-packages\tenacity\__init__.py", line 158, in reraise
    raise self.last_attempt.result()
  File "...\lib\concurrent\futures\_base.py", line 437, in result
    return self.__get_result()
  File "...\lib\concurrent\futures\_base.py", line 389, in __get_result
    raise self._exception
  File "...\lib\site-packages\tenacity\_asyncio.py", line 50, in __call__
    result = await fn(*args, **kwargs)
  File "...\Python38\site-packages\ragas\llms\base.py", line 179, in agenerate_text
    result = await self.langchain_llm.agenerate_prompt(
  File "...\Python38\site-packages\langchain_core\language_models\llms.py", line 578, in agenerate_prompt
    return await self.agenerate(
  File "...\Python38\site-packages\langchain_core\language_models\llms.py", line 853, in agenerate
    isinstance(callbacks[0], (list, BaseCallbackManager))
IndexError: list index out of range
Traceback (most recent call last):
File "test.py", line 103, in
The code I ran:
ollama_llm = Ollama(model="llama2:7b-chat-q4_0")
langchain_llm_ollama = LangchainLLMWrapper(ollama_llm)

# embedding_model_path, model_kwargs and encode_kwargs are not shown in the posted snippet
embedding_model = HuggingFaceEmbeddings(
    model_name=embedding_model_path,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)
langchain_embeddings = LangchainEmbeddingsWrapper(embedding_model)

splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=40)
keyphrase_extractor = KeyphraseExtractor(llm=langchain_llm_ollama)
docstore = InMemoryDocumentStore(
    splitter=splitter,
    embeddings=langchain_embeddings,
    extractor=keyphrase_extractor,
)

if __name__ == "__main__":
    # generator, flat_documents and distributions are not shown in the posted snippet
    test_set = generator.generate_with_langchain_docs(
        flat_documents,
        test_size=10,
        distributions=distributions,
    )
    pandas_test_set = test_set.to_pandas()
Hey guys @GolfHotelSierra @chizhang-lbg @Zvapo We are taking issues one by one and will certainly get to this. As a small team, we would really appreciate any help from you guys to improve our project :)
The error I encountered was triggered by the .from_langchain_document method, specifically by the add_nodes method. What fixed the issue for me was reinstalling the langchain packages:
pip uninstall langchain
pip install langchain
This updated langchain to:
- langchain-core 0.1.33
- langchain-community 0.0.29
- langchain 0.1.13
.from_langchain_document now works, but I could not fully test the .generate_with_langchain_docs method since it's taking a lot of time and I am running on a CPU. Give it a go; it might work for you.
Also, might it be worth updating the requirements file to include these langchain versions when RAGAS is installed?
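To check whether an environment already has at least the versions listed above, something like the following could be run before retrying. It is a rough sketch: the minimum versions come from this comment only (not from any official ragas requirement), the helper names are made up for illustration, and the version parsing is deliberately simple (it ignores suffixes such as rc1).

```python
import re
from importlib import metadata

# Versions reported as working in the comment above (an assumption, not an official pin).
MINIMUM_VERSIONS = {
    "langchain-core": "0.1.33",
    "langchain-community": "0.0.29",
    "langchain": "0.1.13",
}


def parse_version(text):
    """Turn '0.1.33' into (0, 1, 33), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    # Tuple comparison orders versions numerically, component by component.
    return parse_version(installed) >= parse_version(minimum)


def report(minimums=MINIMUM_VERSIONS):
    """Map each package name to 'ok', 'too old', or 'not installed'."""
    status = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            status[name] = "not installed"
            continue
        status[name] = "ok" if meets_minimum(installed, minimum) else "too old"
    return status


if __name__ == "__main__":
    for name, state in report().items():
        print(f"{name}: {state}")
```

Any package reported as "too old" can then be upgraded with pip before running generate_with_langchain_docs again.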