The generated testset is empty
[ ] I checked the documentation and related resources and couldn't find an answer to my question.
Your Question
ragas version: 0.28.0
I used the script from the documentation to generate a non-English testset, but the output is empty.
Code Examples
# Imports assumed from the ragas / langchain docs (not shown in the original report)
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.testset import TestsetGenerator
from ragas.testset.persona import Persona
from ragas.testset.synthesizers.single_hop.specific import SingleHopSpecificQuerySynthesizer

chat_model = AzureChatOpenAI(
    api_version=api_version,  # api_version is defined elsewhere in the original script
    model="gpt-4o",
    azure_deployment="gpt-4o",
)
embedding_model = AzureOpenAIEmbeddings(
    model="text-embedding-3-large-turing",
    api_version=api_version,
    azure_deployment="text-embedding-3-large-turing",
)
generator_llm = LangchainLLMWrapper(chat_model)
generator_embeddings = LangchainEmbeddingsWrapper(embedding_model)

personas = [
    Persona(
        name="curious student",
        role_description="A student who is curious about the world and wants to learn more about different cultures and languages",
    ),
]
generator = TestsetGenerator(
    llm=generator_llm, embedding_model=generator_embeddings, persona_list=personas
)
distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 1.0),
]

path = "/data/eco_rag/testdata"
loader = DirectoryLoader(path, loader_cls=TextLoader, show_progress=True)
docs = loader.load()

# Adapt the synthesizer prompts to Spanish (top-level await works in a notebook;
# in a plain script, wrap this in asyncio.run)
for query, _ in distribution:
    prompts = await query.adapt_prompts("spanish", llm=generator_llm)
    query.set_prompts(**prompts)

dataset = generator.generate_with_langchain_docs(
    docs,
    testset_size=3,
    query_distribution=distribution,
)
Additional context
The output log is here:

100%|██████████| 1/1 [00:00<00:00, 730.59it/s]
Applying HeadlinesExtractor: 0%| | 0/1 [00:00<?, ?it/s]
Property 'summary' already exists in node 'e425ec'. Skipping!
Property 'summary_embedding' already exists in node 'e425ec'. Skipping!
Generating Scenarios: 0%| | 0/1 [00:00<?, ?it/s]
Generating Samples: 0it [00:00, ?it/s]
The corpus contains only the single document downloaded from https://huggingface.co/datasets/explodinggradients/Sample_non_english_corpus.
When I swap the document (Tokyo.txt) for Madrid.txt, it works. Why does this happen?
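One thing worth checking before blaming the model is whether the loader actually picked up the file and how much text it contains, since very short or mis-decoded input can leave the extractors with nothing to work on. A minimal sanity-check sketch, reusing the docs variable from the snippet above:

# Hypothetical sanity check: confirm the document loaded and is non-trivial.
# Mis-decoded or near-empty text can silently yield zero scenarios downstream.
for d in docs:
    print(d.metadata.get("source"), len(d.page_content), d.page_content[:80])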
I have tested models from several vendors, and only OpenAI's GPT-4 reliably generates Chinese test datasets; the others either run into rate limits or generate empty testsets. I hope the maintainers can fix this soon so the tool becomes genuinely useful. The Tongyi models do support Chinese documents: I tried qwen-max and it works.
Is there a rate limit?
Yes, requests are limited by TPM (tokens per minute) and QPM (queries per minute). See the official DashScope documentation here: https://help.aliyun.com/zh/dashscope/developer-reference/tongyi-thousand-questions-metering-and-billing
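If rate limits are the culprit, ragas exposes a RunConfig that throttles concurrency and retries. A sketch follows; the parameter values are illustrative, and whether generate_with_langchain_docs accepts run_config depends on your ragas version:

from ragas.run_config import RunConfig

# Illustrative values: fewer workers and patient retries to stay under TPM/QPM caps.
run_config = RunConfig(max_workers=2, max_retries=10, max_wait=60, timeout=180)

dataset = generator.generate_with_langchain_docs(
    docs,
    testset_size=3,
    query_distribution=distribution,
    run_config=run_config,  # assumption: supported by your ragas version
)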
The qwen-max model doesn't work either!
Same here in Thai language :[
I tried qwen-max and qwen-plus; they work only intermittently. I suspect rate limiting on the model platform is preventing any output.
This documentation shows how to generate a test set in a non-English language: https://docs.ragas.io/en/stable/howtos/customizations/testgenerator/_language_adaptation/#load-and-adapt-queries
I encountered the same issue with deepseek-chat.
import asyncio

from langchain_community.document_loaders import TextLoader
from langchain_core.callbacks import BaseCallbackHandler
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.testset import TestsetGenerator
from ragas.testset.persona import Persona
from ragas.testset.synthesizers.single_hop.specific import SingleHopSpecificQuerySynthesizer
from ragas.testset.transforms.extractors.llm_based import NERExtractor
from ragas.testset.transforms.splitters import HeadlineSplitter

# Callback that prints every prompt/response pair so we can see where generation stalls
class TestCallback(BaseCallbackHandler):
    def on_llm_start(self, serialized, prompts, **kwargs):
        print(f"**********Prompts*********:\n {prompts[0]}\n\n")

    def on_llm_end(self, response, **kwargs):
        print(f"**********Response**********:\n {response}\n\n")

llm = ChatOpenAI(model="deepseek-chat", base_url="https://api.deepseek.com/v1", callbacks=[TestCallback()])
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-m3", model_kwargs={"trust_remote_code": True})

loader = TextLoader("doc.txt", encoding="utf-8")
documents = loader.load()

personas = [
    Persona(
        name="Curious Student",
        role_description="A student who is curious about the world and wants to learn more about different cultures and languages",
    ),
]

generator_llm = LangchainLLMWrapper(llm)
generator_embeddings = LangchainEmbeddingsWrapper(embeddings)
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings, persona_list=personas)

# Adapt the single-hop synthesizer's prompts to Chinese
query = SingleHopSpecificQuerySynthesizer(llm=generator_llm)
prompts = asyncio.run(query.adapt_prompts("chinese", llm=generator_llm))
query.set_prompts(**prompts)

transforms = [HeadlineSplitter(), NERExtractor(llm=generator_llm)]
dist = [(query, 1.0)]
dataset = generator.generate_with_langchain_docs(documents, testset_size=1, transforms=transforms, query_distribution=dist)
**********Prompts*********:
Human: Given a list of themes and personas with their roles, associate each persona with relevant themes based on their role description.
Please return the output in a JSON format that complies with the following schema as specified in JSON Schema:
{"properties": {"mapping": {"additionalProperties": {"items": {"type": "string"}, "type": "array"}, "title": "Mapping", "type": "object"}}, "required": ["mapping"], "title": "PersonaThemesMapping", "type": "object"}Do not use single quotes in your response but double quotes,properly escaped with a backslash.
--------EXAMPLES-----------
Example 1
Input: {
"themes": [
"同理心",
"包容性",
"远程工作"
],
"personas": [
{
"name": "人力资源经理",
"role_description": "专注于包容性和员工支持。"
},
{
"name": "远程团队领导",
"role_description": "管理远程团队沟通。"
}
]
}
Output: {
"mapping": {
"HR Manager": [
"包容性",
"同理心"
],
"Remote Team Lead": [
"远程工作",
"同理心"
]
}
}
-----------------------------
Now perform the same with the following input
input: {
"themes": [
"RAGFlow",
"Docker",
"Elasticsearch",
"Infinity",
"MinIO",
"Redis",
"MySQL",
"HuggingFace",
"Python",
"Linux"
],
"personas": [
{
"name": "Curious Student",
"role_description": "A student who is curious about the world and wants to learn more about different cultures and languages"
}
]
}
Output:
Generating Scenarios: 100%|██████████| 1/1 [00:06<00:00, 6.75s/it]
**********Response**********:
generations=[[ChatGeneration(text='{\n "mapping": {\n "Curious Student": []\n }\n}', generation_info={'finish_reason': 'stop', 'logprobs': None}, message=AIMessage(content='{\n "mapping": {\n "Curious Student": []\n }\n}', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 17, 'prompt_tokens': 395, 'total_tokens': 412, 'completion_tokens_details': None, 'prompt_tokens_details': {'audio_tokens': None, 'cached_tokens': 384}, 'prompt_cache_hit_tokens': 384, 'prompt_cache_miss_tokens': 11}, 'model_name': 'deepseek-chat', 'system_fingerprint': 'fp_3a2571e1b4_prod0225', 'finish_reason': 'stop', 'logprobs': None}, id='run-00f8660d-4ff7-4a16-90fe-d43b6647e0e5-0', usage_metadata={'input_tokens': 395, 'output_tokens': 17, 'total_tokens': 412, 'input_token_details': {'cache_read': 384}, 'output_token_details': {}}))]] llm_output={'token_usage': {'completion_tokens': 17, 'prompt_tokens': 395, 'total_tokens': 412, 'completion_tokens_details': None, 'prompt_tokens_details': {'audio_tokens': None, 'cached_tokens': 384}, 'prompt_cache_hit_tokens': 384, 'prompt_cache_miss_tokens': 11}, 'model_name': 'deepseek-chat', 'system_fingerprint': 'fp_3a2571e1b4_prod0225'} run=None type='LLMResult'
Generating Samples: 0it [00:00, ?it/s]
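The response above appears to explain the empty output: the persona-theme mapping step returns an empty list for the persona, so no scenarios (and therefore no samples) are generated. The extracted themes (RAGFlow, Docker, Elasticsearch, ...) have nothing in common with a persona described as a student curious about cultures and languages. Note also that in the adapted example the output keys ("HR Manager", "Remote Team Lead") were not translated to match the Chinese persona names in the input, which may itself confuse the model. A possible workaround, sketched below, is to define a persona that plausibly overlaps the document's themes:

# Hypothetical persona aligned with the themes the extractor actually found;
# the goal is a non-empty persona-theme mapping so scenario generation can proceed.
personas = [
    Persona(
        name="DevOps Engineer",
        role_description="An engineer who deploys RAG systems with Docker, Elasticsearch, Redis, MySQL, and MinIO on Linux servers.",
    ),
]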
I have the same problem, but my dataset is in English. Ollama works fine, but with AzureOpenAI GPT-4 models and the LlamaIndex integration no samples are generated. Has anyone found a workaround?