RAGAS generate testset task fails during question generation: Documents appears to be too short (ie 100 tokens or less)
Issue Description
Question generation fails with the error "Documents appears to be too short (ie 100 tokens or less)". The input PDFs are Chinese-language reference documents, each about 1 MB in size and 30 to 60 pages long.
Code or Commands Executed
from evalscope.run import run_task
from evalscope.utils.logger import get_logger

logger = get_logger()

generate_testset_task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "RAGAS",
        "testset_generation": {
            "docs": [
                "/home/wuchen/gridqa/data/raw/demo/pdf/1.pdf",
                "/home/wuchen/gridqa/data/raw/demo/pdf/2.pdf",
                "/home/wuchen/gridqa/data/raw/demo/pdf/3.pdf",
            ],
            "test_size": 5,
            "output_file": "outputs/testset.json",
            "knowledge_graph": "outputs/knowledge_graph.json",
            "distribution": {"simple": 0.7, "multi_context": 0.2, "reasoning": 0.1},
            "generator_llm": {
                "api_base": "**************",
                "api_key": "************************",
            },
            "embeddings": {
                "model_name_or_path": "/home/wuchen/models/BAAI/bge-large-zh-v1___5",
            },
            "language": "chinese",
        },
    },
}

# Run task
run_task(task_cfg=generate_testset_task_cfg)
python eval_rag_gen.py
Error Log
2024-12-19 10:32:00,341 - datasets - INFO - PyTorch version 2.5.1 available.
2024-12-19 10:32:00,341 - datasets - INFO - Polars version 1.17.1 available.
2024-12-19 10:32:01,264 - evalscope - INFO - Args: Task config is provided with dictionary type.
2024-12-19 10:32:01,269 - evalscope - INFO - Dump task config to ./outputs/20241219_103201/configs/task_config_190d29.yaml
2024-12-19 10:32:01,270 - evalscope - INFO - {
"model": null,
"model_id": null,
"model_args": {
"revision": "master",
"precision": "torch.float16",
"device": "auto"
},
"template_type": null,
"chat_template": null,
"datasets": null,
"dataset_args": {},
"dataset_dir": "/home/wuchen/.cache/modelscope/datasets",
"dataset_hub": "modelscope",
"generation_config": {
"max_length": 2048,
"max_new_tokens": 512,
"do_sample": false,
"top_k": 50,
"top_p": 1.0,
"temperature": 1.0
},
"eval_type": "checkpoint",
"eval_backend": "RAGEval",
"eval_config": {
"tool": "RAGAS",
"testset_generation": {
"docs": [
"/home/wuchen/gridqa/data/raw/demo/pdf/1.pdf",
"/home/wuchen/gridqa/data/raw/demo/pdf/2.pdf",
"/home/wuchen/gridqa/data/raw/demo/pdf/3.pdf"
],
"test_size": 5,
"output_file": "outputs/testset.json",
"knowledge_graph": "outputs/knowledge_graph.json",
"distribution": {
"simple": 0.7,
"multi_context": 0.2,
"reasoning": 0.1
},
"generator_llm": {
"api_base": "******************",
"api_key": "********************************************"
},
"embeddings": {
"model_name_or_path": "/home/wuchen/models/BAAI/bge-large-zh-v1___5"
},
"language": "chinese"
}
},
"stage": "all",
"limit": null,
"mem_cache": false,
"use_cache": null,
"work_dir": "./outputs/20241219_103201",
"outputs": null,
"debug": false,
"dry_run": false,
"seed": 42
}
2024-12-19 10:32:01,729 - evalscope - INFO - Check `ragas` Installed
/home/wuchen/anaconda3/envs/evalscope/lib/python3.10/site-packages/evalscope/backend/rag_eval/ragas/tasks/testset_generation.py:72: LangChainDeprecationWarning: The class `UnstructuredFileLoader` was deprecated in LangChain 0.2.8 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-unstructured package and should be used instead. To use it run `pip install -U :class:`~langchain-unstructured` and import as `from :class:`~langchain_unstructured import UnstructuredLoader``.
loader = UnstructuredFileLoader(file_path, mode='single')
2024-12-19 10:32:15,340 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized
2024-12-19 10:37:05,288 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device_name: cuda
2024-12-19 10:37:05,288 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: /home/wuchen/models/BAAI/bge-large-zh-v1___5
Traceback (most recent call last):
File "/home/wuchen/gridqa/tests/eval_rag_gen.py", line 29, in <module>
run_task(task_cfg=generate_testset_task_cfg)
File "/home/wuchen/anaconda3/envs/evalscope/lib/python3.10/site-packages/evalscope/run.py", line 36, in run_task
return run_single_task(task_cfg, run_time)
File "/home/wuchen/anaconda3/envs/evalscope/lib/python3.10/site-packages/evalscope/run.py", line 49, in run_single_task
return run_non_native_backend(task_cfg)
File "/home/wuchen/anaconda3/envs/evalscope/lib/python3.10/site-packages/evalscope/run.py", line 81, in run_non_native_backend
backend_manager.run()
File "/home/wuchen/anaconda3/envs/evalscope/lib/python3.10/site-packages/evalscope/backend/rag_eval/backend_manager.py", line 71, in run
self.run_ragas(testset_args, eval_args)
File "/home/wuchen/anaconda3/envs/evalscope/lib/python3.10/site-packages/evalscope/backend/rag_eval/backend_manager.py", line 50, in run_ragas
generate_testset(TestsetGenerationArguments(**testset_args))
File "/home/wuchen/anaconda3/envs/evalscope/lib/python3.10/site-packages/evalscope/backend/rag_eval/ragas/tasks/testset_generation.py", line 93, in generate_testset
transforms = default_transforms(
File "/home/wuchen/anaconda3/envs/evalscope/lib/python3.10/site-packages/evalscope/backend/rag_eval/ragas/tasks/build_transform.py", line 133, in default_transforms
raise ValueError('Documents appears to be too short (ie 100 tokens or less). Please provide longer documents.')
ValueError: Documents appears to be too short (ie 100 tokens or less). Please provide longer documents.
Runtime Environment
- Operating System:
  - [ ] Windows
  - [ ] macOS
  - [x] Ubuntu
- Python Version:
  - [ ] 3.11
  - [x] 3.10
  - [ ] 3.9
Additional Information
evalscope 0.8.1, ragas 0.2.7, langchain 0.3.13
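This is a problem with how unstructured processes the documents. We will fix it.
evalscope v0.8.2 has been released and is compatible with ragas v0.2.9. To fully resolve this issue, you still need to convert the PDF files to txt format yourself, because the default unstructured processing may still have problems.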
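For anyone trying the PDF-to-txt route, here is a minimal sketch of one way to do the conversion. It assumes the third-party pypdf package (`pip install pypdf`), which is not mentioned anywhere in this thread, and the helper names `join_pages` and `pdf_to_txt` are hypothetical. It is not the conversion evalscope performs internally.

```python
from pathlib import Path


def join_pages(page_texts):
    """Join per-page text, dropping pages that extracted as empty."""
    cleaned = [t.strip() for t in page_texts if t and t.strip()]
    return "\n\n".join(cleaned)


def pdf_to_txt(pdf_path, txt_path):
    """Extract text from a PDF and write it to a UTF-8 .txt file.

    Returns the number of characters written.
    """
    # Assumption: pypdf is installed. The thread does not say which
    # converter to use, so this is only one possible choice.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    text = join_pages(page.extract_text() for page in reader.pages)
    Path(txt_path).write_text(text, encoding="utf-8")
    return len(text)
```

Calling `pdf_to_txt("1.pdf", "1.txt")` and then pointing `docs` at the resulting .txt file would be the intended usage. If the returned length is 0 or very small, the PDF probably contains scanned images rather than a text layer, and plain text extraction will not help without OCR.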
After upgrading and converting the PDFs to txt, the result is still the same.
Could you provide some sample data?
Error on 0.14:
2025-04-14 11:40:21,371 - evalscope - INFO - Check ragas Installed
Traceback (most recent call last):
File "/root/lgy/task-gen.py", line 28, in
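Could you provide some sample data?
The README_zh.md in the code repository itself also triggers this error, and changing its file extension to .txt does not help either.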
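Since the error also shows up for plain text input, one way to narrow it down might be to check how much text each input file actually contains before handing it to the task. The sketch below is a rough diagnostic only. The function name `length_report` and the `min_chars` threshold are hypothetical, and non-whitespace character count is just a crude proxy for the token count that the error message refers to.

```python
def length_report(named_texts, min_chars=200):
    """Return a list of (name, char_count, is_suspicious) tuples.

    min_chars is an arbitrary illustrative threshold, not the 100-token
    limit from the error message.
    """
    report = []
    for name, text in named_texts.items():
        # Count non-whitespace characters only.
        n = len("".join(text.split()))
        report.append((name, n, n < min_chars))
    return report
```

A possible usage is to read each file with `pathlib.Path(p).read_text(encoding="utf-8")`, pass the results in as a dict keyed by path, and inspect the flagged entries. If every file reports a large count and the error still occurs, the input length itself is probably not the cause.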