Langchain-Chatchat
Adding a file to the knowledge base fails when the uploaded .doc contains images
Problem Description: Adding a file to the knowledge base fails when the uploaded .doc contains images.
Steps to Reproduce: Upload a .doc file that contains screenshots to the knowledge base.
Expected Result: Both the source file and the vector store are saved successfully.
Actual Result: Saving to the vector store fails.
Environment Information
- langchain-ChatGLM version/commit: v0.2.10
- Deployed with Docker (yes/no): no
- Model used (ChatGLM2-6B / Qwen-7B, etc.): ChatGLM2-6B
- Embedding model used (moka-ai/m3e-base, etc.): m3e-base
- Vector store type (faiss / milvus / pg_vector, etc.): faiss
- Operating system and version: centos
- Python version: 3.10.12
- Other relevant environment information:
Additional Information: The error message looks like this: utils.py[line:383] - ERROR: PackageNotFoundError: error while loading documents from file xxx.doc: Package not found at 'path/Langchain-Chatchat_0_2_10/knowledge_base/xxx/content/xxx.doc'
Based on the error message you provided, it seems like the system is unable to find the .doc file you're trying to upload. This could be due to several reasons, including file format compatibility, missing dependencies, or even a typo in the code.
Here are some potential solutions based on similar issues found in the Langchain-Chatchat repository:
- Convert the .doc file to .docx or PDF: According to a suggestion in the issue [BUG] 添加知识库文件doc,docx文件失败 ("Adding doc/docx files to the knowledge base fails"), converting the .doc file to .docx or PDF might resolve the issue, because the system may handle these formats better.
- Check for a typo in the code: The same issue also mentions a potential typo in server/knowledge_base/utils.py, line 106. The correct line should be "UnstructuredWordDocumentLoader": ['.docx', '.doc'], instead of "UnstructuredWordDocumentLoader": ['.docx', 'doc']. With the leading dot missing, the system would not recognize .doc files.
- Update the project code and dependencies: The issue 自定义知识库 上传、删除等报错 ("Errors when uploading to or deleting from a custom knowledge base") suggests updating the project code and dependencies, especially langchain/fschat. This could resolve compatibility issues that might be causing the error.
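To make the typo in the second suggestion concrete, here is a minimal sketch of why a missing dot matters. It assumes the extension is taken from the filename with its leading dot (as `pathlib.Path.suffix` does); the names `BROKEN`, `FIXED`, and `is_supported` are hypothetical and only model the single line quoted above, not the project's real table.

```python
from pathlib import Path

# Hypothetical one-entry mappings modelled on the line quoted above.
BROKEN = {"UnstructuredWordDocumentLoader": [".docx", "doc"]}   # dot missing
FIXED = {"UnstructuredWordDocumentLoader": [".docx", ".doc"]}


def is_supported(filename, mapping):
    """Return True if any loader in `mapping` lists the file's extension."""
    ext = Path(filename).suffix.lower()  # Path.suffix keeps the leading dot
    return any(ext in extensions for extensions in mapping.values())


print(is_supported("report.doc", BROKEN))  # the entry "doc" never equals ".doc"
print(is_supported("report.doc", FIXED))
```

Under that assumption, a .doc file is silently treated as unsupported with the broken entry, while .docx files are unaffected.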
If none of these solutions work, it would be helpful to have more information about the steps you took before encountering the error, as well as any other error messages you might have received. This will help in diagnosing the problem more accurately.
Sources
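- [BUG] 添加知识库文件doc,docx文件失败 ("Adding doc/docx files to the knowledge base fails")
- 自定义知识库 上传、删除等报错 ("Errors when uploading to or deleting from a custom knowledge base")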
- knowledge_base/samples/content/test_files/langchain-ChatGLM_closed.jsonl
- webui_pages/utils.py
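I have already read those. The typo from the second point (server/knowledge_base/utils.py line 106, where the correct line should be "UnstructuredWordDocumentLoader": ['.docx', '.doc']) is already fixed in v0.2.10, the version where I hit this problem. The bug behaves as follows: a .doc that contains no images uploads successfully, while a .doc that contains images raises the exception above.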
Hi, how did you solve this?
Not solved yet, comrade.
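After further investigation, I found the cause: because entries are tried in order, the loader used by default is the "RapidOCRDocLoader": ['.docx', '.doc'] entry of LOADER_DICT in server/knowledge_base/utils.py, and that loader is the one that raises the error. After commenting out that entry, the "UnstructuredWordDocumentLoader": ['.docx', '.doc'] loader is used instead, and .doc files containing images can be uploaded.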
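The ordering effect described in this comment can be sketched as follows. This is an illustration under the assumption that the loader is chosen by walking the dictionary in insertion order and returning the first entry whose extension list matches; `LOADER_DICT_SKETCH`, `pick_loader`, and `without_ocr` are hypothetical names, and only the two entries quoted in the thread are included.

```python
import os

# Only the two entries quoted in the thread; the real table has more.
LOADER_DICT_SKETCH = {
    "RapidOCRDocLoader": [".docx", ".doc"],
    "UnstructuredWordDocumentLoader": [".docx", ".doc"],
}


def pick_loader(filename, loader_dict):
    """Return the name of the first loader that lists the file's extension."""
    ext = os.path.splitext(filename)[-1].lower()
    for loader_name, extensions in loader_dict.items():  # insertion order
        if ext in extensions:
            return loader_name
    return None


# Equivalent to commenting out the RapidOCRDocLoader entry.
without_ocr = {
    name: exts
    for name, exts in LOADER_DICT_SKETCH.items()
    if name != "RapidOCRDocLoader"
}

print(pick_loader("manual.doc", LOADER_DICT_SKETCH))
print(pick_loader("manual.doc", without_ocr))
```

With both entries present the earlier one always wins for .doc and .docx, so the later loader is only reached once the earlier entry is removed or reordered.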
That does not actually solve it. .docx files also fail some of the time, and both RapidOCRDocLoader and UnstructuredWordDocumentLoader run into this error.
So is this still unresolved? Does a given .doc file fail only some of the time, or does a problematic .doc file fail every time?
You need to use .docx.
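One possible explanation for the advice to use .docx, offered as an assumption rather than a confirmed diagnosis: python-docx raises PackageNotFoundError with the message "Package not found at ..." when the path it is given is neither a directory nor a zip archive, and a legacy binary .doc is an OLE2 compound file, not a zip. If a loader opens the upload with python-docx, any file that is not a real .docx would fail this way regardless of its extension. The sketch below uses only the standard library to check what a file actually is before uploading; `sniff_word_format` is a hypothetical helper, not part of the project.

```python
import zipfile
from pathlib import Path

# Signature at the start of legacy binary Office files (OLE2 compound format).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_format(path):
    """Classify a file by content: 'docx', 'legacy-doc', or 'unknown'."""
    path = Path(path)
    with path.open("rb") as handle:
        header = handle.read(8)
    if header == OLE2_MAGIC:
        return "legacy-doc"
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            # A Word OOXML package stores its main part at this path.
            if "word/document.xml" in archive.namelist():
                return "docx"
    return "unknown"
```

A file reported as "legacy-doc" or "unknown" would need converting to a real .docx first, for example by re-saving it from a word processor. This check does not explain the intermittent .docx failures mentioned earlier in the thread.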