Langchain-Chatchat

[BUG] Concise problem statement / python init_database.py --recreate-vs reports an error

Open youngsasa2021 opened this issue 11 months ago • 1 comment

Problem Description: Describe the problem in a clear and concise manner.

Steps to Reproduce

Run python init_database.py --recreate-vs (before this I followed the documented steps exactly and only changed the model and embedding paths in model_config)
Click '...'
Scroll to '...'
The problem appears:

Traceback (most recent call last):
  File "/home/ymr/Langchain-Chatchat-master/Langchain-Chatchat-master/init_database.py", line 107, in <module>
    folder2db(kb_names=args.kb_name, mode="recreate_vs", embed_model=args.embed_model)
  File "/home/ymr/Langchain-Chatchat-master/Langchain-Chatchat-master/server/knowledge_base/migrate.py", line 128, in folder2db
    files2vs(kb_name, kb_files)
  File "/home/ymr/Langchain-Chatchat-master/Langchain-Chatchat-master/server/knowledge_base/migrate.py", line 110, in files2vs
    print(f"正在将 {kb_name}/{filename} 添加到向量库，共包含{len(docs)}条文档")
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 16-17: surrogates not allowed

Expected Result: the files are stored in the database successfully.

Actual Result: a UnicodeEncodeError is raised.

Environment Information

langchain-ChatGLM version/commit: the version downloaded from https://github.com/THUDM/ChatGLM-6B
Deployed with Docker (yes/no): no
Model used (ChatGLM2-6B / Qwen-7B, etc.): ChatGLM-6B
Embedding model used (moka-ai/m3e-base, etc.): bge-large-zh
Vector store type (faiss / milvus / pg_vector, etc.): the default
Operating system and version: lab server, dell-PowerEdge-R740
Python version: python3.10
Other relevant environment information:

Mar 14 '24 06:03 youngsasa2021

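Hey @youngsasa2021! Good to see you again, and I hope you are doing well today. Run into a new challenge?

Based on the information you provided, you hit a UnicodeEncodeError while running python init_database.py --recreate-vs. This error is usually raised when Python tries to process a string containing characters that the system's default encoding does not support.

In your case, the possible causes include:

File paths and data: if the paths you changed in model_config contain non-ASCII characters, or the data being processed (for example, the documents being imported into the database) contains such characters, and your system's default encoding cannot handle them, a UnicodeEncodeError may result.
Printing to the console: the script contains print statements (for example, print("recreating all vector stores")). If the output contains non-ASCII characters, either directly in the print statement or in a variable being printed, and the console does not support them, an error may result.
External file and database operations: the script interacts with external files and the database (for example, through folder2db, prune_db_docs, and prune_folder_files). If these operations encounter non-ASCII characters in file names, file contents, or database entries, and those characters are not handled properly, a UnicodeEncodeError may result.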
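
The "surrogates not allowed" wording in the traceback suggests a narrower cause than non-ASCII text in general: the UTF-8 codec refuses lone surrogate code points (U+D800 to U+DFFF), and on POSIX systems Python produces such code points when it decodes a file name whose bytes are not valid in the filesystem encoding (the surrogateescape error handler). The sketch below reproduces that situation without Langchain-Chatchat. The GBK-encoded file name and the helper has_lone_surrogates are hypothetical and only serve as an illustration; whether the knowledge base folder in this issue actually contains such a file name is an assumption.

```python
# Hypothetical file name whose bytes were written in GBK rather than UTF-8.
raw_name = "文档.txt".encode("gbk")

# Decoding with surrogateescape mirrors what Python does for file names on a
# UTF-8 POSIX system: each undecodable byte becomes a lone surrogate.
name = raw_name.decode("utf-8", errors="surrogateescape")


def has_lone_surrogates(text: str) -> bool:
    """Return True if text cannot be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


# The surrogate-escaped name fails strict UTF-8 encoding, which is the same
# failure print() hits when sys.stdout uses the utf-8 codec.
problem = has_lone_surrogates(name)

# The original bytes are still recoverable, so the name can be repaired if the
# real encoding (assumed here to be GBK) is known.
recovered = name.encode("utf-8", errors="surrogateescape").decode("gbk")
```

Running a check like has_lone_surrogates over the file names in the knowledge base folder would show which files, if any, trigger the error in the print call.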

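To resolve this, make sure your environment is set up to use a Unicode-friendly encoding such as UTF-8. Before running the script, you can set the Python environment variable PYTHONIOENCODING to utf-8: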

export PYTHONIOENCODING=utf-8
python init_database.py --recreate-vs
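
One caveat: the traceback already names the 'utf-8' codec, so selecting UTF-8 alone may not change the outcome. PYTHONIOENCODING also accepts an error handler after a colon (for example utf-8:backslashreplace), and the same effect can be had inside a script with TextIOWrapper.reconfigure (Python 3.7+). The following is a minimal sketch under that assumption; make_tolerant is a hypothetical helper, and an in-memory stream stands in for sys.stdout so the result can be inspected.

```python
import io


def make_tolerant(stream: io.TextIOWrapper) -> None:
    """Keep the stream's encoding but escape unencodable characters
    instead of raising UnicodeEncodeError."""
    stream.reconfigure(errors="backslashreplace")


# In-memory stand-in for sys.stdout with a strict utf-8 codec.
buffer = io.BytesIO()
fake_stdout = io.TextIOWrapper(buffer, encoding="utf-8")
make_tolerant(fake_stdout)

# A message containing a lone surrogate, as a surrogate-escaped file name would.
print("adding samples/\udcff.txt to the vector store", file=fake_stdout)
fake_stdout.flush()
output = buffer.getvalue()
```

In a real run the equivalent call would be sys.stdout.reconfigure(errors="backslashreplace") before the first print. This only keeps the log line from crashing; it does not repair the underlying file name.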

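Alternatively, you can modify the script to handle encoding explicitly. For example, specify the encoding whenever you open a file:

with open('somefile.txt', 'r', encoding='utf-8') as f:
    content = f.read()  # your code here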

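In addition, I found a similar issue, #2756, in the Langchain-Chatchat issue tracker. In that issue, a user hit a similar UnicodeEncodeError while processing file contents. Their solution was to ignore any bytes that are not valid UTF-8 when reading the file contents: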

file_content = str(file.file.read(), encoding='utf-8', errors='ignore').encode('utf-8')

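This code handles the encoding error by dropping any bytes that cannot be decoded as UTF-8 when the file contents are read. If the file contains many such bytes, this approach can lose data, but it lets the program run without raising a UnicodeEncodeError.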
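
To make the data-loss trade-off concrete, the sketch below compares errors='ignore' with errors='replace' on a hypothetical byte string that was written in GBK but is read as UTF-8. The sample text and the GBK assumption are illustrative only and do not come from the issue.

```python
# Hypothetical file contents: Chinese text encoded as GBK, followed by ASCII.
raw = "数据库".encode("gbk") + b" ok"

# errors="ignore" silently drops every byte that is not valid UTF-8.
ignored = str(raw, encoding="utf-8", errors="ignore")

# errors="replace" marks each undecodable sequence with U+FFFD, so the loss
# stays visible in the resulting text.
replaced = raw.decode("utf-8", errors="replace")

# If the real encoding is known, decoding with it loses nothing.
correct = raw.decode("gbk")
```

With 'ignore' the ASCII tail survives but the Chinese text does not, and nothing in the result signals the loss; 'replace' keeps a marker where bytes were dropped, which may be easier to debug.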

I hope this information helps. If the problem persists, or if you have other questions, feel free to ask.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Mar 14 '24 06:03 dosubot[bot]
