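Langchain-Chatchat: batch document upload bug

When uploading documents in batch, a single document that cannot be added to the vector store aborts the entire upload. Batch uploads also sometimes drive memory usage so high that the process is interrupted.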

Feb 25 '24 17:02 goodmaney

🤖

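Hey @goodmaney! Welcome back. It's always good to see a new challenge from you. Hopefully we can find a solution to this one as well!

Regarding the two problems you describe, where a single unreadable document aborts the whole batch upload and where batch uploads can cause excessive memory usage, I have the following suggestions:

Error handling: the existing code already implements some error handling. For example, the save_file function wraps reading the uploaded content and writing the new file in a try-except block that catches any exception. On error, it logs the failure and returns a dictionary containing the error message, so a failure while saving one file to disk does not interrupt the rest of the batch.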

try:
    ...
    file_content = file.file.read()  # read the uploaded file's content
    ...
    with open(file_path, "wb") as f:
        f.write(file_content)  # write the content to the new file
    ...
except Exception as e:
    msg = f"{filename} 文件上传失败，错误信息: {e}"
    logger.error(f'{e.__class__.__name__}: {msg}',
                 exc_info=e if log_verbose else None)
    return dict(code=500, msg=msg, data=data)
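The same isolation idea can be applied to any per-document step, including the one that adds documents to the vector store. The following is a minimal sketch, not code from Langchain-Chatchat: process_each and fake_vectorize are hypothetical names, and fake_vectorize only stands in for whatever call actually embeds and stores one document.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def process_each(
    items: Iterable[str],
    handler: Callable[[str], None],
) -> Tuple[List[str], Dict[str, str]]:
    """Run handler on every item, collecting failures instead of raising."""
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for item in items:
        try:
            handler(item)
        except Exception as exc:  # confine the failure to this one item
            failed[item] = f"{exc.__class__.__name__}: {exc}"
        else:
            succeeded.append(item)
    return succeeded, failed


def fake_vectorize(filename: str) -> None:
    """Hypothetical stand-in for adding one document to a vector store."""
    if filename.endswith(".bad"):
        raise ValueError(f"cannot parse {filename}")


ok, failed = process_each(["a.txt", "b.bad", "c.txt"], fake_vectorize)
print(ok)      # files that were processed
print(failed)  # filename -> error message, suitable for a failed_files report
```

With this shape, one bad document ends up in the failure report while the remaining documents are still processed.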

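Memory management: if memory usage is too high during batch upload, you may need a memory management strategy. One option is to read and write each file in chunks instead of all at once, which reduces peak memory usage. However, the code shown does not implement this.

Note that these are general strategies and may need adjusting for your use case. If you hit specific problems during batch upload, you may need additional error handling or memory management.

For the problems you mention, I suggest modifying the upload_docs function to improve error handling and memory management. This is the function Langchain-Chatchat currently uses for batch document upload. It accepts a list of files to upload, the name of the target knowledge base, and several other parameters. The files are first saved to disk by _save_files_in_thread, which uses multiple threads to save the uploaded files into the corresponding knowledge base directory. If to_vector_store is True, the saved files are then vectorized.

Here is the relevant code: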
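As an illustration of chunked reading and writing, here is a minimal sketch. copy_in_chunks is a hypothetical helper, not part of Langchain-Chatchat, and the 1 MiB default chunk size is an arbitrary assumption.

```python
import io
from typing import BinaryIO


def copy_in_chunks(src: BinaryIO, dst: BinaryIO, chunk_size: int = 1024 * 1024) -> int:
    """Copy src to dst one chunk at a time and return the number of bytes copied.

    Only one chunk is held in memory at any moment, instead of the whole file.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# Demonstration with in-memory streams standing in for an upload and a file on disk.
source = io.BytesIO(b"x" * 2500)
target = io.BytesIO()
copied = copy_in_chunks(source, target, chunk_size=1024)
```

In the snippet above, this would mean passing file.file as src and the opened destination file as dst instead of calling file.file.read() once. The standard library's shutil.copyfileobj performs the same kind of chunked copy.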

def upload_docs(
        files: List[UploadFile] = File(..., description="上传文件，支持多文件"),
        knowledge_base_name: str = Form(..., description="知识库名称", examples=["samples"]),
        override: bool = Form(False, description="覆盖已有文件"),
        to_vector_store: bool = Form(True, description="上传文件后是否进行向量化"),
        chunk_size: int = Form(CHUNK_SIZE, description="知识库中单段文本最大长度"),
        chunk_overlap: int = Form(OVERLAP_SIZE, description="知识库中相邻文本重合长度"),
        zh_title_enhance: bool = Form(ZH_TITLE_ENHANCE, description="是否开启中文标题加强"),
        docs: Json = Form({}, description="自定义的docs，需要转为json字符串",
                          examples=[{"test.txt": [Document(page_content="custom doc")]}]),
        not_refresh_vs_cache: bool = Form(False, description="暂不保存向量库（用于FAISS）"),
) -> BaseResponse:
    """
    API endpoint: upload files and/or vectorize them
    """
    ...
    # First save the uploaded files to disk
    for result in _save_files_in_thread(files, knowledge_base_name=knowledge_base_name, override=override):
        ...
    # Vectorize the saved files
    if to_vector_store:
        ...
    return BaseResponse(code=200, msg="文件上传与向量化完成", data={"failed_files": failed_files})
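One way to bound the work done at once is to vectorize the saved files in small batches rather than all together. The sketch below shows only the batching helper. batched is a hypothetical name, and whether batching actually lowers memory usage in Langchain-Chatchat depends on how its vectorization step loads documents, which is an assumption here.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


file_names = ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"]
for batch in batched(file_names, 2):
    # A real implementation would vectorize just this batch here,
    # recording per-file failures before moving on to the next batch.
    print(batch)
```

A smaller batch size trades throughput for a lower peak in the number of documents handled at once.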

I hope this helps. If you have further questions or need more help, feel free to ask.

Sources

server/knowledge_base/kb_doc_api.py


Feb 25 '24 17:02 dosubot[bot]

Yes, this open-source project is built to process uploads sequentially, and we have not considered this case for now.

Feb 28 '24 04:02 zRzRzRzRzRzRzR
