Error when adding multiple files to a knowledge base: 'NoneType' object has no attribute 'lower'

Open zixiaotan21 opened this issue 1 year ago • 7 comments

Problem Description

Adding multiple files to a knowledge base fails with 'NoneType' object has no attribute 'lower'.

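Steps to Reproduce

  1. Use the text splitter MarkdownHeaderTextSplitter.
  2. Add one txt file to the knowledge base; this works.
  3. Add a second txt file to the knowledge base; this fails with 'NoneType' object has no attribute 'lower'.
  4. Delete a file from the knowledge base; this fails with the same error, 'NoneType' object has no attribute 'lower'.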

Expected Result

No error.

Actual Result

INFO: 127.0.0.1:51540 - "POST /knowledge_base/upload_docs HTTP/1.1" 500 Internal Server Error
2024-04-07 10:02:22,161 - _client.py[line:1027] - INFO: HTTP Request: POST http://127.0.0.1:7861/knowledge_base/upload_docs "HTTP/1.1 500 Internal Server Error"
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\uvicorn\protocols\http\h11_impl.py", line 408, in run_asgi
    result = await app( # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\uvicorn\middleware\proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\fastapi\applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\middleware\errors.py", line 186, in __call__
    raise exc
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\middleware\errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\middleware\exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\routing.py", line 762, in __call__
    await self.middleware_stack(scope, receive, send)
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\routing.py", line 782, in app
    await route.handle(scope, receive, send)
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\routing.py", line 72, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\fastapi\routing.py", line 299, in app
    raise e
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\fastapi\routing.py", line 294, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\fastapi\routing.py", line 193, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\concurrency.py", line 40, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\anyio\to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\anyio\_backends\_asyncio.py", line 2144, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\anyio\_backends\_asyncio.py", line 851, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\butterfly\Langchain-Chatchat\server\knowledge_base\kb_doc_api.py", line 170, in upload_docs
    result = update_docs(
             ^^^^^^^^^^^^
  File "C:\Users\butterfly\Langchain-Chatchat\server\knowledge_base\kb_doc_api.py", line 287, in update_docs
    kb.update_doc(kb_file, not_refresh_vs_cache=True)
  File "C:\Users\butterfly\Langchain-Chatchat\server\knowledge_base\kb_service\base.py", line 164, in update_doc
    self.delete_doc(kb_file, **kwargs)
  File "C:\Users\butterfly\Langchain-Chatchat\server\knowledge_base\kb_service\base.py", line 144, in delete_doc
    self.do_delete_doc(kb_file, **kwargs)
  File "C:\Users\butterfly\Langchain-Chatchat\server\knowledge_base\kb_service\faiss_kb_service.py", line 91, in do_delete_doc
    ids = [k for k, v in vs.docstore._dict.items() if v.metadata.get("source").lower() == kb_file.filename.lower()]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\butterfly\Langchain-Chatchat\server\knowledge_base\kb_service\faiss_kb_service.py", line 91, in <listcomp>
    ids = [k for k, v in vs.docstore._dict.items() if v.metadata.get("source").lower() == kb_file.filename.lower()]
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'lower'

Environment Information

  • langchain-ChatGLM version/commit: v0.2.10
  • Deployed with Docker (yes/no): no
  • Model used (ChatGLM2-6B / Qwen-7B, etc.): zhipu-api
  • Embedding model used (moka-ai/m3e-base, etc.): bge-large-zh-v1.5
  • Vector store type (faiss / milvus / pg_vector, etc.): faiss
  • Operating system and version: Windows-10-10.0.22631-SP0
  • Python version: 3.11.8
  • Other relevant environment information:

Additional Information

@zRzRzRzRzRzRzR Hello maintainer, is this error caused by my use of the MarkdownHeaderTextSplitter text splitter?

zixiaotan21 avatar Apr 07 '24 02:04 zixiaotan21

Hey @zixiaotan21, great to see you back here wrestling with the code! 🐙

Based on the error message you provided, it seems like the issue is not directly related to the MarkdownHeaderTextSplitter tokenizer. The error 'NoneType' object has no attribute 'lower' typically occurs when an operation is attempted on a None object that is expected to be a string (or another type that has the lower method).

In the context of Langchain-Chatchat, there are several places where .lower() is called on file extensions or paths, particularly in the KnowledgeFile class and the list_files_from_folder function. If any of these locations receive a None value instead of a string, it could lead to this error.

Here are a few things you could check:

  • Ensure that all file paths and names are correctly retrieved and are not None before performing operations like .lower().
  • Verify that the ext attribute of a KnowledgeFile object is always a string. Note that os.path.splitext(filename)[-1].lower() yields an empty string, not None, when the filename has no extension, so this path is an unlikely source of the error.
  • Check the logic in list_files_from_folder and other places where .lower() is used to ensure that the objects being operated on are not None.
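
As a quick sanity check on the extension case in the list above: in standard Python, `os.path.splitext` returns an empty string rather than `None` for a name with no extension, so calling `.lower()` on it should not raise. A minimal sketch (the file names are made up for illustration):

```python
import os

# Hypothetical file names, chosen only to cover the interesting cases.
for name in ["notes.TXT", "README", ".gitignore"]:
    ext = os.path.splitext(name)[-1].lower()
    print(repr(name), "->", repr(ext))

# A name without an extension gives "", which is still a str.
ext_without_suffix = os.path.splitext("README")[-1].lower()
```

If this holds in your environment too, the extension handling is probably not where the `None` comes from.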

In the faiss_kb_service.py file, the .lower() method that could potentially receive a None object is called in the do_delete_doc method of the FaissKBService class. This occurs in the line:

ids = [k for k, v in vs.docstore._dict.items() if v.metadata.get("source").lower() == kb_file.filename.lower()]

Here, v.metadata.get("source") could return None if the "source" key does not exist in the metadata dictionary, and calling .lower() on None would raise an AttributeError.
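
The failure mode described here can be reproduced without FAISS at all. The sketch below uses a plain dict and a hypothetical `FakeDoc` class as stand-ins for `vs.docstore._dict` and its stored documents; only the comprehension mirrors the line from the traceback.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a stored document chunk."""
    metadata: dict = field(default_factory=dict)


# One chunk carries "source"; the other has metadata without that key.
store = {
    "id-1": FakeDoc({"source": "a.txt"}),
    "id-2": FakeDoc({"Header 1": "Intro"}),
}


def find_ids_unsafe(store, filename):
    # Same shape as the failing comprehension: .get() returns None for
    # the chunk without "source", and None has no .lower().
    return [k for k, v in store.items()
            if v.metadata.get("source").lower() == filename.lower()]


try:
    find_ids_unsafe(store, "a.txt")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

A single chunk without `"source"` is enough to break every later delete or update, which would match the reports that the first upload works and everything after it fails.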

As for adding multiple files to the knowledge base, there is a specific order or process to follow. The process involves several steps such as selecting or creating a knowledge base, uploading files, configuring file processing settings, adding the files to the knowledge base, and optionally managing the files and the knowledge base afterward.

In summary, the error does not appear to come from the MarkdownHeaderTextSplitter tokenizer directly. The traceback points to do_delete_doc, where a stored document's metadata has no "source" value. Guarding the .lower() call against None, or making sure "source" is always set, is the recommended place to start.

I hope this helps! If you need further assistance, please provide more details about the specific steps you took and the full traceback of the error.

dosubot[bot] avatar Apr 07 '24 02:04 dosubot[bot]

No, this has nothing to do with the text splitter.

zRzRzRzRzRzRzR avatar Apr 09 '24 15:04 zRzRzRzRzRzRzR

same bug, wait for the answer

baonguyenduy2002 avatar Apr 11 '24 10:04 baonguyenduy2002

Same bug.

FlaringR avatar Apr 30 '24 08:04 FlaringR

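Same problem. After switching the text splitter back to the default, uploading the first file works, but deleting it raises this error too, and no further files can be uploaded afterwards.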

GLY-123 avatar May 08 '24 06:05 GLY-123

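"Same problem. After switching the text splitter back to the default, uploading the first file works, but deleting it raises this error too, and no further files can be uploaded afterwards."

A custom text splitter may cause this problem. The failing line is ids = [k for k, v in vs.docstore._dict.items() if v.metadata.get("source").lower() == kb_file.filename.lower()]. The likely cause is that metadata["source"] is not given a correct value when a custom text splitter is used. When processing the metadata, you can add metadata["source"] = self.filepath.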

GLY-123 avatar May 13 '24 12:05 GLY-123

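A workaround is to rewrite MarkdownHeaderTextSplitter so that metadata['source'] is carried through when docs = text_splitter.split_text(docs[0].page_content) is called, for example: docs = text_splitter.split_text(docs[0].page_content, docs[0].metadata['source'])

That said, MarkdownHeaderTextSplitter is only loosely integrated with the system, so you need to rewrite it yourself.

UnstructuredMarkdownLoader strips the ## markers from the document, so this splitter ends up having no effect.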

9DemonFox avatar May 14 '24 07:05 9DemonFox

I gave it a try, and it seems that changing the following line in faiss_kb_service.py:

ids = [k for k, v in vs.docstore._dict.items() if v.metadata.get("source").lower() == kb_file.filename.lower()]

to the following fixes it:

            ids = [k for k, v in vs.docstore._dict.items() if v.metadata.get("source") and v.metadata.get("source").lower() == kb_file.filename.lower()]

Could others try this and see whether it works for them?
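
The same guard can be written as a small helper that is easier to read and test. This is only a sketch: `matching_ids` is a hypothetical name, and `SimpleNamespace` objects stand in for the documents held in `vs.docstore._dict`.

```python
from types import SimpleNamespace


def matching_ids(docstore_dict, filename):
    """Return ids whose metadata "source" equals filename, ignoring case.

    Entries whose "source" is missing or not a string are skipped
    instead of raising AttributeError.
    """
    target = filename.lower()
    ids = []
    for key, doc in docstore_dict.items():
        source = doc.metadata.get("source")
        if isinstance(source, str) and source.lower() == target:
            ids.append(key)
    return ids


# Hypothetical stand-ins for stored chunks.
store = {
    "id-1": SimpleNamespace(metadata={"source": "A.txt"}),
    "id-2": SimpleNamespace(metadata={"Header 1": "Intro"}),
    "id-3": SimpleNamespace(metadata={"source": None}),
    "id-4": SimpleNamespace(metadata={"source": "b.txt"}),
}

ids = matching_ids(store, "a.TXT")
print(ids)
```

One caveat with any guard of this kind: chunks that have no "source" are never matched, so they are never deleted and stay behind in the vector store. The guard stops the crash, but the chunks still need a "source" value if they are to be cleaned up with their file.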

zixiaotan21 avatar May 26 '24 08:05 zixiaotan21

Same bug.

guanhd avatar Jun 18 '24 08:06 guanhd

I worked around it, and files can now be added and deleted repeatedly. Modify knowledge_base/utils.py: after text_splitter.split_text, iterate over docs and set metadata["source"] on each one. Tested and working. Change

docs = text_splitter.split_text(docs[0].page_content)

to

source = os.path.basename(docs[0].metadata['source'])
docs = text_splitter.split_text(docs[0].page_content)
for doc in docs:
    if doc.metadata:
        doc.metadata["source"] = source
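
For reference, here is the loop above as a self-contained function that can be run outside the project. `Chunk` and `stamp_source` are hypothetical names used only for illustration, and the path is made up. One deliberate difference: the `if doc.metadata:` check skips chunks whose metadata is an empty dict, since an empty dict is falsy in Python, so this sketch stamps every chunk instead. Whether the splitter actually emits chunks with empty metadata in this setup is an assumption worth checking.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a split document chunk."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def stamp_source(chunks, source_path):
    """Set metadata["source"] to the file's base name on every chunk."""
    source = os.path.basename(source_path)
    for chunk in chunks:
        if chunk.metadata is None:
            chunk.metadata = {}
        # No truthiness check here: chunks with {} get a source too.
        chunk.metadata["source"] = source
    return chunks


chunks = [
    Chunk("text before any header"),          # empty metadata
    Chunk("body text", {"Header 1": "Usage"}),
]
stamp_source(chunks, "/data/kb/docs/guide.md")
print([c.metadata for c in chunks])
```

Using the base name keeps the stored value comparable with `kb_file.filename` in the delete comparison shown in the traceback.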

guanhd avatar Jun 20 '24 01:06 guanhd