Langchain-Chatchat
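Error when adding multiple files to the knowledge base: 'NoneType' object has no attribute 'lower'
Problem Description: Adding multiple files to the knowledge base fails with 'NoneType' object has no attribute 'lower'.
Steps to Reproduce
- Use the text splitter MarkdownHeaderTextSplitter.
- Add one txt file to the knowledge base: this works.
- Add a second txt file to the knowledge base: this fails with 'NoneType' object has no attribute 'lower'.
- Delete a file from the knowledge base: this fails with the same 'NoneType' object has no attribute 'lower' error.
Expected Result: No error.
Actual Result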
INFO: 127.0.0.1:51540 - "POST /knowledge_base/upload_docs HTTP/1.1" 500 Internal Server Error
2024-04-07 10:02:22,161 - _client.py[line:1027] - INFO: HTTP Request: POST http://127.0.0.1:7861/knowledge_base/upload_docs "HTTP/1.1 500 Internal Server Error"
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\uvicorn\protocols\http\h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\uvicorn\middleware\proxy_headers.py", line 69, in call
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\fastapi\applications.py", line 1054, in call
await super().call(scope, receive, send)
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\applications.py", line 123, in call
await self.middleware_stack(scope, receive, send)
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\middleware\errors.py", line 186, in call
raise exc
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\middleware\errors.py", line 164, in call
await self.app(scope, receive, _send)
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\middleware\exceptions.py", line 62, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette_exception_handler.py", line 64, in wrapped_app
raise exc
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\routing.py", line 762, in call
await self.middleware_stack(scope, receive, send)
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\routing.py", line 782, in app
await route.handle(scope, receive, send)
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\routing.py", line 297, in handle
await self.app(scope, receive, send)
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette_exception_handler.py", line 64, in wrapped_app
raise exc
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\routing.py", line 72, in app
response = await func(request)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\fastapi\routing.py", line 299, in app
raise e
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\fastapi\routing.py", line 294, in app
raw_response = await run_endpoint_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\fastapi\routing.py", line 193, in run_endpoint_function
return await run_in_threadpool(dependant.call, **values)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\starlette\concurrency.py", line 40, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\anyio\to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\anyio_backends_asyncio.py", line 2144, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File "C:\Users\butterfly\anaconda3\envs\chatchat_env\Lib\site-packages\anyio_backends_asyncio.py", line 851, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\butterfly\Langchain-Chatchat\server\knowledge_base\kb_doc_api.py", line 170, in upload_docs
result = update_docs(
^^^^^^^^^^^^
File "C:\Users\butterfly\Langchain-Chatchat\server\knowledge_base\kb_doc_api.py", line 287, in update_docs
kb.update_doc(kb_file, not_refresh_vs_cache=True)
File "C:\Users\butterfly\Langchain-Chatchat\server\knowledge_base\kb_service\base.py", line 164, in update_doc
self.delete_doc(kb_file, **kwargs)
File "C:\Users\butterfly\Langchain-Chatchat\server\knowledge_base\kb_service\base.py", line 144, in delete_doc
self.do_delete_doc(kb_file, **kwargs)
File "C:\Users\butterfly\Langchain-Chatchat\server\knowledge_base\kb_service\faiss_kb_service.py", line 91, in do_delete_doc
ids = [k for k, v in vs.docstore._dict.items() if v.metadata.get("source").lower() == kb_file.filename.lower()]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\butterfly\Langchain-Chatchat\server\knowledge_base\kb_service\faiss_kb_service.py", line 91, in
Environment Information
- langchain-ChatGLM version/commit: v0.2.10
- Deployed with Docker (yes/no): no
- Model used (ChatGLM2-6B / Qwen-7B, etc.): zhipu-api
- Embedding model used (moka-ai/m3e-base, etc.): bge-large-zh-v1.5
- Vector store type (faiss / milvus / pg_vector, etc.): faiss
- Operating system and version: Windows-10-10.0.22631-SP0
- Python version: 3.11.8
- Other relevant environment information:
Additional Information: @zRzRzRzRzRzRzR Hello maintainer, is this error caused by my use of the MarkdownHeaderTextSplitter text splitter?
Hey @zixiaotan21, great to see you back here wrestling with the code! 🐙
Based on the error message you provided, it seems like the issue is not directly related to the MarkdownHeaderTextSplitter tokenizer. The error 'NoneType' object has no attribute 'lower' typically occurs when an operation is attempted on a None object that is expected to be a string (or another type that has the lower method).
In the context of Langchain-Chatchat, there are several places where .lower() is called on file extensions or paths, particularly in the KnowledgeFile class and the list_files_from_folder function. If any of these locations receive a None value instead of a string, it could lead to this error.
Here are a few things you could check:
- Ensure that all file paths and names are correctly retrieved and are not None before performing operations like .lower().
- Verify that the ext attribute of a KnowledgeFile object is always a string. The error could occur if os.path.splitext(filename)[-1].lower() results in None being assigned to self.ext if the filename does not have an extension.
- Check the logic in list_files_from_folder and other places where .lower() is used to ensure that the objects being operated on are not None.
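One of the suspects in this list can be checked quickly. As far as the standard library goes, os.path.splitext returns an empty string rather than None when a filename has no extension, so the ext path looks like an unlikely source of this particular error. A minimal check, with filenames made up for illustration:

```python
import os


def file_ext(filename: str) -> str:
    """Return the lowercased extension, mirroring the expression quoted above."""
    return os.path.splitext(filename)[-1].lower()


# Illustrative filenames, not taken from the report.
ext_with_suffix = file_ext("notes.TXT")
ext_without_suffix = file_ext("README")

# Both results are strings; the second is empty, not None.
print(repr(ext_with_suffix), repr(ext_without_suffix))
```

Since .lower() on an empty string is valid, a missing extension alone should not raise this AttributeError.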
In the faiss_kb_service.py file, the .lower() method that could potentially receive a None object is called in the do_delete_doc method of the FaissKBService class. This occurs in the line:
ids = [k for k, v in vs.docstore._dict.items() if v.metadata.get("source").lower() == kb_file.filename.lower()]
Here, v.metadata.get("source") could return None if the "source" key does not exist in the metadata dictionary, and calling .lower() on None would raise an AttributeError.
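The failure mode described here can be reproduced without FAISS at all. The sketch below uses plain dicts as stand-ins for a chunk's metadata; the filenames and the helper name source_matches are illustrative assumptions, not code from the project.

```python
# Stand-in metadata dicts: one chunk kept its "source" key, one lost it.
with_source = {"source": "Notes.txt"}
without_source = {}


def source_matches(metadata: dict, filename: str) -> bool:
    """Same comparison as the failing line, with no guard against None."""
    return metadata.get("source").lower() == filename.lower()


# Works when the key is present.
ok = source_matches(with_source, "notes.txt")

# dict.get returns None for a missing key, so .lower() raises AttributeError.
try:
    source_matches(without_source, "notes.txt")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(ok, error_message)
```

A single chunk without a "source" key is enough to make the whole comprehension fail, which would explain why both deleting and re-adding files break once such chunks are in the index.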
As for adding multiple files to the knowledge base, there is a specific order or process to follow. The process involves several steps such as selecting or creating a knowledge base, uploading files, configuring file processing settings, adding the files to the knowledge base, and optionally managing the files and the knowledge base afterward.
In summary, the error seems to be related to handling of file paths or extensions rather than the MarkdownHeaderTextSplitter tokenizer directly. Additional debugging or error handling around string operations and file path manipulations is recommended to resolve this issue.
For more details, you can refer to the following sources:
- server/knowledge_base/utils.py
- server/knowledge_base/kb_service/faiss_kb_service.py
- webui_pages/knowledge_base/knowledge_base.py
I hope this helps! If you need further assistance, please provide more details about the specific steps you took and the full traceback of the error.
No, this has nothing to do with the text splitter.
Same bug, waiting for an answer.
Same bug.
Same problem. After switching the text splitter back to the default, uploading the first file works, but deleting it raises this error too, and no further files can be uploaded afterwards.
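A custom text splitter may cause this problem. The error comes from ids = [k for k, v in vs.docstore._dict.items() if v.metadata.get("source").lower() == kb_file.filename.lower()]. The likely cause is that the custom text splitter does not assign a correct value to metadata["source"]. When processing the metadata, you can add metadata["source"] = self.filepath.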
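The workaround is to rewrite MarkdownHeaderTextSplitter so that the call docs = text_splitter.split_text(docs[0].page_content) carries metadata['source'] along: docs = text_splitter.split_text(docs[0].page_content, docs[0].metadata['source'])
However, this MarkdownHeaderTextSplitter is only roughly integrated with the system, so you need to rewrite it yourself.
UnstructuredMarkdownLoader strips the ## markers from the document, so this splitter ends up having no effect.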
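One way to sketch this workaround without modifying the splitter class itself is a thin wrapper that copies the source onto every chunk after splitting. Chunk, split_by_blank_lines, and split_with_source are hypothetical stand-ins written for this example; they are not part of LangChain or Langchain-Chatchat, and the real Document class and MarkdownHeaderTextSplitter would take their place.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_by_blank_lines(text: str) -> List[Chunk]:
    """Toy splitter: like the splitter discussed here, it returns chunks with no source."""
    return [Chunk(part.strip()) for part in text.split("\n\n") if part.strip()]


def split_with_source(
    split_text: Callable[[str], List[Chunk]], text: str, source: str
) -> List[Chunk]:
    """Split the text, then stamp the originating file name on every chunk."""
    chunks = split_text(text)
    for chunk in chunks:
        chunk.metadata["source"] = source
    return chunks


chunks = split_with_source(split_by_blank_lines, "first part\n\nsecond part", "notes.txt")
print([chunk.metadata for chunk in chunks])
```

Keeping the stamping outside the splitter means the same wrapper would work for any splitter that drops the source, at the cost of one extra pass over the chunks.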
I gave it a try, and it seems that changing the following line in faiss_kb_service.py:
ids = [k for k, v in vs.docstore._dict.items() if v.metadata.get("source").lower() == kb_file.filename.lower()]
to the following makes it work:
ids = [k for k, v in vs.docstore._dict.items() if v.metadata.get("source") and v.metadata.get("source").lower() == kb_file.filename.lower()]
Or could you try whether this works?
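The guarded comparison proposed for faiss_kb_service.py can also be factored into a small helper, which makes the None case explicit. This is only a sketch: ids_for_file is a hypothetical name, and a plain dict of metadata dicts stands in for vs.docstore._dict.

```python
from typing import Dict, List


def ids_for_file(store: Dict[str, dict], filename: str) -> List[str]:
    """Collect ids whose "source" matches filename, ignoring case and skipping missing sources."""
    target = filename.lower()
    ids = []
    for doc_id, metadata in store.items():
        source = metadata.get("source")
        # Only compare when the source is actually a string.
        if isinstance(source, str) and source.lower() == target:
            ids.append(doc_id)
    return ids


# Stand-in for the docstore: id -> metadata (illustrative values).
store = {
    "a": {"source": "Notes.txt"},
    "b": {},                      # chunk that lost its source
    "c": {"source": "other.txt"},
    "d": {"source": None},        # explicit None is also skipped
}
matching_ids = ids_for_file(store, "notes.txt")
print(matching_ids)
```

Note that chunks without a source are skipped rather than deleted, so they would remain in the index; setting the source at split time addresses that side of the problem.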
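Same bug.
I worked around it, and adding and deleting files repeatedly now works. Modify knowledge_base/utils.py: after text_splitter.split_text, iterate over docs and set metadata["source"] on each one. Tested and working.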
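docs = text_splitter.split_text(docs[0].page_content)
change to
# Keep only the file name from the loader's source path (requires `import os` in utils.py).
source = os.path.basename(docs[0].metadata['source'])
docs = text_splitter.split_text(docs[0].page_content)
for doc in docs:
    # Use `is not None` rather than a truthiness test, so chunks with empty metadata are stamped too.
    if doc.metadata is not None:
        doc.metadata["source"] = source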