[BUG] VectorRetriever.process() does not write data to the database in real time
Required prerequisites
- [x] I have read the documentation https://camel-ai.github.io/camel/camel.html.
- [x] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
- [x] Consider asking first in a Discussion.
What version of camel are you using?
0.2.22
System information
Installed via pip: pip install "camel-ai[all]==0.2.22"
Problem description
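- After instantiating VectorRetriever and calling its process() method on a file, the file's vectors are supposed to be stored in the vector database, but they are not written to it in real time.
- AutoRetriever instances have the same bug; it is not described separately here.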
Reproducible example code
Script 1: store the vectors of camel_paper.pdf in a QdrantStorage database via vr.process(). When it finishes, the storage_customized_run database has been created.
from camel.retrievers import VectorRetriever
from camel.embeddings import SentenceTransformerEncoder
import os
import requests
os.makedirs('local_data', exist_ok=True)
url = "https://arxiv.org/pdf/2303.17760.pdf"
response = requests.get(url)
with open('local_data/camel_paper.pdf', 'wb') as file:
    file.write(response.content)
embedding_model=SentenceTransformerEncoder(model_name='intfloat/e5-large-v2')
vr = VectorRetriever(embedding_model=embedding_model)
from camel.storages.vectordb_storages import QdrantStorage
vector_storage = QdrantStorage(
    vector_dim=embedding_model.get_output_dim(),
    collection="demo_collection",
    path="storage_customized_run",
    collection_name="论文"
)
vr.process(
    content=r"local_data\camel_paper.pdf",
    storage=vector_storage
)
Script 2: load the existing storage_customized_run database and query it via vr.query(), which raises an error.
from camel.agents import ChatAgent
from camel.models import ModelFactory
from camel.types import ModelPlatformType
from dotenv import load_dotenv
import os
from camel.storages.vectordb_storages import QdrantStorage
from camel.embeddings import SentenceTransformerEncoder
embedding_model=SentenceTransformerEncoder(model_name='intfloat/e5-large-v2')
vector_storage = QdrantStorage(
    vector_dim=embedding_model.get_output_dim(),
    collection="demo_collection",
    path="storage_customized_run",
    collection_name="论文"
)
from camel.retrievers import VectorRetriever
vr = VectorRetriever(embedding_model=embedding_model,storage=vector_storage)
retrieved_info = vr.query(
    query="what is roleplaying?",
    top_k=1,
)
The error is as follows:
ValueError: Query result is empty, please check if the vector storage is empty.
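- Cause: on Windows and macOS, when the VectorRetriever instance method process() writes the vectors of camel_paper.pdf to the storage_customized_run database, the vectors have still not been written to the database when script 1 finishes, so script 2 cannot load them immediately afterwards.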
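One way to confirm this diagnosis is to check how many vectors the storage holds before querying it. The sketch below assumes the storage object exposes a status() method whose result has a vector_count field, which is how camel's vector storage base class appears to work; the helper name require_vectors and the stub storages are hypothetical and only stand in for a real QdrantStorage.

```python
from types import SimpleNamespace


def require_vectors(storage):
    """Return the number of stored vectors, or raise if the storage is empty.

    Assumes `storage.status()` returns an object with a `vector_count`
    attribute (an assumption about the storage interface).
    """
    count = storage.status().vector_count
    if count == 0:
        raise RuntimeError(
            "Vector storage is empty; process() probably wrote elsewhere."
        )
    return count


# Hypothetical stubs standing in for real storage objects.
populated = SimpleNamespace(status=lambda: SimpleNamespace(vector_count=3))
empty = SimpleNamespace(status=lambda: SimpleNamespace(vector_count=0))

print(require_vectors(populated))
try:
    require_vectors(empty)
except RuntimeError as exc:
    print(f"check failed: {exc}")
```

Calling such a helper at the end of script 1 and at the start of script 2 would show which of the two scripts sees an empty collection.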
Steps to reproduce:
1. Run script 1. 2. Run script 2.
Traceback
Traceback (most recent call last):
  File "D:\camle_muLti_aqent \04- RAG应用构建\a02-使用构建好的向量数据库加载.py ", line 23, in <module>
    retrieved_info = vr.query(
  File "D:\anaconda3\envs\camel_multi_agent\lib\site-packages\camel\retrievers\vector_retriever.py", line 231, in query
    raise ValueError(
ValueError: Query result is empty, please check if the vector storage is empty.
Expected behavior
- After a VectorRetriever instance runs process(), the data should be written to the database in real time.
- AutoRetriever instances have the same bug and should be fixed as well.
Additional context
Required Libraries
Make sure to install unstructured[pdf]:
pip install "unstructured[pdf]"
Otherwise, you might see the following warnings (they do not cause errors, so they might go unnoticed):
/Users/subway/code/python/camel/camel/loaders/unstructured_io.py:154: UserWarning: Failed to partition the file: local_data/camel_paper.pdf
  warnings.warn(f"Failed to partition the file: {input_path}")
/Users/subway/code/python/camel/camel/retrievers/vector_retriever.py:137: UserWarning: No elements were extracted from the content: local_data/camel_paper.pdf
  warnings.warn(
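Because these are only warnings, one option is to promote UserWarning to an exception around the processing call so that a failed partition stops the script. The sketch below uses only the standard warnings module; run_strict and fake_process are hypothetical names, and fake_process merely imitates a call that emits such a warning.

```python
import warnings


def run_strict(func, *args, **kwargs):
    """Call func, turning any UserWarning it emits into an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", UserWarning)
        return func(*args, **kwargs)


def fake_process(path):
    # Hypothetical stand-in for a processing call that only warns on failure.
    warnings.warn(f"Failed to partition the file: {path}")


try:
    run_strict(fake_process, "local_data/camel_paper.pdf")
except UserWarning as exc:
    print(f"stopped early: {exc}")
```

In script 1 this would look like run_strict(vr.process, content=pdf_path), assuming the warnings are raised as UserWarning as the output above suggests.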
Using Vector Storage
If you want to customize QdrantStorage, you need to pass it during the initialization of VectorRetriever:
vr = VectorRetriever(embedding_model=embedding_model, storage=vector_storage)
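To illustrate why the place where the storage is passed can matter, here is a toy model of the failure mode. It is not camel's code: ToyRetriever is a hypothetical class, and the sketch assumes that a storage keyword passed to process() is accepted but not used, which the thread does not confirm.

```python
class ToyRetriever:
    """Toy stand-in for a retriever; NOT camel's VectorRetriever."""

    def __init__(self, storage=None):
        # Fall back to a private in-memory list when no storage is given.
        self.storage = storage if storage is not None else []

    def process(self, content, **kwargs):
        # Extra keyword arguments (such as storage=...) are accepted
        # but never used in this toy model.
        self.storage.append(content)


ignored = []  # passed to process(): stays empty
used = []     # passed to the constructor: receives the data

ToyRetriever().process("chunk", storage=ignored)
ToyRetriever(storage=used).process("chunk")

print(len(ignored), len(used))  # prints "0 1"
```

Under that assumption, the first call writes into a throwaway default storage, which would match the empty collection that script 2 sees.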
Script 1 Fix (Mac Example)
Main change: pass vector_storage when initializing VectorRetriever instead of passing it to process().
from camel.retrievers import VectorRetriever
from camel.embeddings import SentenceTransformerEncoder
import os
import requests
# Create directory
os.makedirs('local_data', exist_ok=True)
# Download PDF file
url = "https://arxiv.org/pdf/2303.17760.pdf"
response = requests.get(url)
pdf_path = os.path.join("local_data", "camel_paper.pdf")
with open(pdf_path, 'wb') as file:
    file.write(response.content)
# Initialize vector storage
from camel.storages.vectordb_storages import QdrantStorage
embedding_model = SentenceTransformerEncoder(model_name='intfloat/e5-large-v2')
vector_storage = QdrantStorage(
    vector_dim=embedding_model.get_output_dim(),
    collection="demo_collection",
    path="storage_customized_run",
    collection_name="论文"
)
# Directly pass in storage during initialization
vr = VectorRetriever(embedding_model=embedding_model, storage=vector_storage)
# Process PDF
vr.process(content=pdf_path)
Script 2 Output
Script 2 is unchanged. After running the fixed script 1, it produces the following output:
[
    {
        "similarity score": "0.8241451529164397",
        "content path": "local_data/camel_paper.pdf",
        "metadata": {
            "filetype": "application/pdf",
            "languages": ["eng"],
            "page_number": 1,
            "is_continuation": true
        },
        "extra_info": {},
        "text": "novel communicative agent framework named role- playing. Our approach involves using inception prompting to guide chat agents toward task completion while maintaining consistency with human intentions. We showcase how role-playing can be used to generate conversational data for studying the behaviors and capabilities of a society of agents, providing a valuable resource for investigating conversational language models. In particular, we conduct comprehensive studies on instruction-following"
    }
]
Why do I still get the error after running pip install "unstructured[pdf]"?
Can you share the code you are running?