
[BUG] VectorRetriever.process() does not write data to the database in real time

Open LT-07 opened this issue 10 months ago • 3 comments

Required prerequisites

What version of camel are you using?

0.2.22

System information

Installed via pip: pip install "camel-ai[all]==0.2.22"

Problem description

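  • After instantiating VectorRetriever and calling its process() method on a file, the file's vectors are not written to the vector database in real time.
  • AutoRetriever has the same bug; it is not described separately here.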

Reproducible example code

Script 1: store the vectors of camel_paper.pdf in a QdrantStorage database via vr.process(). When it finishes, the storage_customized_run database has been created.


from camel.retrievers import VectorRetriever
from camel.embeddings import SentenceTransformerEncoder
import os
import requests

os.makedirs('local_data', exist_ok=True)

url = "https://arxiv.org/pdf/2303.17760.pdf"
response = requests.get(url)
with open('local_data/camel_paper.pdf', 'wb') as file:
    file.write(response.content)

embedding_model=SentenceTransformerEncoder(model_name='intfloat/e5-large-v2')
vr = VectorRetriever(embedding_model=embedding_model)

from camel.storages.vectordb_storages import QdrantStorage

vector_storage = QdrantStorage(
    vector_dim=embedding_model.get_output_dim(),
    collection="demo_collection",
    path="storage_customized_run",
    collection_name="论文"
)
vr.process(
    content=r"local_data\camel_paper.pdf",
    storage=vector_storage
)

Script 2: load the storage_customized_run database created above and query it via vr.query(). This raises an error.

from camel.agents import ChatAgent
from camel.models import ModelFactory
from camel.types import ModelPlatformType

from dotenv import load_dotenv
import os
from camel.storages.vectordb_storages import QdrantStorage

from camel.embeddings import SentenceTransformerEncoder
embedding_model=SentenceTransformerEncoder(model_name='intfloat/e5-large-v2')

vector_storage = QdrantStorage(
    vector_dim=embedding_model.get_output_dim(),
    collection="demo_collection",
    path="storage_customized_run",
    collection_name="论文"
)
from camel.retrievers import VectorRetriever
vr = VectorRetriever(embedding_model=embedding_model,storage=vector_storage)


retrieved_info = vr.query(
    query="what is roleplaying?",
    top_k=1,
)

The error is as follows:

ValueError: Query result is empty, please check if the vector storage is empty.
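  • Cause: on Windows and macOS, when the VectorRetriever instance method process() writes the vectors of camel_paper.pdf to the storage_customized_run database, the vectors have not been written to the database by the time script 1 finishes, so script 2 cannot load them immediately.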
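
One way to check this diagnosis before running script 2 is to ask the storage how many vectors it holds. The following is a minimal sketch, assuming the storage object exposes a `status()` method whose result has a `vector_count` attribute (this is an assumption about the storage interface; verify it against your installed camel version). `count_vectors` and `FakeStorage` are hypothetical names used only for illustration.

```python
from types import SimpleNamespace


def count_vectors(storage) -> int:
    """Return the number of vectors the storage reports.

    Assumes `storage.status()` returns an object with a `vector_count`
    attribute; this is an assumption about the storage interface.
    """
    return int(storage.status().vector_count)


class FakeStorage:
    """Hypothetical stand-in so the sketch runs without a real database."""

    def __init__(self, n: int) -> None:
        self._n = n

    def status(self) -> SimpleNamespace:
        return SimpleNamespace(vector_dim=1024, vector_count=self._n)


empty_count = count_vectors(FakeStorage(0))
filled_count = count_vectors(FakeStorage(42))
print(empty_count, filled_count)
```

With the real objects, a call such as `count_vectors(vector_storage)` returning 0 at the end of script 1 would suggest that nothing was written, which is consistent with the empty query result in script 2.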

Steps to reproduce:

1. Run script 1.
2. Run script 2.

Traceback

Traceback (most recent call last):
 File "D:\camle_muLti_aqent \04- RAG应用构建\a02-使用构建好的向量数据库加载.py ", line 23, in <module>
    retrieved_info = vr.query(
  File "D:\anaconda3\envs\camel_multi_agent\lib\site-packages\camel\retrievers\vector_retriever.py", line 231, in query
    raise ValueError(
ValueError: Query result is empty, please check if the vector storage is empty.

Expected behavior

  • Fix the problem where data is not written to the database in real time after a VectorRetriever instance runs process().
  • AutoRetriever has the same bug; please fix it as well.

Additional context

数据库写入问题.docx

LT-07 avatar Feb 26 '25 15:02 LT-07

Required Libraries

Make sure to install unstructured[pdf]

pip install "unstructured[pdf]"

Otherwise, you might see the following warnings (they do not cause errors, so they might go unnoticed):

/Users/subway/code/python/camel/camel/loaders/unstructured_io.py:154: UserWarning: Failed to partition the file: local_data/camel_paper.pdf
  warnings.warn(f"Failed to partition the file: {input_path}")

/Users/subway/code/python/camel/camel/retrievers/vector_retriever.py:137: UserWarning: No elements were extracted from the content: local_data/camel_paper.pdf
  warnings.warn(
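
Because these are warnings rather than exceptions, one option is to promote them to errors while processing, so that a failed parse stops the script instead of silently leaving the store empty. The sketch below uses only the standard `warnings` module. `fake_process` is a hypothetical stand-in for `vr.process(...)` that merely emits a similar `UserWarning`, and `run_strict` is an illustrative helper name.

```python
import warnings


def fake_process(path: str) -> None:
    # Hypothetical stand-in for vr.process(...): emits the same kind of
    # UserWarning shown above instead of raising an exception.
    warnings.warn(f"No elements were extracted from the content: {path}")


def run_strict(func, *args, **kwargs):
    """Call func with UserWarning promoted to an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", UserWarning)
        return func(*args, **kwargs)


try:
    run_strict(fake_process, "local_data/camel_paper.pdf")
    failed = False
except UserWarning as exc:
    failed = True
    print(f"processing failed: {exc}")
```

The filter is restored when the `with` block exits, so warnings elsewhere in the script keep their normal behavior.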

Using Vector Storage

If you want to customize QdrantStorage, you need to pass it during the initialization of VectorRetriever:

vr = VectorRetriever(embedding_model=embedding_model, storage=vector_storage)

Script 1 Fix (Mac Example)

Main change: pass vector_storage when initializing VectorRetriever, instead of passing it to process().

from camel.retrievers import VectorRetriever
from camel.embeddings import SentenceTransformerEncoder
import os
import requests

# Create directory
os.makedirs('local_data', exist_ok=True)

# Download PDF file
url = "https://arxiv.org/pdf/2303.17760.pdf"
response = requests.get(url)
pdf_path = os.path.join("local_data", "camel_paper.pdf")
with open(pdf_path, 'wb') as file:
    file.write(response.content)

# Initialize vector storage
from camel.storages.vectordb_storages import QdrantStorage
embedding_model = SentenceTransformerEncoder(model_name='intfloat/e5-large-v2')

vector_storage = QdrantStorage(
    vector_dim=embedding_model.get_output_dim(),
    collection="demo_collection",
    path="storage_customized_run",
    collection_name="论文"
)

# Directly pass in storage during initialization
vr = VectorRetriever(embedding_model=embedding_model, storage=vector_storage)

# Process PDF
vr.process(content=pdf_path)


Script 2 Output

Script 2 was run unchanged; the output is as follows:

[
  {
    "similarity score": "0.8241451529164397",
    "content path": "local_data/camel_paper.pdf",
    "metadata": {
      "filetype": "application/pdf",
      "languages": ["eng"],
      "page_number": 1,
      "is_continuation": true
    },
    "extra_info": {},
    "text": "novel communicative agent framework named role- playing.  Our approach involves using inception prompting to guide chat  agents toward task completion while maintaining consistency with human intentions. We showcase how role-playing can be used to generate conversational data for studying the behaviors and capabilities of a society of agents, providing a valuable resource for investigating conversational language models. In particular, we conduct comprehensive studies on instruction-following"
  }
]

subway-jack avatar Mar 13 '25 07:03 subway-jack

Why do I still get an error after running pip install "unstructured[pdf]"?

Image

wuyongdi avatar Apr 02 '25 09:04 wuyongdi

Can you share the code you are running?

subway-jack avatar Apr 02 '25 10:04 subway-jack