Without a length limit on crawl4ai metadata, inserts can fail with a Milvus max length error

Open DuskXi opened this issue 10 months ago • 0 comments

Describe the bug

For some web pages, the metadata is too long. The pages I accessed have a large number of image entries in the data returned by crawl4ai. Even after splitting with chunks = split_docs_to_chunks(all_docs), each chunk's metadata['media']['images'] holds more than 700 entries. If no limit is imposed, pages that produce this much metadata make the Milvus insert fail with "actual=length exceeds max length".
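A possible workaround is to cap the serialized metadata size before inserting. The sketch below is not part of deep-searcher: trim_metadata and json_size are hypothetical helpers, and it assumes Milvus measures the limit (65536, as reported in the error) against the UTF-8 byte length of the serialized JSON. It simply drops the largest top-level keys until the metadata fits.

```python
import json

# Limit reported in the Milvus error message for the JSON field.
# Assumption: it is measured in bytes of the serialized JSON.
MAX_JSON_BYTES = 65536


def json_size(value) -> int:
    """Return the UTF-8 byte length of value serialized as JSON."""
    return len(json.dumps(value, ensure_ascii=False).encode("utf-8"))


def trim_metadata(metadata: dict, max_bytes: int = MAX_JSON_BYTES) -> dict:
    """Hypothetical helper: drop the largest top-level keys until the JSON fits."""
    trimmed = dict(metadata)  # shallow copy, the caller's dict is left untouched
    while trimmed and json_size(trimmed) > max_bytes:
        largest_key = max(trimmed, key=lambda k: json_size(trimmed[k]))
        del trimmed[largest_key]
    return trimmed


if __name__ == "__main__":
    # Synthetic metadata shaped like the report: hundreds of image entries.
    meta = {
        "title": "example page",
        "media": {"images": [{"src": "x" * 200} for _ in range(700)]},
    }
    print(json_size(meta) > MAX_JSON_BYTES)  # oversized before trimming
    print(list(trim_metadata(meta)))         # keys that survive trimming
```

If the chunk objects expose their metadata as a dict (an assumption about the Chunk class), the helper could be applied to every chunk between split_docs_to_chunks and the insert. Dropping whole keys is crude; truncating the image list instead would keep more information.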

To Reproduce

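from deepsearcher.configuration import Configuration, init_config

# Use a local Milvus instance and the Crawl4AI web crawler
config = Configuration()
config.set_provider_config("vector_db", "Milvus", {"uri": "http://localhost:19530", "token": ""})
config.set_provider_config("web_crawler", "Crawl4AICrawler", {"browser_config": {"headless": True, "verbose": True}})
init_config(config=config)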

from deepsearcher.offline_loading import load_from_website

load_from_website(
    urls=[some website can cause problem......],
)

Expected behavior

Screenshots

Environment (please complete the following information):

  • OS: Windows
  • Dependencies: as listed in deep-searcher requirements.txt
  • Milvus: latest Docker version

Additional context

Error:

2025-03-06 21:08:37,910 - CRITICAL - fail to insert data, error info: <MilvusException: (code=1100, message=the length (421593) of json field (metadata) exceeds max length (65536): invalid parameter[expected=valid length json string][actual=length exceeds max length])>
Traceback (most recent call last):
  File "W:\work\deep-searcher\deepsearcher\vector_db\milvus.py", line 98, in insert_data
    self.client.insert(collection_name=collection, data=batch_data)
  File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\milvus_client\milvus_client.py", line 231, in insert
    raise ex from ex
  File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\milvus_client\milvus_client.py", line 227, in insert
    res = conn.insert_rows(
          ^^^^^^^^^^^^^^^^^
  File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\decorators.py", line 141, in handler
    raise e from e
  File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\decorators.py", line 137, in handler
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\decorators.py", line 176, in handler
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\decorators.py", line 116, in handler
    raise e from e
  File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\decorators.py", line 86, in handler
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\client\grpc_handler.py", line 529, in insert_rows
    check_status(resp.status)
  File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\client\utils.py", line 64, in check_status
    raise MilvusException(status.code, status.reason, status.error_code)
pymilvus.exceptions.MilvusException: <MilvusException: (code=1100, message=the length (421593) of json field (metadata) exceeds max length (65536): invalid parameter[expected=valid length json string][actual=length exceeds max length])>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "W:\work\deep-searcher\deepsearcher\vector_db\milvus.py", line 100, in insert_data
    log.critical(f"fail to insert data, error info: {e}")
  File "W:\work\deep-searcher\deepsearcher\tools\log.py", line 89, in critical
    raise RuntimeError(message)
RuntimeError: fail to insert data, error info: <MilvusException: (code=1100, message=the length (421593) of json field (metadata) exceeds max length (65536): invalid parameter[expected=valid length json string][actual=length exceeds max length])>

DuskXi, Mar 06 '25 13:03