deep-searcher
If no length limit is set on the crawl4ai metadata, it can trigger a Milvus max-length error
Describe the bug
When crawling some web pages, the metadata is too long. The pages I accessed include a large number of image entries in the data returned by crawl4ai. Even after splitting with chunks = split_docs_to_chunks(all_docs), each chunk's metadata['media']['images'] still contains more than 700 entries. Without a limit, such pages produce metadata large enough to trigger Milvus's "actual=length exceeds max length" error.
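One possible workaround, sketched below, is to shrink the metadata before it reaches Milvus. This is a hypothetical helper, not part of deep-searcher's API; it assumes the crawl4ai-style metadata dict described above, where the 'media' key holds the oversized image list, and Milvus's default 65536-byte limit for JSON fields.

```python
import json

# Milvus's default max length for a JSON field, per the error above.
MAX_METADATA_LEN = 65536

def limit_metadata(metadata: dict, max_len: int = MAX_METADATA_LEN) -> dict:
    """Hypothetical helper: shrink metadata so its JSON form fits Milvus."""
    slim = dict(metadata)
    # The 'media' key (hundreds of image entries) is what blows up the size.
    if len(json.dumps(slim)) > max_len:
        slim.pop("media", None)
    # Last resort: keep only short scalar fields.
    if len(json.dumps(slim)) > max_len:
        slim = {k: v for k, v in slim.items()
                if isinstance(v, (str, int, float, bool)) and len(str(v)) < 1024}
    return slim
```

Calling limit_metadata on each chunk's metadata before insertion would keep the serialized JSON under the field limit at the cost of dropping the image list.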
To Reproduce
from deepsearcher.configuration import Configuration, init_config
from deepsearcher.offline_loading import load_from_website

config = Configuration()
config.set_provider_config("vector_db", "Milvus", {"uri": "http://localhost:19530", "token": ""})
config.set_provider_config("web_crawler", "Crawl4AICrawler", {"browser_config": {"headless": True, "verbose": True}})
init_config(config=config)
load_from_website(
    urls=[some website can cause problem......],
)
Expected behavior
The metadata should be truncated or limited so that inserting into Milvus succeeds instead of raising a max-length error.
Environment (please complete the following information):
- OS: Windows
- deep-searcher: installed from requirements.txt
- Milvus version: latest (Docker)
Additional context
error:
2025-03-06 21:08:37,910 - CRITICAL - fail to insert data, error info: <MilvusException: (code=1100, message=the length (421593) of json field (metadata) exceeds max length (65536): invalid parameter[expected=valid length json string][actual=length exceeds max length])>
Traceback (most recent call last):
File "W:\work\deep-searcher\deepsearcher\vector_db\milvus.py", line 98, in insert_data
self.client.insert(collection_name=collection, data=batch_data)
File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\milvus_client\milvus_client.py", line 231, in insert
raise ex from ex
File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\milvus_client\milvus_client.py", line 227, in insert
res = conn.insert_rows(
^^^^^^^^^^^^^^^^^
File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\decorators.py", line 141, in handler
raise e from e
File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\decorators.py", line 137, in handler
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\decorators.py", line 176, in handler
return func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\decorators.py", line 116, in handler
raise e from e
File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\decorators.py", line 86, in handler
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\client\grpc_handler.py", line 529, in insert_rows
check_status(resp.status)
File "W:\work\deep-searcher\.venv\Lib\site-packages\pymilvus\client\utils.py", line 64, in check_status
raise MilvusException(status.code, status.reason, status.error_code)
pymilvus.exceptions.MilvusException: <MilvusException: (code=1100, message=the length (421593) of json field (metadata) exceeds max length (65536): invalid parameter[expected=valid length json string][actual=length exceeds max length])>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "W:\work\deep-searcher\deepsearcher\vector_db\milvus.py", line 100, in insert_data
log.critical(f"fail to insert data, error info: {e}")
File "W:\work\deep-searcher\deepsearcher\tools\log.py", line 89, in critical
raise RuntimeError(message)
RuntimeError: fail to insert data, error info: <MilvusException: (code=1100, message=the length (421593) of json field (metadata) exceeds max length (65536): invalid parameter[expected=valid length json string][actual=length exceeds max length])>
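A possible fix on the insertion side, sketched under assumptions: guard each batch row before self.client.insert is called in milvus.py, dropping oversized metadata instead of letting the whole batch fail. The function name and row shape below are illustrative, not the project's actual API.

```python
import json

# Milvus's default max length for a JSON field, per the error message above.
MILVUS_JSON_MAX = 65536

def filter_oversized_rows(batch_data: list) -> list:
    """Hypothetical guard: empty any row's metadata whose JSON form is too long."""
    safe = []
    for row in batch_data:
        meta = row.get("metadata", {})
        if len(json.dumps(meta)) > MILVUS_JSON_MAX:
            # Drop the metadata rather than fail the entire insert batch.
            row = {**row, "metadata": {}}
        safe.append(row)
    return safe
```

With a guard like this, one page with bloated crawl4ai media entries would lose its metadata but no longer abort the whole load_from_website run.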