consume document failed - Collection field dim is 1024, but entities field dim is 0
Self Checks
- [X] This is only for bug reports; if you would like to ask a question, please head to Discussions.
- [X] I have searched for existing issues, including closed ones.
- [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [X] Please do not modify this template :) and fill in all the required fields.
Dify version
0.6.6
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
Knowledge -> Upload documents -> Chunk settings -> Custom
- Segment identifier: `\n\n`
- Maximum chunk length: 500
- Chunk overlap: 50
- Text preprocessing rules: check "Replace consecutive spaces, newlines and tabs"

I suspect it is related to "Replace consecutive spaces, newlines and tabs" (the same settings, expressed via the dataset API, are sketched after this section).
It works fine on 0.6.4!
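For reference, here is how the same chunk settings would look when creating a document through the dataset API. This is a hedged sketch, not a verified reproduction: the endpoint and the `remove_extra_spaces` rule id follow the dataset API docs (en/features/datasets/maintain-dataset-via-api.md), but the base URL, dataset id, and API key are placeholders, and I am not certain chunk overlap is exposed in this API version, so it is omitted.

```python
import requests

# Hypothetical placeholders: substitute your own values.
BASE_URL = "http://localhost/v1"  # or https://api.dify.ai/v1 for cloud
DATASET_ID = "your-dataset-id"
API_KEY = "your-dataset-api-key"

# process_rule mirrors the UI settings above; "remove_extra_spaces" is the
# pre-processing rule id that replaces consecutive spaces, newlines and tabs.
payload = {
    "name": "test-document",
    "text": "First paragraph.\n\nSecond paragraph.",
    "indexing_technique": "high_quality",
    "process_rule": {
        "mode": "custom",
        "rules": {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": True},
            ],
            "segmentation": {"separator": "\n\n", "max_tokens": 500},
        },
    },
}

resp = requests.post(
    f"{BASE_URL}/datasets/{DATASET_ID}/document/create_by_text",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
print(resp.status_code, resp.json())
```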
✔️ Expected Behavior
Embedding complete.
❌ Actual Behavior
Reported error:

```
[2024-05-09 01:12:02,952: DEBUG/MainProcess] Prefix dict has been built successfully.
[2024-05-09 01:12:03,340: DEBUG/MainProcess] Created new connection using: 0ef89dbb6b6340f98a776bee7a1e3bea
[2024-05-09 01:12:03,351: DEBUG/MainProcess] Created new connection using: 4f2bd99b26314c989c3ae9d6454ecabf
[2024-05-09 01:12:03,371: ERROR/MainProcess] RPC error: [insert_rows], <ParamError: (code=1, message=Collection field dim is 1024, but entities field dim is 0)>, <Time:{'RPC start': '2024-05-09 01:12:03.366004', 'RPC error': '2024-05-09 01:12:03.371536'}>
[2024-05-09 01:12:03,371: ERROR/MainProcess] Failed to insert batch starting at entity: 0/10
[2024-05-09 01:12:03,371: ERROR/MainProcess] Failed to insert batch starting at entity: 0/10
[2024-05-09 01:12:03,378: ERROR/MainProcess] consume document failed
Traceback (most recent call last):
  File "/app/api/core/indexing_runner.py", line 73, in run
    self._load(
  File "/app/api/core/indexing_runner.py", line 677, in _load
    tokens += future.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/app/api/core/indexing_runner.py", line 732, in _process_chunk
    index_processor.load(dataset, chunk_documents, with_keywords=False)
  File "/app/api/core/rag/index_processor/processor/paragraph_index_processor.py", line 60, in load
    vector.create(documents)
  File "/app/api/core/rag/datasource/vdb/vector_factory.py", line 173, in create
    self._vector_processor.create(
  File "/app/api/core/rag/datasource/vdb/milvus/milvus_vector.py", line 68, in create
    self.add_texts(texts, embeddings)
  File "/app/api/core/rag/datasource/vdb/milvus/milvus_vector.py", line 94, in add_texts
    raise e
  File "/app/api/core/rag/datasource/vdb/milvus/milvus_vector.py", line 88, in add_texts
    ids = self._client.insert(collection_name=self._collection_name, data=batch_insert_list)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/milvus_client/milvus_client.py", line 206, in insert
    raise ex from ex
  File "/usr/local/lib/python3.10/site-packages/pymilvus/milvus_client/milvus_client.py", line 198, in insert
    res = conn.insert_rows(collection_name, insert_batch, timeout=timeout)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 127, in handler
    raise e from e
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 123, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 162, in handler
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 102, in handler
    raise e from e
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 68, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/client/grpc_handler.py", line 501, in insert_rows
    request = self._prepare_row_insert_request(
  File "/usr/local/lib/python3.10/site-packages/pymilvus/client/grpc_handler.py", line 482, in _prepare_row_insert_request
    return Prepare.row_insert_param(
  File "/usr/local/lib/python3.10/site-packages/pymilvus/client/prepare.py", line 422, in row_insert_param
    return cls._parse_row_request(request, fields_info, enable_dynamic, entities)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/client/prepare.py", line 394, in _parse_row_request
    _, _, auto_id_loc = traverse_rows_info(fields_info, entities)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/client/utils.py", line 215, in traverse_rows_info
    raise ParamError(
pymilvus.exceptions.ParamError: <ParamError: (code=1, message=Collection field dim is 1024, but entities field dim is 0)>
```
The error you're encountering, "Collection field dim is 1024, but entities field dim is 0," indicates a mismatch between the dimensionality the collection expects and what is actually being submitted: Milvus expects a 1024-dimensional vector for each row, but at least one submitted row carries an empty (zero-dimensional) vector, suggesting the embedding output was empty or improperly formatted.
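The Milvus side of the message is easy to reproduce in isolation. Below is a minimal sketch, assuming a reachable Milvus instance on the default local port and an arbitrary collection name; it inserts a row whose vector field is an empty list into a collection declared with `dimension=1024`, which fails client-side with the same ParamError.

```python
from pymilvus import MilvusClient

# Assumes a local Milvus instance; the collection name is arbitrary.
client = MilvusClient(uri="http://localhost:19530")
client.create_collection(collection_name="dim_demo", dimension=1024)

# A row whose "vector" field is an empty list has dim 0, so pymilvus
# rejects it while preparing the insert request, before any data is sent:
#   pymilvus.exceptions.ParamError: <ParamError: (code=1,
#   message=Collection field dim is 1024, but entities field dim is 0)>
client.insert(collection_name="dim_demo", data=[{"id": 0, "vector": []}])
```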
This issue could be related to the document processing and embedding phase, where the "Replace consecutive spaces, newlines, and tabs" setting might be affecting the output. Since you've identified a change in behavior between versions 0.6.4 and 0.6.6, it's possible that adjustments or updates in the newer version could be impacting how documents are processed and prepared for embedding.
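As a purely hypothetical illustration of that interaction (this is not Dify's actual cleaning code), consider what a replace-consecutive-whitespace rule does to a `\n\n` segment identifier depending on which step runs first:

```python
import re

SEPARATOR = "\n\n"  # the custom segment identifier from the report
sample = "First paragraph.\n\n\n\nSecond paragraph."

# If the whitespace rule runs BEFORE segmentation, the \n\n separator is
# collapsed away and the whole document becomes one oversized chunk:
cleaned_first = re.sub(r"\s+", " ", sample)
print(cleaned_first.split(SEPARATOR))
# ['First paragraph. Second paragraph.']

# If it runs AFTER segmentation, runs of separators leave empty chunks;
# embedding an empty string could plausibly produce the zero-dimensional
# rows that Milvus rejects:
chunks = [re.sub(r"\s+", " ", c).strip() for c in sample.split(SEPARATOR)]
print(chunks)
# ['First paragraph.', '', 'Second paragraph.']
```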
To address this, please ensure that the documents you're trying to upload are correctly formatted and contain the expected content. Also, reviewing the custom chunk settings and how the "Replace consecutive spaces, newlines, and tabs" option is configured could help identify if there's a specific setting causing the issue. Adjusting these settings or testing with a simpler document might help isolate the problem.
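To isolate the problem on your side, you could pre-check what your separator produces for a given document and confirm that no empty chunks survive preprocessing. The splitting below is a stand-in for illustration, not Dify's implementation:

```python
def non_empty_chunks(text: str, separator: str = "\n\n") -> list[str]:
    """Split on the segment identifier and discard whitespace-only chunks,
    so that no empty string is ever sent to the embedding model."""
    return [c.strip() for c in text.split(separator) if c.strip()]

doc = "First paragraph.\n\n\n\n   \n\nSecond paragraph."
print(non_empty_chunks(doc))
# ['First paragraph.', 'Second paragraph.']
```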
If the issue persists, providing more details about the documents (format, size, content) and the exact settings used for chunking and text preprocessing could offer further insights into what might be going wrong.
Sources
- en/features/datasets/README.md
- en/features/datasets/maintain-dataset-via-api.md
- sdks/nodejs-client/index.test.js