
[Bug]: Replicas number is not as expected after upgrade from v2.2.3 to 2.2.0-20230310-b2ece6a5

Open zhuwenxing opened this issue 1 year ago • 12 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version:v2.2.3 --> 2.2.0-20230310-b2ece6a5
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): kafka   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


[2023-03-13T11:01:08.474Z] =================================== FAILURES ===================================

[2023-03-13T11:01:08.474Z] _ TestActionSecondDeployment.test_check[deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] _

[2023-03-13T11:01:08.474Z] [gw3] linux -- Python 3.8.10 /usr/bin/python3.8

[2023-03-13T11:01:08.474Z] 

[2023-03-13T11:01:08.474Z] self = <test_action_second_deployment.TestActionSecondDeployment object at 0x7f7fda184dc0>

[2023-03-13T11:01:08.474Z] all_collection_name = 'deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000'

[2023-03-13T11:01:08.474Z] data_size = 3000

[2023-03-13T11:01:08.474Z] 

[2023-03-13T11:01:08.474Z]     @pytest.mark.tags(CaseLabel.L3)

[2023-03-13T11:01:08.474Z]     def test_check(self, all_collection_name, data_size):

[2023-03-13T11:01:08.474Z]         """

[2023-03-13T11:01:08.474Z]         before reinstall: create collection

[2023-03-13T11:01:08.474Z]         """

[2023-03-13T11:01:08.474Z]         self._connect()

[2023-03-13T11:01:08.474Z]         ms = MilvusSys()

[2023-03-13T11:01:08.474Z]         name = all_collection_name

[2023-03-13T11:01:08.474Z]         is_binary = False

[2023-03-13T11:01:08.474Z]         if "BIN" in name:

[2023-03-13T11:01:08.474Z]             is_binary = True

[2023-03-13T11:01:08.474Z]         collection_w, _ = self.collection_wrap.init_collection(name=name)

[2023-03-13T11:01:08.474Z]         self.collection_w = collection_w

[2023-03-13T11:01:08.474Z]         schema = collection_w.schema

[2023-03-13T11:01:08.474Z]         data_type = [field.dtype for field in schema.fields]

[2023-03-13T11:01:08.474Z]         field_name = [field.name for field in schema.fields]

[2023-03-13T11:01:08.474Z]         type_field_map = dict(zip(data_type, field_name))

[2023-03-13T11:01:08.474Z]         if is_binary:

[2023-03-13T11:01:08.474Z]             default_index_field = ct.default_binary_vec_field_name

[2023-03-13T11:01:08.474Z]             vector_index_type = "BIN_IVF_FLAT"

[2023-03-13T11:01:08.474Z]         else:

[2023-03-13T11:01:08.474Z]             default_index_field = ct.default_float_vec_field_name

[2023-03-13T11:01:08.474Z]             vector_index_type = "IVF_FLAT"

[2023-03-13T11:01:08.474Z]     

[2023-03-13T11:01:08.474Z]         binary_vector_index_types = [index.params["index_type"] for index in collection_w.indexes if

[2023-03-13T11:01:08.474Z]                                      index.field_name == type_field_map.get(100, "")]

[2023-03-13T11:01:08.474Z]         float_vector_index_types = [index.params["index_type"] for index in collection_w.indexes if

[2023-03-13T11:01:08.474Z]                                     index.field_name == type_field_map.get(101, "")]

[2023-03-13T11:01:08.474Z]         index_field_map = dict([(index.field_name, index.index_name) for index in collection_w.indexes])

[2023-03-13T11:01:08.474Z]         index_names = [index.index_name for index in collection_w.indexes]  # used to drop index

[2023-03-13T11:01:08.475Z]         vector_index_types = binary_vector_index_types + float_vector_index_types

[2023-03-13T11:01:08.475Z]         if len(vector_index_types) > 0:

[2023-03-13T11:01:08.475Z]             vector_index_type = vector_index_types[0]

[2023-03-13T11:01:08.475Z]         try:

[2023-03-13T11:01:08.475Z]             t0 = time.time()

[2023-03-13T11:01:08.475Z]             self.utility_wrap.wait_for_loading_complete(name)

[2023-03-13T11:01:08.475Z]             log.info(f"wait for {name} loading complete cost {time.time() - t0}")

[2023-03-13T11:01:08.475Z]         except Exception as e:

[2023-03-13T11:01:08.475Z]             log.error(e)

[2023-03-13T11:01:08.475Z]         # get replicas loaded

[2023-03-13T11:01:08.475Z]         try:

[2023-03-13T11:01:08.475Z]             replicas = collection_w.get_replicas(enable_traceback=False)

[2023-03-13T11:01:08.475Z]             replicas_loaded = len(replicas.groups)

[2023-03-13T11:01:08.475Z]         except Exception as e:

[2023-03-13T11:01:08.475Z]             log.error(e)

[2023-03-13T11:01:08.475Z]             replicas_loaded = 0

[2023-03-13T11:01:08.475Z]     

[2023-03-13T11:01:08.475Z]         log.info(f"collection {name} has {replicas_loaded} replicas")

[2023-03-13T11:01:08.475Z]         actual_replicas = re.search(r'replica_number_(.*?)_', name).group(1)

[2023-03-13T11:01:08.475Z] >       assert replicas_loaded == int(actual_replicas)

[2023-03-13T11:01:08.475Z] E       AssertionError: assert 0 == 2

[2023-03-13T11:01:08.475Z] E        +  where 2 = int('2')

[2023-03-13T11:01:08.475Z] 

[2023-03-13T11:01:08.475Z] testcases/test_action_second_deployment.py:119: AssertionError

[2023-03-13T11:01:08.475Z] ------------------------------ Captured log setup ------------------------------

[2023-03-13T11:01:08.475Z] [2023-03-13 10:56:17 - INFO - ci_test]: *********************************** setup *********************************** (client_base.py:39)

[2023-03-13T11:01:08.475Z] [2023-03-13 10:56:17 - INFO - ci_test]: [setup_method] Start setup test case test_check. (client_base.py:40)

[2023-03-13T11:01:08.475Z] ------------------------------ Captured log call -------------------------------

[2023-03-13T11:01:08.475Z] [2023-03-13 10:56:17 - DEBUG - ci_test]: (api_request)  : [Connections.connect] args: ['default'], kwargs: {'host': '10.101.14.41', 'port': 19530} (api_request.py:56)

[2023-03-13T11:01:08.475Z] [2023-03-13 10:56:17 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-03-13T11:01:08.475Z] [2023-03-13 10:56:17 - DEBUG - ci_test]: (api_request)  : [Collection] args: ['deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000', None, 'default', 2], kwargs: {'consistency_level': 'Strong'} (api_request.py:56)

[2023-03-13T11:01:08.475Z] [2023-03-13 10:56:17 - DEBUG - ci_test]: (api_response) : <Collection>:

[2023-03-13T11:01:08.475Z] -------------

[2023-03-13T11:01:08.475Z] <name>: deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000

[2023-03-13T11:01:08.475Z] <partitions>: [{"name": "_default", "collection_name": "deploy_test_index_type_BIN_......  (api_request.py:31)

[2023-03-13T11:01:08.475Z] [2023-03-13 10:56:17 - DEBUG - ci_test]: (api_request)  : [wait_for_loading_complete] args: ['deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000', None, 20, 'default'], kwargs: {} (api_request.py:56)

[2023-03-13T11:01:08.475Z] [2023-03-13 10:56:17 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-03-13T11:01:08.475Z] [2023-03-13 10:56:17 - INFO - ci_test]: wait for deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000 loading complete cost 0.0041506290435791016 (test_action_second_deployment.py:106)

[2023-03-13T11:01:08.475Z] [2023-03-13 10:56:17 - ERROR - pymilvus.decorators]: RPC error: [get_replicas], <MilvusException: (code=15, message=failed to get replica info, err=failed to get shard leader for shard collectionID:440061701486776054 channelName:"by-dev-rootcoord-dml_132_440061701486776054v0" seek_position:<channel_name:"by-dev-rootcoord-dml_132_440061701486776054v0" msgID:"\021\000\000\000\000\000\000\000" msgGroup:"by-dev-dataNode-24-by-dev-rootcoord-dml_132_440061701486776054v0" timestamp:440061817125601281 > unflushedSegmentIds:440061701486776177 , the collection not loaded or leader is offline[NodeNotFound(0)])>, <Time:{'RPC start': '2023-03-13 10:56:17.850885', 'RPC error': '2023-03-13 10:56:17.857130'}> (decorators.py:108)

[2023-03-13T11:01:08.475Z] [2023-03-13 10:56:17 - ERROR - ci_test]: <MilvusException: (code=15, message=failed to get replica info, err=failed to get shard leader for shard collectionID:440061701486776054 channelName:"by-dev-rootcoord-dml_132_440061701486776054v0" seek_position:<channel_name:"by-dev-rootcoord-dml_132_440061701486776054v0" msgID:"\021\000\000\000\000\000\000\000" msgGroup:"by-dev-dataNode-24-by-dev-rootcoord-dml_132_440061701486776054v0" timestamp:440061817125601281 > unflushedSegmentIds:440061701486776177 , the collection not loaded or leader is offline[NodeNotFound(0)])> (test_action_second_deployment.py:114)

[2023-03-13T11:01:08.475Z] [2023-03-13 10:56:17 - INFO - ci_test]: collection deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000 has 0 replicas (test_action_second_deployment.py:117)

[2023-03-13T11:01:08.475Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2023-03-13T11:01:08.475Z] =========================== short test summary info ============================

[2023-03-13T11:01:08.475Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - AssertionError: assert 0 == 2

[2023-03-13T11:01:08.475Z]  +  where 2 = int('2')

[2023-03-13T11:01:08.475Z] ================== 1 failed, 49 passed in 1413.72s (0:23:33) ===================

Expected Behavior

This collection was loaded with 2 replicas before the upgrade. After the upgrade, the replica number should still be 2.
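The failing assertion compares the number of loaded replica groups with a number parsed from the collection name. A minimal sketch of that parsing step, using the same regular expression as the test above (the helper name `expected_replicas` is hypothetical and not part of the test suite):

```python
import re


def expected_replicas(collection_name: str) -> int:
    """Parse the replica number encoded in a deploy-test collection name."""
    match = re.search(r"replica_number_(.*?)_", collection_name)
    if match is None:
        raise ValueError(f"no replica number found in {collection_name!r}")
    return int(match.group(1))


# Collection name taken from the failing test case in this report.
name = (
    "deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_"
    "segment_status_only_growing_is_string_indexed_not_string_indexed_"
    "replica_number_2_is_deleted_is_deleted_data_size_3000"
)
print(expected_replicas(name))
```

Because the test catches the `get_replicas` exception and falls back to `replicas_loaded = 0`, the reported `assert 0 == 2` means the RPC failed, not that the server returned an empty replica list.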

Steps To Reproduce

No response

Milvus Log

- milvus mode: cluster
- deploy task: upgrade
- old image tag: v2.2.3
- new image tag: 2.2.0-20230310-b2ece6a5
- failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/522/pipeline
- log:
  - artifacts-kafka-cluster-upgrade-522-server-second-deployment-logs.tar.gz
  - artifacts-kafka-cluster-upgrade-522-server-first-deployment-logs.tar.gz
  - artifacts-kafka-cluster-upgrade-522-pytest-logs.tar.gz

Anything else?

No response

zhuwenxing avatar Mar 13 '23 11:03 zhuwenxing

/assign @jiaoew1991

I guess the root cause was a load failure.

/unassign

yanliang567 avatar Mar 13 '23 11:03 yanliang567


[2023-03-14T11:03:12.524Z] <name>: deploy_test_index_type_HNSW_is_compacted_not_compacted_segment_status_all_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000

[2023-03-14T11:03:12.524Z] <partitions>: [{"name": "_default", "collection_name": "deploy_test_index_type_HNSW_is_compacted_not......  (api_request.py:31)

[2023-03-14T11:03:12.524Z] [2023-03-14 11:00:39 - DEBUG - ci_test]: (api_request)  : [wait_for_loading_complete] args: ['deploy_test_index_type_HNSW_is_compacted_not_compacted_segment_status_all_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000', None, 20, 'default'], kwargs: {} (api_request.py:56)

[2023-03-14T11:03:12.524Z] [2023-03-14 11:00:39 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-03-14T11:03:12.524Z] [2023-03-14 11:00:39 - INFO - ci_test]: wait for deploy_test_index_type_HNSW_is_compacted_not_compacted_segment_status_all_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000 loading complete cost 0.002306222915649414 (test_action_second_deployment.py:106)

[2023-03-14T11:03:12.524Z] [2023-03-14 11:00:39 - ERROR - pymilvus.decorators]: RPC error: [get_replicas], <MilvusException: (code=15, message=failed to get replica info, err=failed to get shard leader for shard collectionID:440084580325221241 channelName:"by-dev-rootcoord-dml_54_440084580325221241v0" seek_position:<channel_name:"by-dev-rootcoord-dml_54" msgID:"\337\013\000\000\000\000\000\000" msgGroup:"by-dev-dataNode-10-by-dev-rootcoord-dml_54_440084580325221241v0" timestamp:440084823232479233 > unflushedSegmentIds:440084580325422051 flushedSegmentIds:440084580325423835 dropped_segmentIds:440084580325221517 dropped_segmentIds:440084580325422061 dropped_segmentIds:440084580325421799 dropped_segmentIds:440084580325421959 dropped_segmentIds:440084580325421631 dropped_segmentIds:440084580325221285 , the collection not loaded or leader is offline[NodeNotFound(0)])>, <Time:{'RPC start': '2023-03-14 11:00:39.093974', 'RPC error': '2023-03-14 11:00:39.097337'}> (decorators.py:108)

[2023-03-14T11:03:12.524Z] [2023-03-14 11:00:39 - ERROR - ci_test]: <MilvusException: (code=15, message=failed to get replica info, err=failed to get shard leader for shard collectionID:440084580325221241 channelName:"by-dev-rootcoord-dml_54_440084580325221241v0" seek_position:<channel_name:"by-dev-rootcoord-dml_54" msgID:"\337\013\000\000\000\000\000\000" msgGroup:"by-dev-dataNode-10-by-dev-rootcoord-dml_54_440084580325221241v0" timestamp:440084823232479233 > unflushedSegmentIds:440084580325422051 flushedSegmentIds:440084580325423835 dropped_segmentIds:440084580325221517 dropped_segmentIds:440084580325422061 dropped_segmentIds:440084580325421799 dropped_segmentIds:440084580325421959 dropped_segmentIds:440084580325421631 dropped_segmentIds:440084580325221285 , the collection not loaded or leader is offline[NodeNotFound(0)])> (test_action_second_deployment.py:114)

[2023-03-14T11:03:12.524Z] [2023-03-14 11:00:39 - INFO - ci_test]: collection deploy_test_index_type_HNSW_is_compacted_not_compacted_segment_status_all_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000 has 0 replicas (test_action_second_deployment.py:117)

[2023-03-14T11:03:12.524Z] _ TestActionSecondDeployment.test_check[deploy_test_index_type_HNSW_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] _

[2023-03-14T11:03:12.524Z] [gw1] linux -- Python 3.8.10 /usr/bin/python3.8

[2023-03-14T11:03:12.524Z] 

[2023-03-14T11:03:12.524Z] self = <test_action_second_deployment.TestActionSecondDeployment object at 0x7ef98992afd0>

[2023-03-14T11:03:12.524Z] all_collection_name = 'deploy_test_index_type_HNSW_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000'

[2023-03-14T11:03:12.524Z] data_size = 3000

[2023-03-14T11:03:12.524Z] 

[2023-03-14T11:03:12.524Z]     @pytest.mark.tags(CaseLabel.L3)

[2023-03-14T11:03:12.525Z]     def test_check(self, all_collection_name, data_size):

[2023-03-14T11:03:12.525Z]         """

[2023-03-14T11:03:12.525Z]         before reinstall: create collection

[2023-03-14T11:03:12.525Z]         """

[2023-03-14T11:03:12.525Z]         self._connect()

[2023-03-14T11:03:12.525Z]         ms = MilvusSys()

[2023-03-14T11:03:12.525Z]         name = all_collection_name

[2023-03-14T11:03:12.525Z]         is_binary = False

[2023-03-14T11:03:12.525Z]         if "BIN" in name:

[2023-03-14T11:03:12.525Z]             is_binary = True

[2023-03-14T11:03:12.525Z]         collection_w, _ = self.collection_wrap.init_collection(name=name)

[2023-03-14T11:03:12.525Z]         self.collection_w = collection_w

[2023-03-14T11:03:12.525Z]         schema = collection_w.schema

[2023-03-14T11:03:12.525Z]         data_type = [field.dtype for field in schema.fields]

[2023-03-14T11:03:12.525Z]         field_name = [field.name for field in schema.fields]

[2023-03-14T11:03:12.525Z]         type_field_map = dict(zip(data_type, field_name))

[2023-03-14T11:03:12.525Z]         if is_binary:

[2023-03-14T11:03:12.525Z]             default_index_field = ct.default_binary_vec_field_name

[2023-03-14T11:03:12.525Z]             vector_index_type = "BIN_IVF_FLAT"

[2023-03-14T11:03:12.525Z]         else:

[2023-03-14T11:03:12.525Z]             default_index_field = ct.default_float_vec_field_name

[2023-03-14T11:03:12.525Z]             vector_index_type = "IVF_FLAT"

[2023-03-14T11:03:12.525Z]     

[2023-03-14T11:03:12.525Z]         binary_vector_index_types = [index.params["index_type"] for index in collection_w.indexes if

[2023-03-14T11:03:12.525Z]                                      index.field_name == type_field_map.get(100, "")]

[2023-03-14T11:03:12.525Z]         float_vector_index_types = [index.params["index_type"] for index in collection_w.indexes if

[2023-03-14T11:03:12.525Z]                                     index.field_name == type_field_map.get(101, "")]

[2023-03-14T11:03:12.525Z]         index_field_map = dict([(index.field_name, index.index_name) for index in collection_w.indexes])

[2023-03-14T11:03:12.525Z]         index_names = [index.index_name for index in collection_w.indexes]  # used to drop index

[2023-03-14T11:03:12.525Z]         vector_index_types = binary_vector_index_types + float_vector_index_types

[2023-03-14T11:03:12.525Z]         if len(vector_index_types) > 0:

[2023-03-14T11:03:12.525Z]             vector_index_type = vector_index_types[0]

[2023-03-14T11:03:12.525Z]         try:

[2023-03-14T11:03:12.525Z]             t0 = time.time()

[2023-03-14T11:03:12.525Z] [get_env_variable] failed to get environment variables : 'CI_LOG_PATH', use default path : /tmp/ci_logs

[2023-03-14T11:03:12.525Z]             self.utility_wrap.wait_for_loading_complete(name)

[2023-03-14T11:03:12.525Z]             log.info(f"wait for {name} loading complete cost {time.time() - t0}")

[2023-03-14T11:03:12.525Z]         except Exception as e:

[2023-03-14T11:03:12.525Z]             log.error(e)

[2023-03-14T11:03:12.525Z]         # get replicas loaded

[2023-03-14T11:03:12.525Z]         try:

[2023-03-14T11:03:12.525Z]             replicas = collection_w.get_replicas(enable_traceback=False)

[2023-03-14T11:03:12.525Z]             replicas_loaded = len(replicas.groups)

[2023-03-14T11:03:12.525Z]         except Exception as e:

[2023-03-14T11:03:12.525Z]             log.error(e)

[2023-03-14T11:03:12.525Z]             replicas_loaded = 0

[2023-03-14T11:03:12.525Z]     

[2023-03-14T11:03:12.525Z]         log.info(f"collection {name} has {replicas_loaded} replicas")

[2023-03-14T11:03:12.525Z]         actual_replicas = re.search(r'replica_number_(.*?)_', name).group(1)

[2023-03-14T11:03:12.525Z] >       assert replicas_loaded == int(actual_replicas)

[2023-03-14T11:03:12.525Z] E       AssertionError: assert 0 == 2

[2023-03-14T11:03:12.525Z] E        +  where 2 = int('2')

[2023-03-14T11:03:12.525Z] 

[2023-03-14T11:03:12.525Z] testcases/test_action_second_deployment.py:119: AssertionError

- milvus mode: cluster
- deploy task: upgrade
- old image tag: v2.2.3
- new image tag: 2.2.0-20230314-3aa28506
- log:
  - artifacts-kafka-cluster-upgrade-538-server-second-deployment-logs.tar.gz
  - artifacts-kafka-cluster-upgrade-538-server-first-deployment-logs.tar.gz
  - artifacts-kafka-cluster-upgrade-538-pytest-logs.tar.gz
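The error text in the log above carries the collection ID and DML channel of the shard that has no leader. As a minimal sketch for triaging several such failures, these fields can be extracted with regular expressions. The helper `parse_shard_error` is hypothetical (it is not part of pymilvus) and assumes the message format shown in this log:

```python
import re
from typing import Optional, Tuple


def parse_shard_error(message: str) -> Optional[Tuple[int, str]]:
    """Return (collection_id, channel_name) from a 'failed to get shard leader' message."""
    collection = re.search(r"collectionID:(\d+)", message)
    channel = re.search(r'channelName:"([^"]+)"', message)
    if collection is None or channel is None:
        return None  # message does not follow the assumed format
    return int(collection.group(1)), channel.group(1)


# Shortened copy of the error message from the log above.
message = (
    "failed to get replica info, err=failed to get shard leader for shard "
    "collectionID:440084580325221241 "
    'channelName:"by-dev-rootcoord-dml_54_440084580325221241v0" '
    'seek_position:<channel_name:"by-dev-rootcoord-dml_54" >'
)
print(parse_shard_error(message))
```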

zhuwenxing avatar Mar 14 '23 11:03 zhuwenxing

[image] We reassigned all nodes to one replica, which left the other replica without any node. As a result, getting the replicas fails, because one replica has no node and therefore no shard leader.
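The failure mode described here can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Replica` class, `shard_leader`, and `get_replicas` below are hypothetical names and do not reflect the actual Milvus coordinator code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Replica:
    replica_id: int
    nodes: List[int] = field(default_factory=list)

    def shard_leader(self) -> int:
        # In this toy model a shard leader must be one of the replica's own nodes.
        if not self.nodes:
            raise LookupError(f"replica {self.replica_id}: no node, no shard leader")
        return self.nodes[0]


def get_replicas(replicas: List[Replica]) -> Dict[int, int]:
    # The whole call fails if any single replica cannot name a shard leader.
    return {r.replica_id: r.shard_leader() for r in replicas}


# Before the upgrade: each replica owns one query node (node IDs are made up).
before = [Replica(1, [101]), Replica(2, [102])]
# After the upgrade: every node was reassigned to replica 1.
after = [Replica(1, [101, 102]), Replica(2, [])]

print(get_replicas(before))
try:
    get_replicas(after)
except LookupError as err:
    print("get_replicas failed:", err)
```

In this model one empty replica is enough to fail the entire request, which matches the test observing 0 replicas instead of 1.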

aoiasd avatar Mar 15 '23 09:03 aoiasd

related: https://github.com/milvus-io/milvus/issues/22782

aoiasd avatar Mar 15 '23 09:03 aoiasd

/assign @weiliu1031 /unassign

jiaoew1991 avatar Mar 16 '23 01:03 jiaoew1991

It also reproduces on 2.2.0-20230317-bbc21fe8.

yanliang567 avatar Mar 20 '23 03:03 yanliang567

[2023-03-20T14:13:45.159Z] =================================== FAILURES ===================================

[2023-03-20T14:13:45.159Z] _ TestActionSecondDeployment.test_check[deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] _

[2023-03-20T14:13:45.159Z] [gw1] linux -- Python 3.8.10 /usr/bin/python3.8

[2023-03-20T14:13:45.159Z] 

[2023-03-20T14:13:45.159Z] self = <test_action_second_deployment.TestActionSecondDeployment object at 0x7f5197766460>

[2023-03-20T14:13:45.159Z] all_collection_name = 'deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000'

[2023-03-20T14:13:45.159Z] data_size = 3000

[2023-03-20T14:13:45.159Z] 

[2023-03-20T14:13:45.159Z]     @pytest.mark.tags(CaseLabel.L3)

[2023-03-20T14:13:45.159Z]     def test_check(self, all_collection_name, data_size):

[2023-03-20T14:13:45.159Z]         """

[2023-03-20T14:13:45.159Z]         before reinstall: create collection

[2023-03-20T14:13:45.159Z]         """

[2023-03-20T14:13:45.159Z]         self._connect()

[2023-03-20T14:13:45.159Z]         ms = MilvusSys()

[2023-03-20T14:13:45.159Z]         name = all_collection_name

[2023-03-20T14:13:45.159Z]         is_binary = False

[2023-03-20T14:13:45.159Z]         if "BIN" in name:

[2023-03-20T14:13:45.159Z]             is_binary = True

[2023-03-20T14:13:45.159Z]         collection_w, _ = self.collection_wrap.init_collection(name=name)

[2023-03-20T14:13:45.159Z]         self.collection_w = collection_w

[2023-03-20T14:13:45.159Z]         schema = collection_w.schema

[2023-03-20T14:13:45.159Z]         data_type = [field.dtype for field in schema.fields]

[2023-03-20T14:13:45.159Z]         field_name = [field.name for field in schema.fields]

[2023-03-20T14:13:45.159Z]         type_field_map = dict(zip(data_type, field_name))

[2023-03-20T14:13:45.159Z]         if is_binary:

[2023-03-20T14:13:45.159Z]             default_index_field = ct.default_binary_vec_field_name

[2023-03-20T14:13:45.159Z]             vector_index_type = "BIN_IVF_FLAT"

[2023-03-20T14:13:45.159Z]         else:

[2023-03-20T14:13:45.159Z]             default_index_field = ct.default_float_vec_field_name

[2023-03-20T14:13:45.159Z]             vector_index_type = "IVF_FLAT"

[2023-03-20T14:13:45.159Z]     

[2023-03-20T14:13:45.159Z]         binary_vector_index_types = [index.params["index_type"] for index in collection_w.indexes if

[2023-03-20T14:13:45.159Z]                                      index.field_name == type_field_map.get(100, "")]

[2023-03-20T14:13:45.159Z]         float_vector_index_types = [index.params["index_type"] for index in collection_w.indexes if

[2023-03-20T14:13:45.159Z]                                     index.field_name == type_field_map.get(101, "")]

[2023-03-20T14:13:45.159Z]         index_field_map = dict([(index.field_name, index.index_name) for index in collection_w.indexes])

[2023-03-20T14:13:45.159Z]         index_names = [index.index_name for index in collection_w.indexes]  # used to drop index

[2023-03-20T14:13:45.159Z]         vector_index_types = binary_vector_index_types + float_vector_index_types

[2023-03-20T14:13:45.159Z]         if len(vector_index_types) > 0:

[2023-03-20T14:13:45.159Z]             vector_index_type = vector_index_types[0]

[2023-03-20T14:13:45.159Z]         try:

[2023-03-20T14:13:45.159Z]             t0 = time.time()

[2023-03-20T14:13:45.159Z]             self.utility_wrap.wait_for_loading_complete(name)

[2023-03-20T14:13:45.159Z]             log.info(f"wait for {name} loading complete cost {time.time() - t0}")

[2023-03-20T14:13:45.159Z]         except Exception as e:

[2023-03-20T14:13:45.159Z]             log.error(e)

[2023-03-20T14:13:45.159Z]         # get replicas loaded

[2023-03-20T14:13:45.159Z]         try:

[2023-03-20T14:13:45.159Z]             replicas = collection_w.get_replicas(enable_traceback=False)

[2023-03-20T14:13:45.159Z]             replicas_loaded = len(replicas.groups)

[2023-03-20T14:13:45.159Z]         except Exception as e:

[2023-03-20T14:13:45.159Z]             log.error(e)

[2023-03-20T14:13:45.159Z]             replicas_loaded = 0

[2023-03-20T14:13:45.159Z]     

[2023-03-20T14:13:45.159Z]         log.info(f"collection {name} has {replicas_loaded} replicas")

[2023-03-20T14:13:45.159Z]         actual_replicas = re.search(r'replica_number_(.*?)_', name).group(1)

[2023-03-20T14:13:45.159Z] >       assert replicas_loaded == int(actual_replicas)

[2023-03-20T14:13:45.159Z] E       AssertionError: assert 0 == 2

[2023-03-20T14:13:45.159Z] E        +  where 2 = int('2')

[2023-03-20T14:13:45.159Z] 

[2023-03-20T14:13:45.159Z] testcases/test_action_second_deployment.py:119: AssertionError

[2023-03-20T14:13:45.159Z] ------------------------------ Captured log setup ------------------------------

[2023-03-20T14:13:45.159Z] [2023-03-20 13:53:26 - INFO - ci_test]: *********************************** setup *********************************** (client_base.py:39)

[2023-03-20T14:13:45.159Z] [2023-03-20 13:53:26 - INFO - ci_test]: [setup_method] Start setup test case test_check. (client_base.py:40)

[2023-03-20T14:13:45.159Z] ------------------------------ Captured log call -------------------------------

[2023-03-20T14:13:45.159Z] [2023-03-20 13:53:26 - DEBUG - ci_test]: (api_request)  : [Connections.connect] args: ['default'], kwargs: {'host': '10.101.185.165', 'port': 19530} (api_request.py:56)

[2023-03-20T14:13:45.159Z] [2023-03-20 13:53:26 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-03-20T14:13:45.159Z] [2023-03-20 13:53:26 - DEBUG - ci_test]: (api_request)  : [Collection] args: ['deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000', None, 'default', 2], kwargs: {'consistency_level': 'Strong'} (api_request.py:56)

[2023-03-20T14:13:45.159Z] [2023-03-20 13:53:26 - DEBUG - ci_test]: (api_response) : <Collection>:

[2023-03-20T14:13:45.159Z] -------------

[2023-03-20T14:13:45.159Z] <name>: deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000

[2023-03-20T14:13:45.159Z] <partitions>: [{"name": "_default", "collection_name": "deploy_test_index_type_BIN_IVF_FLAT_......  (api_request.py:31)

[2023-03-20T14:13:45.159Z] [2023-03-20 13:53:26 - DEBUG - ci_test]: (api_request)  : [wait_for_loading_complete] args: ['deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000', None, 20, 'default'], kwargs: {} (api_request.py:56)

[2023-03-20T14:13:45.159Z] [2023-03-20 13:53:26 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-03-20T14:13:45.159Z] [2023-03-20 13:53:26 - INFO - ci_test]: wait for deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000 loading complete cost 0.0013630390167236328 (test_action_second_deployment.py:106)

[2023-03-20T14:13:45.159Z] [2023-03-20 13:53:26 - ERROR - pymilvus.decorators]: RPC error: [get_replicas], <MilvusException: (code=15, message=failed to get replica info, err=failed to get shard leader for shard collectionID:440222522875749049 channelName:"by-dev-rootcoord-dml_131_440222522875749049v1" seek_position:<channel_name:"by-dev-rootcoord-dml_131_440222522875749049v1" msgID:"\326\000\000\000\000\000\000\000" msgGroup:"by-dev-dataNode-6-by-dev-rootcoord-dml_131_440222522875749049v1" timestamp:440222656096894978 > flushedSegmentIds:440222522875750116 dropped_segmentIds:440222522875750009 dropped_segmentIds:440222522875749617 dropped_segmentIds:440222522875749741 dropped_segmentIds:440222522875749083 dropped_segmentIds:440222522875749986 dropped_segmentIds:440222522875749540 , the collection not loaded or leader is offline[NodeNotFound(0)])>, <Time:{'RPC start': '2023-03-20 13:53:26.895052', 'RPC error': '2023-03-20 13:53:26.897301'}> (decorators.py:108)

[2023-03-20T14:13:45.159Z] [2023-03-20 13:53:26 - ERROR - ci_test]: <MilvusException: (code=15, message=failed to get replica info, err=failed to get shard leader for shard collectionID:440222522875749049 channelName:"by-dev-rootcoord-dml_131_440222522875749049v1" seek_position:<channel_name:"by-dev-rootcoord-dml_131_440222522875749049v1" msgID:"\326\000\000\000\000\000\000\000" msgGroup:"by-dev-dataNode-6-by-dev-rootcoord-dml_131_440222522875749049v1" timestamp:440222656096894978 > flushedSegmentIds:440222522875750116 dropped_segmentIds:440222522875750009 dropped_segmentIds:440222522875749617 dropped_segmentIds:440222522875749741 dropped_segmentIds:440222522875749083 dropped_segmentIds:440222522875749986 dropped_segmentIds:440222522875749540 , the collection not loaded or leader is offline[NodeNotFound(0)])> (test_action_second_deployment.py:114)

[2023-03-20T14:13:45.159Z] [2023-03-20 13:53:26 - INFO - ci_test]: collection deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000 has 0 replicas (test_action_second_deployment.py:117)

[2023-03-20T14:13:45.159Z] _ TestActionSecondDeployment.test_check[deploy_test_index_type_HNSW_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] _

[2023-03-20T14:13:45.159Z] [gw1] linux -- Python 3.8.10 /usr/bin/python3.8

[2023-03-20T14:13:45.159Z] 

[2023-03-20T14:13:45.159Z] self = <test_action_second_deployment.TestActionSecondDeployment object at 0x7f5197766c40>

[2023-03-20T14:13:45.159Z] all_collection_name = 'deploy_test_index_type_HNSW_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000'

[2023-03-20T14:13:45.159Z] data_size = 3000

[2023-03-20T14:13:45.159Z] 

[2023-03-20T14:13:45.159Z]     @pytest.mark.tags(CaseLabel.L3)

[2023-03-20T14:13:45.159Z]     def test_check(self, all_collection_name, data_size):

[2023-03-20T14:13:45.159Z]         """

[2023-03-20T14:13:45.159Z]         before reinstall: create collection

[2023-03-20T14:13:45.159Z]         """

[2023-03-20T14:13:45.159Z]         self._connect()

[2023-03-20T14:13:45.159Z]         ms = MilvusSys()

[2023-03-20T14:13:45.159Z]         name = all_collection_name

[2023-03-20T14:13:45.159Z]         is_binary = False

[2023-03-20T14:13:45.159Z]         if "BIN" in name:

[2023-03-20T14:13:45.159Z]             is_binary = True

[2023-03-20T14:13:45.160Z] [get_env_variable] failed to get environment variables : 'CI_LOG_PATH', use default path : /tmp/ci_logs

[2023-03-20T14:13:45.160Z]         collection_w, _ = self.collection_wrap.init_collection(name=name)

[2023-03-20T14:13:45.160Z]         self.collection_w = collection_w

[2023-03-20T14:13:45.160Z]         schema = collection_w.schema

[2023-03-20T14:13:45.160Z]         data_type = [field.dtype for field in schema.fields]

[2023-03-20T14:13:45.160Z]         field_name = [field.name for field in schema.fields]

[2023-03-20T14:13:45.160Z]         type_field_map = dict(zip(data_type, field_name))

[2023-03-20T14:13:45.160Z]         if is_binary:

[2023-03-20T14:13:45.160Z]             default_index_field = ct.default_binary_vec_field_name

[2023-03-20T14:13:45.160Z]             vector_index_type = "BIN_IVF_FLAT"

[2023-03-20T14:13:45.160Z]         else:

[2023-03-20T14:13:45.160Z]             default_index_field = ct.default_float_vec_field_name

[2023-03-20T14:13:45.160Z]             vector_index_type = "IVF_FLAT"

[2023-03-20T14:13:45.160Z]     

[2023-03-20T14:13:45.160Z]         binary_vector_index_types = [index.params["index_type"] for index in collection_w.indexes if

[2023-03-20T14:13:45.160Z]                                      index.field_name == type_field_map.get(100, "")]

[2023-03-20T14:13:45.160Z]         float_vector_index_types = [index.params["index_type"] for index in collection_w.indexes if

[2023-03-20T14:13:45.160Z]                                     index.field_name == type_field_map.get(101, "")]

[2023-03-20T14:13:45.160Z]         index_field_map = dict([(index.field_name, index.index_name) for index in collection_w.indexes])

[2023-03-20T14:13:45.160Z]         index_names = [index.index_name for index in collection_w.indexes]  # used to drop index

[2023-03-20T14:13:45.160Z]         vector_index_types = binary_vector_index_types + float_vector_index_types

[2023-03-20T14:13:45.160Z]         if len(vector_index_types) > 0:

[2023-03-20T14:13:45.160Z]             vector_index_type = vector_index_types[0]

[2023-03-20T14:13:45.160Z]         try:

[2023-03-20T14:13:45.160Z]             t0 = time.time()

[2023-03-20T14:13:45.160Z]             self.utility_wrap.wait_for_loading_complete(name)

[2023-03-20T14:13:45.160Z]             log.info(f"wait for {name} loading complete cost {time.time() - t0}")

[2023-03-20T14:13:45.160Z]         except Exception as e:

[2023-03-20T14:13:45.160Z]             log.error(e)

[2023-03-20T14:13:45.160Z]         # get replicas loaded

[2023-03-20T14:13:45.160Z]         try:

[2023-03-20T14:13:45.160Z]             replicas = collection_w.get_replicas(enable_traceback=False)

[2023-03-20T14:13:45.160Z]             replicas_loaded = len(replicas.groups)

[2023-03-20T14:13:45.160Z]         except Exception as e:

[2023-03-20T14:13:45.160Z]             log.error(e)

[2023-03-20T14:13:45.160Z]             replicas_loaded = 0

[2023-03-20T14:13:45.160Z]     

[2023-03-20T14:13:45.160Z]         log.info(f"collection {name} has {replicas_loaded} replicas")

[2023-03-20T14:13:45.160Z]         actual_replicas = re.search(r'replica_number_(.*?)_', name).group(1)

[2023-03-20T14:13:45.160Z] >       assert replicas_loaded == int(actual_replicas)

[2023-03-20T14:13:45.160Z] E       AssertionError: assert 0 == 2

[2023-03-20T14:13:45.160Z] E        +  where 2 = int('2')

[2023-03-20T14:13:45.160Z] 

[2023-03-20T14:13:45.160Z] testcases/test_action_second_deployment.py:119: AssertionError

[2023-03-20T14:13:45.160Z] ------------------------------ Captured log setup ------------------------------

[2023-03-20T14:13:45.160Z] [2023-03-20 14:08:40 - INFO - ci_test]: *********************************** setup *********************************** (client_base.py:39)

[2023-03-20T14:13:45.160Z] [2023-03-20 14:08:40 - INFO - ci_test]: [setup_method] Start setup test case test_check. (client_base.py:40)

[2023-03-20T14:13:45.160Z] ------------------------------ Captured log call -------------------------------

[2023-03-20T14:13:45.160Z] [2023-03-20 14:08:40 - DEBUG - ci_test]: (api_request)  : [Connections.connect] args: ['default'], kwargs: {'host': '10.101.185.165', 'port': 19530} (api_request.py:56)

[2023-03-20T14:13:45.160Z] [2023-03-20 14:08:40 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-03-20T14:13:45.160Z] [2023-03-20 14:08:40 - DEBUG - ci_test]: (api_request)  : [Collection] args: ['deploy_test_index_type_HNSW_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000', None, 'default', 2], kwargs: {'consistency_level': 'Strong'} (api_request.py:56)

[2023-03-20T14:13:45.160Z] [2023-03-20 14:08:40 - DEBUG - ci_test]: (api_response) : <Collection>:

[2023-03-20T14:13:45.160Z] -------------

[2023-03-20T14:13:45.160Z] <name>: deploy_test_index_type_HNSW_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000

[2023-03-20T14:13:45.160Z] <partitions>: [{"name": "_default", "collection_name": "deploy_test_index_type_HNSW_is_compacted_is_......  (api_request.py:31)

[2023-03-20T14:13:45.160Z] [2023-03-20 14:08:40 - DEBUG - ci_test]: (api_request)  : [wait_for_loading_complete] args: ['deploy_test_index_type_HNSW_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000', None, 20, 'default'], kwargs: {} (api_request.py:56)

[2023-03-20T14:13:45.160Z] [2023-03-20 14:08:40 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-03-20T14:13:45.160Z] [2023-03-20 14:08:40 - INFO - ci_test]: wait for deploy_test_index_type_HNSW_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000 loading complete cost 0.0014719963073730469 (test_action_second_deployment.py:106)

[2023-03-20T14:13:45.160Z] [2023-03-20 14:08:40 - ERROR - pymilvus.decorators]: RPC error: [get_replicas], <MilvusException: (code=15, message=failed to get replica info, err=failed to get shard leader for shard collectionID:440222522875142971 channelName:"by-dev-rootcoord-dml_54_440222522875142971v0" seek_position:<channel_name:"by-dev-rootcoord-dml_54_440222522875142971v0" msgID:"9\n\000\000\000\000\000\000" msgGroup:"by-dev-dataNode-24-by-dev-rootcoord-dml_54_440222522875142971v0" timestamp:440222776608686082 > unflushedSegmentIds:440222522875343687 flushedSegmentIds:440222522875344951 flushedSegmentIds:440222522875343680 dropped_segmentIds:440222522875343330 dropped_segmentIds:440222522875343225 dropped_segmentIds:440222522875343569 dropped_segmentIds:440222522875142985 dropped_segmentIds:440222522875343444 , the collection not loaded or leader is offline[NodeNotFound(0)])>, <Time:{'RPC start': '2023-03-20 14:08:40.719219', 'RPC error': '2023-03-20 14:08:40.721647'}> (decorators.py:108)

[2023-03-20T14:13:45.160Z] [2023-03-20 14:08:40 - ERROR - ci_test]: <MilvusException: (code=15, message=failed to get replica info, err=failed to get shard leader for shard collectionID:440222522875142971 channelName:"by-dev-rootcoord-dml_54_440222522875142971v0" seek_position:<channel_name:"by-dev-rootcoord-dml_54_440222522875142971v0" msgID:"9\n\000\000\000\000\000\000" msgGroup:"by-dev-dataNode-24-by-dev-rootcoord-dml_54_440222522875142971v0" timestamp:440222776608686082 > unflushedSegmentIds:440222522875343687 flushedSegmentIds:440222522875344951 flushedSegmentIds:440222522875343680 dropped_segmentIds:440222522875343330 dropped_segmentIds:440222522875343225 dropped_segmentIds:440222522875343569 dropped_segmentIds:440222522875142985 dropped_segmentIds:440222522875343444 , the collection not loaded or leader is offline[NodeNotFound(0)])> (test_action_second_deployment.py:114)

[2023-03-20T14:13:45.160Z] [2023-03-20 14:08:40 - INFO - ci_test]: collection deploy_test_index_type_HNSW_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000 has 0 replicas (test_action_second_deployment.py:117)

[2023-03-20T14:13:45.160Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2023-03-20T14:13:45.160Z] =========================== short test summary info ============================

[2023-03-20T14:13:45.160Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - AssertionError: assert 0 == 2

[2023-03-20T14:13:45.160Z]  +  where 2 = int('2')

[2023-03-20T14:13:45.160Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_HNSW_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - AssertionError: assert 0 == 2

[2023-03-20T14:13:45.160Z]  +  where 2 = int('2')

[2023-03-20T14:13:45.160Z] ================== 2 failed, 48 passed in 4442.39s (1:14:02) ===================

v2.2.3 --> 2.2.0-20230320-61692278 failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/606/pipeline log: artifacts-kafka-cluster-upgrade-606-server-second-deployment-logs.tar.gz artifacts-kafka-cluster-upgrade-606-server-first-deployment-logs.tar.gz artifacts-kafka-cluster-upgrade-606-pytest-logs.tar.gz
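
For reference, the failing assertion in the log compares the number of loaded replica groups with the number encoded in the collection name, and treats any error from `get_replicas` as zero replicas. A minimal standalone sketch of that check follows. `fetch_groups` is a hypothetical stand-in for the real `get_replicas` call, so the snippet runs without a Milvus connection; it is an illustration, not the test suite's actual code.

```python
import re
from typing import Callable, Sequence, Tuple


def expected_replicas(collection_name: str) -> int:
    """Read the replica count encoded in the name, e.g. '..._replica_number_2_...'."""
    match = re.search(r"replica_number_(.*?)_", collection_name)
    if match is None:
        raise ValueError(f"no replica_number_<n>_ marker in {collection_name!r}")
    return int(match.group(1))


def loaded_vs_expected(
    collection_name: str,
    fetch_groups: Callable[[], Sequence[object]],  # hypothetical stand-in for get_replicas
) -> Tuple[int, int]:
    """Return (replicas_loaded, replicas_expected); any fetch error counts as 0 loaded."""
    try:
        loaded = len(fetch_groups())
    except Exception:
        # Mirrors the test in the log: an RPC error is logged and treated as 0 replicas.
        loaded = 0
    return loaded, expected_replicas(collection_name)


def failing_fetch() -> Sequence[object]:
    # Simulates the "failed to get replica info" RPC error seen in the log.
    raise RuntimeError("failed to get replica info")


name = "deploy_test_index_type_HNSW_replica_number_2_is_deleted_data_size_3000"
print(loaded_vs_expected(name, failing_fetch))
```

Because the error path collapses to 0, the `assert 0 == 2` in the log indicates that `get_replicas` itself failed, not that only some replicas were loaded.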

zhuwenxing avatar Mar 21 '23 02:03 zhuwenxing

The root cause is that the load failed, which came from two problems:

  1. All nodes were assigned to one of the replicas during the rolling upgrade. This is related to #22782 and is WIP.
  2. A pChannel name was passed where a vChannel name was expected, which caused consuming from the MQ to fail. This is already fixed by #22721.
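
To illustrate the second problem: in the logs in this thread, a virtual channel (vChannel) name such as `by-dev-rootcoord-dml_54_440222522875142971v0` looks like a physical channel (pChannel) name followed by a `_<collectionID>v<index>` suffix. The sketch below assumes that naming convention holds. The helper names are hypothetical and are not Milvus APIs; the snippet only shows that the two kinds of name are distinct strings and should not be used interchangeably.

```python
import re

# Assumed convention, inferred from the channel names in this issue's logs:
#   vChannel = "<pChannel>_<collectionID>v<index>"
_VCHANNEL_SUFFIX = re.compile(r"_\d+v\d+$")


def is_vchannel(name: str) -> bool:
    """True if the name ends with the assumed '_<collectionID>v<index>' suffix."""
    return _VCHANNEL_SUFFIX.search(name) is not None


def to_pchannel(name: str) -> str:
    """Strip the assumed vChannel suffix; a pChannel name is returned unchanged."""
    return _VCHANNEL_SUFFIX.sub("", name)


vchannel = "by-dev-rootcoord-dml_54_440222522875142971v0"
pchannel = to_pchannel(vchannel)

print(pchannel)
print(is_vchannel(vchannel), is_vchannel(pchannel))
```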

weiliu1031 avatar Mar 24 '23 03:03 weiliu1031

@zhuwenxing please verify the second part first.

weiliu1031 avatar Mar 24 '23 03:03 weiliu1031

It is still reproduced in 2.2.0-20230324-a59dc9cb. @weiliu1031 PTAL. Failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_for_release_cron/detail/deploy_test_for_release_cron/68/pipeline

[2023-03-26T11:57:52.001Z] <name>: deploy_test_index_type_HNSW_is_compacted_not_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000

[2023-03-26T11:57:52.001Z] <partitions>: [{"name": "_default", "collection_name": "deploy_test_index_type_HNSW_is_compacted_no......  (api_request.py:31)

[2023-03-26T11:57:52.001Z] [2023-03-26 11:55:27 - DEBUG - ci_test]: (api_request)  : [wait_for_loading_complete] args: ['deploy_test_index_type_HNSW_is_compacted_not_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000', None, 20, 'default'], kwargs: {} (api_request.py:56)

[2023-03-26T11:57:52.001Z] [2023-03-26 11:55:27 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-03-26T11:57:52.001Z] [2023-03-26 11:55:27 - INFO - ci_test]: wait for deploy_test_index_type_HNSW_is_compacted_not_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000 loading complete cost 0.002002239227294922 (test_action_second_deployment.py:106)

[2023-03-26T11:57:52.001Z] [2023-03-26 11:55:27 - ERROR - pymilvus.decorators]: RPC error: [get_replicas], <MilvusException: (code=15, message=failed to get replica info, err=failed to get shard leader for shard collectionID:440357182843681219 channelName:"by-dev-rootcoord-dml_86_440357182843681219v0" seek_position:<channel_name:"by-dev-rootcoord-dml_86_440357182843681219v0" msgID:"\010V\020\213\001\030\000 \000" msgGroup:"by-dev-dataNode-13-by-dev-rootcoord-dml_86_440357182843681219v0" timestamp:440357282411446273 > unflushedSegmentIds:440357182843681806 flushedSegmentIds:440357182843882764 dropped_segmentIds:440357182843681332 dropped_segmentIds:440357182843681406 dropped_segmentIds:440357182843681679 dropped_segmentIds:440357182843681669 dropped_segmentIds:440357182843681230 dropped_segmentIds:440357182843681623 , the collection not loaded or leader is offline[NodeNotFound(0)])>, <Time:{'RPC start': '2023-03-26 11:55:27.284975', 'RPC error': '2023-03-26 11:55:27.287056'}> (decorators.py:108)

[2023-03-26T11:57:52.001Z] [2023-03-26 11:55:27 - ERROR - ci_test]: <MilvusException: (code=15, message=failed to get replica info, err=failed to get shard leader for shard collectionID:440357182843681219 channelName:"by-dev-rootcoord-dml_86_440357182843681219v0" seek_position:<channel_name:"by-dev-rootcoord-dml_86_440357182843681219v0" msgID:"\010V\020\213\001\030\000 \000" msgGroup:"by-dev-dataNode-13-by-dev-rootcoord-dml_86_440357182843681219v0" timestamp:440357282411446273 > unflushedSegmentIds:440357182843681806 flushedSegmentIds:440357182843882764 dropped_segmentIds:440357182843681332 dropped_segmentIds:440357182843681406 dropped_segmentIds:440357182843681679 dropped_segmentIds:440357182843681669 dropped_segmentIds:440357182843681230 dropped_segmentIds:440357182843681623 , the collection not loaded or leader is offline[NodeNotFound(0)])> (test_action_second_deployment.py:114)

[2023-03-26T11:57:52.001Z] [2023-03-26 11:55:27 - INFO - ci_test]: collection deploy_test_index_type_HNSW_is_compacted_not_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000 has 0 replicas (test_action_second_deployment.py:117)

[2023-03-26T11:57:52.001Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2023-03-26T11:57:52.001Z] =========================== short test summary info ============================

[2023-03-26T11:57:52.001Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_BIN_IVF_FLAT_is_compacted_not_compacted_segment_status_all_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - AssertionError: assert 0 == 2

[2023-03-26T11:57:52.001Z]  +  where 2 = int('2')

[2023-03-26T11:57:52.001Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_HNSW_is_compacted_not_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - AssertionError: assert 0 == 2

[2023-03-26T11:57:52.001Z]  +  where 2 = int('2')

[2023-03-26T11:57:52.001Z] =================== 2 failed, 48 passed in 955.31s (0:15:55) ===================

log: artifacts-pulsar-cluster-upgrade-68-server-logs.tar.gz

artifacts-pulsar-cluster-upgrade-68-pytest-logs.tar.gz

zhuwenxing avatar Mar 27 '23 06:03 zhuwenxing

https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_for_release_cron/detail/deploy_test_for_release_cron/68/pipeline

Based on the log, the second problem mentioned above is already fixed. The first problem needs more design work and is WIP.

weiliu1031 avatar Mar 27 '23 08:03 weiliu1031

It is also reproduced when upgrading v2.2.5 --> master-latest. Failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_cron/detail/deploy_test_kafka_cron/688/pipeline/

[2023-04-25T11:12:49.499Z] ------------------------------ Captured log call -------------------------------

[2023-04-25T11:12:49.499Z] [get_env_variable] failed to get environment variables : 'CI_LOG_PATH', use default path : /tmp/ci_logs

[2023-04-25T11:12:49.499Z] [2023-04-25 10:56:10 - DEBUG - ci_test]: (api_request)  : [Connections.connect] args: ['default'], kwargs: {'host': '10.101.49.137', 'port': 19530} (api_request.py:56)

[2023-04-25T11:12:49.499Z] [2023-04-25 10:56:10 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-04-25T11:12:49.499Z] [2023-04-25 10:56:10 - DEBUG - ci_test]: (api_request)  : [Collection] args: ['deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000', None, 'default', 2], kwargs: {'consistency_level': 'Strong'} (api_request.py:56)

[2023-04-25T11:12:49.499Z] [2023-04-25 10:56:10 - DEBUG - ci_test]: (api_response) : <Collection>:

[2023-04-25T11:12:49.499Z] -------------

[2023-04-25T11:12:49.499Z] <name>: deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000

[2023-04-25T11:12:49.499Z] <partitions>: [{"name": "_default", "collection_name": "deploy_test_index_type_BIN_......  (api_request.py:31)

[2023-04-25T11:12:49.499Z] [2023-04-25 10:56:10 - DEBUG - ci_test]: (api_request)  : [wait_for_loading_complete] args: ['deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000', None, 20, 'default'], kwargs: {} (api_request.py:56)

[2023-04-25T11:12:49.499Z] [2023-04-25 10:56:10 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-04-25T11:12:49.499Z] [2023-04-25 10:56:10 - INFO - ci_test]: wait for deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000 loading complete cost 0.0013170242309570312 (test_action_second_deployment.py:106)

[2023-04-25T11:12:49.499Z] [2023-04-25 10:56:10 - ERROR - pymilvus.decorators]: RPC error: [get_replicas], <MilvusException: (code=15, message=failed to get replica info, err=failed to get shard leader for shard collectionID:441035718610425131 channelName:"by-dev-rootcoord-dml_98_441035718610425131v0" seek_position:<channel_name:"by-dev-rootcoord-dml_98_441035718610425131v0" msgID:"1\000\000\000\000\000\000\000" msgGroup:"by-dev-dataNode-10-by-dev-rootcoord-dml_98_441035718610425131v0" timestamp:441035880914747392 > unflushedSegmentIds:441035718610425358 , the collection not loaded or leader is offline[NodeNotFound(0)])>, <Time:{'RPC start': '2023-04-25 10:56:10.696751', 'RPC error': '2023-04-25 10:56:10.699063'}> (decorators.py:108)

[2023-04-25T11:12:49.499Z] [2023-04-25 10:56:10 - ERROR - ci_test]: <MilvusException: (code=15, message=failed to get replica info, err=failed to get shard leader for shard collectionID:441035718610425131 channelName:"by-dev-rootcoord-dml_98_441035718610425131v0" seek_position:<channel_name:"by-dev-rootcoord-dml_98_441035718610425131v0" msgID:"1\000\000\000\000\000\000\000" msgGroup:"by-dev-dataNode-10-by-dev-rootcoord-dml_98_441035718610425131v0" timestamp:441035880914747392 > unflushedSegmentIds:441035718610425358 , the collection not loaded or leader is offline[NodeNotFound(0)])> (test_action_second_deployment.py:114)

[2023-04-25T11:12:49.499Z] [2023-04-25 10:56:10 - INFO - ci_test]: collection deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000 has 0 replicas (test_action_second_deployment.py:117)

[2023-04-25T11:12:49.499Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2023-04-25T11:12:49.499Z] =========================== short test summary info ============================

[2023-04-25T11:12:49.499Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_HNSW_is_compacted_not_compacted_segment_status_only_growing_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - AssertionError: assert 0 == 2

[2023-04-25T11:12:49.499Z]  +  where 2 = int('2')

[2023-04-25T11:12:49.499Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_HNSW_is_compacted_not_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - AssertionError: assert 0 == 2

[2023-04-25T11:12:49.499Z]  +  where 2 = int('2')

[2023-04-25T11:12:49.499Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_HNSW_is_compacted_not_compacted_segment_status_only_growing_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - AssertionError: assert 0 == 2

[2023-04-25T11:12:49.499Z]  +  where 2 = int('2')

[2023-04-25T11:12:49.499Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_HNSW_is_compacted_not_compacted_segment_status_all_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - AssertionError: assert 0 == 2

[2023-04-25T11:12:49.499Z]  +  where 2 = int('2')

[2023-04-25T11:12:49.499Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_HNSW_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - AssertionError: assert 0 == 2

[2023-04-25T11:12:49.499Z]  +  where 2 = int('2')

[2023-04-25T11:12:49.499Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_BIN_IVF_FLAT_is_compacted_not_compacted_segment_status_only_growing_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - AssertionError: assert 0 == 2

[2023-04-25T11:12:49.499Z]  +  where 2 = int('2')

[2023-04-25T11:12:49.499Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - AssertionError: assert 0 == 2

[2023-04-25T11:12:49.499Z]  +  where 2 = int('2')

[2023-04-25T11:12:49.499Z] ================== 7 failed, 43 passed in 1680.76s (0:28:00) ===================

log: artifacts-kafka-cluster-upgrade-688-server-second-deployment-logs.tar.gz

artifacts-kafka-cluster-upgrade-688-server-first-deployment-logs.tar.gz

artifacts-kafka-cluster-upgrade-688-pytest-logs.tar.gz

zhuwenxing avatar Apr 26 '23 03:04 zhuwenxing

This should be fixed by #23415 and #23626. Please verify, @zhuwenxing.

weiliu1031 avatar May 19 '23 02:05 weiliu1031

/assign @zhuwenxing

weiliu1031 avatar May 19 '23 02:05 weiliu1031

It is still reproduced in 2.2.0

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/1115/pipeline

[2023-06-26T08:15:04.286Z] =========================== short test summary info ============================

[2023-06-26T08:15:04.286Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_HNSW_is_compacted_not_compacted_segment_status_all_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - AssertionError: assert 0 == 2

[2023-06-26T08:15:04.286Z]  +  where 2 = int('2')

[2023-06-26T08:15:04.286Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - AssertionError: assert 0 == 2

[2023-06-26T08:15:04.286Z]  +  where 2 = int('2')

[2023-06-26T08:15:04.286Z] =================== 2 failed, 48 passed in 404.73s (0:06:44) ===================

log: artifacts-kafka-cluster-upgrade-1115-server-second-deployment-logs.tar.gz artifacts-kafka-cluster-upgrade-1115-server-first-deployment-logs.tar.gz artifacts-kafka-cluster-upgrade-1115-pytest-logs.tar.gz

zhuwenxing avatar Jun 26 '23 10:06 zhuwenxing

/assign @weiliu1031

zhuwenxing avatar Jun 26 '23 10:06 zhuwenxing

It is reproduced when using the helm upgrade and will be fixed in 2.3.x.

So this will be considered a known issue.

zhuwenxing avatar Jun 26 '23 10:06 zhuwenxing

The version has been bumped to 2.3.1, so I think this issue should be solved in the next version. @weiliu1031

zhuwenxing avatar Oct 10 '23 08:10 zhuwenxing

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Dec 16 '23 18:12 stale[bot]

@zhuwenxing any updates?

xiaofan-luan avatar Dec 17 '23 13:12 xiaofan-luan

Not reproduced in 2.4.

zhuwenxing avatar Apr 10 '24 07:04 zhuwenxing