milvus icon indicating copy to clipboard operation
milvus copied to clipboard

[Bug]: [kafka] Rpc timeout when listing collections during nightly tests

Open binbinlv opened this issue 2 years ago • 19 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version:
- Deployment mode(standalone or cluster):cluster
- SDK version(e.g. pymilvus v2.0.0rc2):pymilvus 69
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Rpc timeout when listing collections during nightly tests

[2022-06-11T16:13:26.043Z] __ TestCollectionParams.test_collection_shards_num_with_not_default_value[0] ___

[2022-06-11T16:13:26.043Z] [gw3] linux -- Python 3.6.8 /usr/local/bin/python3

[2022-06-11T16:13:26.044Z] 

[2022-06-11T16:13:26.044Z] self = <test_collection.TestCollectionParams object at 0x7f3a867efa90>

[2022-06-11T16:13:26.044Z] shards_num = 0

[2022-06-11T16:13:26.044Z] 

[2022-06-11T16:13:26.044Z]     @pytest.mark.tags(CaseLabel.L0)

[2022-06-11T16:13:26.044Z]     @pytest.mark.parametrize("shards_num", [-256, 0, 10, 256])

[2022-06-11T16:13:26.044Z]     def test_collection_shards_num_with_not_default_value(self, shards_num):

[2022-06-11T16:13:26.044Z]         """

[2022-06-11T16:13:26.044Z]         target:test collection with shards_num

[2022-06-11T16:13:26.044Z]         method:create collection with not default shards_num

[2022-06-11T16:13:26.044Z]         expected: no exception

[2022-06-11T16:13:26.044Z]         """

[2022-06-11T16:13:26.044Z]         self._connect()

[2022-06-11T16:13:26.044Z]         c_name = cf.gen_unique_str(prefix)

[2022-06-11T16:13:26.044Z]         self.collection_wrap.init_collection(c_name, schema=default_schema, shards_num=shards_num,

[2022-06-11T16:13:26.044Z]                                              check_task=CheckTasks.check_collection_property,

[2022-06-11T16:13:26.044Z]                                              check_items={exp_name: c_name, exp_shards_num: shards_num})

[2022-06-11T16:13:26.044Z] >       assert c_name in self.utility_wrap.list_collections()[0]

[2022-06-11T16:13:26.044Z] 

[2022-06-11T16:13:26.044Z] testcases/test_collection.py:919: 

[2022-06-11T16:13:26.044Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

[2022-06-11T16:13:26.044Z] base/utility_wrapper.py:152: in list_collections

[2022-06-11T16:13:26.044Z]     timeout=timeout, using=using).run()

[2022-06-11T16:13:26.044Z] check/func_check.py:33: in run

[2022-06-11T16:13:26.044Z]     result = self.assert_succ(self.succ, True)

[2022-06-11T16:13:26.044Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

[2022-06-11T16:13:26.044Z] 

[2022-06-11T16:13:26.044Z] actual = False, expect = True

[2022-06-11T16:13:26.044Z] 

[2022-06-11T16:13:26.044Z]     @staticmethod

[2022-06-11T16:13:26.044Z]     def assert_succ(actual, expect):

[2022-06-11T16:13:26.044Z] >       assert actual is expect

[2022-06-11T16:13:26.044Z] E       AssertionError

[2022-06-11T16:13:26.044Z] 

[2022-06-11T16:13:26.044Z] check/func_check.py:81: AssertionError

[2022-06-11T16:13:26.044Z] ------------------------------ Captured log setup ------------------------------

[2022-06-11T16:13:26.044Z] [2022-06-11 15:00:49 - INFO - ci_test]: *********************************** setup *********************************** (client_base.py:36)

[2022-06-11T16:13:26.044Z] [2022-06-11 15:00:49 - INFO - ci_test]: [setup_method] Start setup test case test_collection_shards_num_with_not_default_value. (client_base.py:37)

[2022-06-11T16:13:26.044Z] ------------------------------ Captured log call -------------------------------

[2022-06-11T16:13:26.044Z] [2022-06-11 15:00:49 - DEBUG - ci_test]: (api_request)  : [Connections.connect] args: ['default'], kwargs: {'host': 'mdk-586-n-milvus.milvus-ci', 'port': '19530'} (api_request.py:55)

[2022-06-11T16:13:26.044Z] [2022-06-11 15:00:49 - DEBUG - ci_test]: (api_response) : None  (api_request.py:27)

[2022-06-11T16:13:26.044Z] [2022-06-11 15:00:49 - DEBUG - ci_test]: (api_request)  : [Collection] args: ['collection_D81xAbYI', {

[2022-06-11T16:13:26.044Z]   auto_id: False

[2022-06-11T16:13:26.044Z]   description: 

[2022-06-11T16:13:26.044Z]   fields: [{

[2022-06-11T16:13:26.044Z]     name: int64

[2022-06-11T16:13:26.044Z]     description: 

[2022-06-11T16:13:26.044Z]     type: 5

[2022-06-11T16:13:26.044Z]     is_primary: True

[2022-06-11T16:13:26.044Z]     auto_id: False

[2022-06-11T16:13:26.044Z]   }, {

[2022-06-11T16:13:26.044Z]     name: float

[2022-06-11T16:13:26.044Z]     description: 

[2022-06-11T16:13:26.044Z]     type: 10

[2022-06-11T16:13:26.044Z]   }, {

[2022-06-11T16:13:26.044Z]     name: varchar

[2022-06-11T16:13:26.044Z]     description: 

[2022-06-11T16:13:26.044Z]     type: 21

[2022-06-11T16:13:26.044Z]     params: {'max_length': 65535......, kwargs: {'consistency_level': 'Strong'} (api_request.py:55)

[2022-06-11T16:13:26.044Z] [2022-06-11 15:00:51 - DEBUG - ci_test]: (api_response) : <Collection>:

[2022-06-11T16:13:26.044Z] -------------

[2022-06-11T16:13:26.044Z] <name>: collection_D81xAbYI

[2022-06-11T16:13:26.044Z] <partitions>: [{"name": "_default", "collection_name": "collection_D81xAbYI", "description": ""}]

[2022-06-11T16:13:26.044Z] <description>: 

[2022-06-11T16:13:26.044Z] <schema>: {

[2022-06-11T16:13:26.044Z]   auto_id: False

[2022-06-11T16:13:26.044Z]   description: 

[2022-06-11T16:13:26.044Z]   fields: [{

[2022-06-11T16:13:26.044Z]     name: int64

[2022-06-11T16:13:26.044Z]     description: 

[2022-06-11T16:13:26.044Z]     type: 5

[2022-06-11T16:13:26.044Z]     is_primary: True

[2022-06-11T16:13:26.044Z]     a......  (api_request.py:27)

[2022-06-11T16:13:26.044Z] [2022-06-11 15:00:51 - DEBUG - ci_test]: (api_request)  : [list_collections] args: [20, 'default'], kwargs: {} (api_request.py:55)

[2022-06-11T16:13:26.044Z] [2022-06-11 15:01:11 - ERROR - pymilvus.decorators]: grpc RpcError: [list_collections], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-06-11 15:00:51.401561', 'gRPC error': '2022-06-11 15:01:11.402917'}> (decorators.py:86)

[2022-06-11T16:13:26.044Z] [2022-06-11 15:01:11 - ERROR - ci_test]: Traceback (most recent call last):

[2022-06-11T16:13:26.044Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 44, in handler

[2022-06-11T16:13:26.044Z]     return func(self, *args, **kwargs)

[2022-06-11T16:13:26.045Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 87, in handler

[2022-06-11T16:13:26.045Z]     raise e

[2022-06-11T16:13:26.045Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 75, in handler

[2022-06-11T16:13:26.045Z]     return func(*args, **kwargs)

[2022-06-11T16:13:26.045Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/client/grpc_handler.py", line 232, in list_collections

[2022-06-11T16:13:26.045Z]     response = rf.result()

[2022-06-11T16:13:26.045Z]   File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 744, in result

[2022-06-11T16:13:26.045Z]     raise self

[2022-06-11T16:13:26.045Z] grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:

[2022-06-11T16:13:26.045Z] 	status = StatusCode.DEADLINE_EXCEEDED

[2022-06-11T16:13:26.045Z] 	details = "Deadline Exceeded"

[2022-06-11T16:13:26.045Z] 	debug_error_string = "{"created":"@1654959671.402449969","description":"Error received from peer ipv4:10.201.218.89:19530","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Deadline Exceeded","grpc_status":4}"

[2022-06-11T16:13:26.045Z] >

[2022-06-11T16:13:26.045Z] 

[2022-06-11T16:13:26.045Z] During handling of the above exception, another exception occurred:

[2022-06-11T16:13:26.045Z] 

[2022-06-11T16:13:26.045Z] Traceback (most recent call last):

[2022-06-11T16:13:26.045Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 22, in inner_wrapper

[2022-06-11T16:13:26.045Z]     res = func(*args, **kwargs)

[2022-06-11T16:13:26.045Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 56, in api_request

[2022-06-11T16:13:26.045Z]     return func(*arg, **kwargs)

[2022-06-11T16:13:26.045Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/orm/utility.py", line 430, in list_collections

[2022-06-11T16:13:26.045Z]     return _get_connection(using).list_collections(timeout=timeout)

[2022-06-11T16:13:26.045Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 55, in handler

[2022-06-11T16:13:26.045Z]     raise MilvusException(Status.UNEXPECTED_ERROR, "rpc timeout")

[2022-06-11T16:13:26.045Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=rpc timeout)>

[2022-06-11T16:13:26.045Z]  (api_request.py:35)

[2022-06-11T16:13:26.045Z] [2022-06-11 15:01:11 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=rpc timeout)> (api_request.py:36)

Expected Behavior

No timeout

Steps To Reproduce

Running nightly:

Milvus Log

artifacts-milvus-distributed-kafka-master-586-pymilvus-e2e-logs.tar.gz

  1. FAILED timeline: [2022-06-11T15:03:34.904Z] [gw3] [ 6%] FAILED testcases/test_collection.py::TestCollectionParams::test_collection_shards_num_with_not_default_value[0]

Anything else?

No response

binbinlv avatar Jun 13 '22 08:06 binbinlv

Same error during the following nightly: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/587/pipeline

  1. milvus log: artifacts-milvus-distributed-kafka-master-587-pymilvus-e2e-logs.tar.gz

  2. Failed timeline: [2022-06-11T19:01:46.497Z] [gw0] [ 6%] FAILED testcases/test_collection.py::TestCollectionParams::test_collection_shards_num_with_default_value

  3. test logs:

[2022-06-11T20:13:09.258Z] self = <test_collection.TestCollectionParams object at 0x7f6cee13a5c0>
[2022-06-11T20:13:09.258Z] 
[2022-06-11T20:13:09.258Z]     @pytest.mark.tags(CaseLabel.L0)
[2022-06-11T20:13:09.258Z]     def test_collection_shards_num_with_default_value(self):
[2022-06-11T20:13:09.258Z]         """
[2022-06-11T20:13:09.258Z]         target:test collection with shards_num
[2022-06-11T20:13:09.258Z]         method:create collection with shards_num
[2022-06-11T20:13:09.258Z]         expected: no exception
[2022-06-11T20:13:09.258Z]         """
[2022-06-11T20:13:09.258Z]         self._connect()
[2022-06-11T20:13:09.258Z]         c_name = cf.gen_unique_str(prefix)
[2022-06-11T20:13:09.258Z]         self.collection_wrap.init_collection(c_name, schema=default_schema, shards_num=default_shards_num,
[2022-06-11T20:13:09.258Z]                                              check_task=CheckTasks.check_collection_property,
[2022-06-11T20:13:09.259Z]                                              check_items={exp_name: c_name, exp_shards_num: default_shards_num})
[2022-06-11T20:13:09.259Z] >       assert c_name in self.utility_wrap.list_collections()[0]
[2022-06-11T20:13:09.259Z] 
[2022-06-11T20:13:09.259Z] testcases/test_collection.py:904: 
[2022-06-11T20:13:09.259Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-06-11T20:13:09.259Z] base/utility_wrapper.py:152: in list_collections
[2022-06-11T20:13:09.259Z]     timeout=timeout, using=using).run()
[2022-06-11T20:13:09.259Z] check/func_check.py:33: in run
[2022-06-11T20:13:09.259Z]     result = self.assert_succ(self.succ, True)
[2022-06-11T20:13:09.259Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-06-11T20:13:09.259Z] 
[2022-06-11T20:13:09.259Z] actual = False, expect = True
[2022-06-11T20:13:09.259Z] 
[2022-06-11T20:13:09.259Z]     @staticmethod
[2022-06-11T20:13:09.259Z]     def assert_succ(actual, expect):
[2022-06-11T20:13:09.259Z] >       assert actual is expect
[2022-06-11T20:13:09.259Z] E       AssertionError
[2022-06-11T20:13:09.259Z] 
[2022-06-11T20:13:09.259Z] check/func_check.py:81: AssertionError
[2022-06-11T20:13:09.259Z] ------------------------------ Captured log setup ------------------------------
[2022-06-11T20:13:09.259Z] [2022-06-11 19:01:20 - INFO - ci_test]: *********************************** setup *********************************** (client_base.py:36)
[2022-06-11T20:13:09.259Z] [2022-06-11 19:01:20 - INFO - ci_test]: [setup_method] Start setup test case test_collection_shards_num_with_default_value. (client_base.py:37)
[2022-06-11T20:13:09.259Z] ------------------------------ Captured log call -------------------------------
[2022-06-11T20:13:09.259Z] [2022-06-11 19:01:20 - DEBUG - ci_test]: (api_request)  : [Connections.connect] args: ['default'], kwargs: {'host': 'mdk-587-n-milvus.milvus-ci', 'port': '19530'} (api_request.py:55)
[2022-06-11T20:13:09.259Z] [2022-06-11 19:01:20 - DEBUG - ci_test]: (api_response) : None  (api_request.py:27)
[2022-06-11T20:13:09.259Z] [2022-06-11 19:01:20 - DEBUG - ci_test]: (api_request)  : [Collection] args: ['collection_o49PzkiW', {
[2022-06-11T20:13:09.259Z]   auto_id: False
[2022-06-11T20:13:09.259Z]   description: 
[2022-06-11T20:13:09.259Z]   fields: [{
[2022-06-11T20:13:09.259Z]     name: int64
[2022-06-11T20:13:09.259Z]     description: 
[2022-06-11T20:13:09.259Z]     type: 5
[2022-06-11T20:13:09.259Z]     is_primary: True
[2022-06-11T20:13:09.259Z]     auto_id: False
[2022-06-11T20:13:09.259Z]   }, {
[2022-06-11T20:13:09.259Z]     name: float
[2022-06-11T20:13:09.259Z]     description: 
[2022-06-11T20:13:09.259Z]     type: 10
[2022-06-11T20:13:09.259Z]   }, {
[2022-06-11T20:13:09.259Z]     name: varchar
[2022-06-11T20:13:09.259Z]     description: 
[2022-06-11T20:13:09.259Z]     type: 21
[2022-06-11T20:13:09.259Z]     params: {'max_length': 65535......, kwargs: {'consistency_level': 'Strong'} (api_request.py:55)
[2022-06-11T20:13:09.259Z] [2022-06-11 19:01:22 - DEBUG - ci_test]: (api_response) : <Collection>:
[2022-06-11T20:13:09.259Z] -------------
[2022-06-11T20:13:09.259Z] <name>: collection_o49PzkiW
[2022-06-11T20:13:09.259Z] <partitions>: [{"name": "_default", "collection_name": "collection_o49PzkiW", "description": ""}]
[2022-06-11T20:13:09.259Z] <description>: 
[2022-06-11T20:13:09.259Z] <schema>: {
[2022-06-11T20:13:09.259Z]   auto_id: False
[2022-06-11T20:13:09.259Z]   description: 
[2022-06-11T20:13:09.259Z]   fields: [{
[2022-06-11T20:13:09.259Z]     name: int64
[2022-06-11T20:13:09.259Z]     description: 
[2022-06-11T20:13:09.259Z]     type: 5
[2022-06-11T20:13:09.259Z]     is_primary: True
[2022-06-11T20:13:09.259Z]     a......  (api_request.py:27)
[2022-06-11T20:13:09.259Z] [2022-06-11 19:01:22 - DEBUG - ci_test]: (api_request)  : [list_collections] args: [20, 'default'], kwargs: {} (api_request.py:55)
[2022-06-11T20:13:09.259Z] [2022-06-11 19:01:42 - ERROR - pymilvus.decorators]: grpc RpcError: [list_collections], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-06-11 19:01:22.409004', 'gRPC error': '2022-06-11 19:01:42.410403'}> (decorators.py:86)
[2022-06-11T20:13:09.259Z] [2022-06-11 19:01:42 - ERROR - ci_test]: Traceback (most recent call last):
[2022-06-11T20:13:09.259Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 44, in handler
[2022-06-11T20:13:09.259Z]     return func(self, *args, **kwargs)
[2022-06-11T20:13:09.259Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 87, in handler
[2022-06-11T20:13:09.259Z]     raise e
[2022-06-11T20:13:09.259Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 75, in handler
[2022-06-11T20:13:09.259Z]     return func(*args, **kwargs)
[2022-06-11T20:13:09.259Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/client/grpc_handler.py", line 232, in list_collections
[2022-06-11T20:13:09.259Z]     response = rf.result()
[2022-06-11T20:13:09.259Z]   File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 744, in result
[2022-06-11T20:13:09.259Z]     raise self
[2022-06-11T20:13:09.259Z] grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
[2022-06-11T20:13:09.259Z] 	status = StatusCode.DEADLINE_EXCEEDED
[2022-06-11T20:13:09.259Z] 	details = "Deadline Exceeded"
[2022-06-11T20:13:09.259Z] 	debug_error_string = "{"created":"@1654974102.410064803","description":"Error received from peer ipv4:10.201.236.64:19530","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Deadline Exceeded","grpc_status":4}"
[2022-06-11T20:13:09.259Z] >
[2022-06-11T20:13:09.259Z] 
[2022-06-11T20:13:09.259Z] During handling of the above exception, another exception occurred:
[2022-06-11T20:13:09.259Z] 
[2022-06-11T20:13:09.259Z] Traceback (most recent call last):
[2022-06-11T20:13:09.259Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 22, in inner_wrapper
[2022-06-11T20:13:09.259Z]     res = func(*args, **kwargs)
[2022-06-11T20:13:09.259Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 56, in api_request
[2022-06-11T20:13:09.259Z]     return func(*arg, **kwargs)
[2022-06-11T20:13:09.259Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/orm/utility.py", line 430, in list_collections
[2022-06-11T20:13:09.259Z]     return _get_connection(using).list_collections(timeout=timeout)
[2022-06-11T20:13:09.259Z]   File "/usr/local/lib/python3.6/site-packages/pymilvus/decorators.py", line 55, in handler
[2022-06-11T20:13:09.259Z]     raise MilvusException(Status.UNEXPECTED_ERROR, "rpc timeout")
[2022-06-11T20:13:09.259Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=rpc timeout)>
[2022-06-11T20:13:09.259Z]  (api_request.py:35)
[2022-06-11T20:13:09.259Z] [2022-06-11 19:01:42 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=rpc timeout)> (api_request.py:36)

binbinlv avatar Jun 13 '22 08:06 binbinlv

It has been reproduced for several nightly runnings, so set it as urgent.

binbinlv avatar Jun 13 '22 09:06 binbinlv

postpone to 2.1 GA

yanliang567 avatar Jun 14 '22 09:06 yanliang567

ShowCollection RPC seems doesn't reach to rootcoord, the request of the corresponding messageID does not find within rootcoord

jaime0815 avatar Jun 14 '22 12:06 jaime0815

showCollectionsTask of proxy be canceled until context timeout, so it likes a deadlock or scheduling issue for the task. I tried to run this case 1000 times with the default parallel number on the local environment, but the issue can not be reproduced.

pytest --flake-finder --flake-runs=100 -n auto  testcases/test_collection.py::TestCollectionParams::test_collection_shards_num_with_default_value

jaime0815 avatar Jun 14 '22 12:06 jaime0815

there is a normal show collections log, we can see the request of the same msgID with proxy and rootcoord

./mdk-587-n-milvus-rootcoord-546f9cd966-2mwt6_milvus-ci_rootcoord-a627ce2cbab632cad3c8032c8888016c3c5db83530abaa2b2f4954b0bad096cc.log:7665:[2022/06/11 19:01:22.406 +00:00] [DEBUG] [root_coord.go:1500] [ShowCollections] [role=rootcoord] [dbname=] [msgID=433841495717714268]
./mdk-587-n-milvus-rootcoord-546f9cd966-2mwt6_milvus-ci_rootcoord-a627ce2cbab632cad3c8032c8888016c3c5db83530abaa2b2f4954b0bad096cc.log:7667:[2022/06/11 19:01:22.406 +00:00] [DEBUG] [root_coord.go:1519] ["ShowCollections success"] [role=rootcoord] [dbname=] ["num of collections"=3] [msgID=433841495717714268]
./mdk-587-n-milvus-proxy-77f98865dc-xvfcv_milvus-ci_proxy-94b9fe0959a992696162b19ce3f74d1d3db72b7c168959df43508d1cba82898e.log:5107:[2022/06/11 19:01:20.469 +00:00] [DEBUG] [impl.go:857] ["ShowCollections enqueued"] [role=proxy] [MsgID=433841495717714268] [DbName=] [TimeStamp=0] [ShowType=All] [CollectionNames="[]"]
./mdk-587-n-milvus-proxy-77f98865dc-xvfcv_milvus-ci_proxy-94b9fe0959a992696162b19ce3f74d1d3db72b7c168959df43508d1cba82898e.log:5139:[2022/06/11 19:01:22.405 +00:00] [DEBUG] [impl.go:888] ["ShowCollections Done"] [role=proxy] [MsgID=433841495717714268] [DbName=] [TimeStamp=0] [ShowType=All] [len(CollectionNames)=0] [num_collections=3]

this is logs about list collection failed

./mdk-587-n-milvus-proxy-77f98865dc-xvfcv_milvus-ci_proxy-94b9fe0959a992696162b19ce3f74d1d3db72b7c168959df43508d1cba82898e.log:5191:[2022/06/11 19:01:24.416 +00:00] [DEBUG] [impl.go:857] ["ShowCollections enqueued"] [role=proxy] [MsgID=433841495717714275] [DbName=] [TimeStamp=0] [ShowType=All] [CollectionNames="[]"]
./mdk-587-n-milvus-proxy-77f98865dc-xvfcv_milvus-ci_proxy-94b9fe0959a992696162b19ce3f74d1d3db72b7c168959df43508d1cba82898e.log:5369:[2022/06/11 19:01:44.416 +00:00] [WARN] [impl.go:868] ["ShowCollections failed to WaitToFinish"] [error="proxy TaskCondition context Done"] [role=proxy] [MsgID=433841495717714275] [DbName=] [TimeStamp=0] [ShowType=All] [CollectionNames="[]"]

jaime0815 avatar Jun 14 '22 12:06 jaime0815

DescribeCollection has been executed before ListCollection, it also fails. the reason is a CreateCollection request is executing before DescribeCollection within the task scheduler queue, CreateCollection costs 2.25 minutes, it will be led to the next requests timeout of waiting in the queue.

./mdk-587-n-milvus-rootcoord-546f9cd966-2mwt6_milvus-ci_rootcoord-a627ce2cbab632cad3c8032c8888016c3c5db83530abaa2b2f4954b0bad096cc.log:7685:[2022/06/11 19:01:22.408 +00:00] [DEBUG] [task.go:139] ["collection name -> id"] ["collection name"=collection_AEDZNlIh] [collection_id=433841525851160577] ["default partition id"=433841525851160578]
./mdk-587-n-milvus-rootcoord-546f9cd966-2mwt6_milvus-ci_rootcoord-a627ce2cbab632cad3c8032c8888016c3c5db83530abaa2b2f4954b0bad096cc.log:10767:[2022/06/11 19:03:47.714 +00:00] [DEBUG] [root_coord.go:1369] ["CreateCollection success"] [role=rootcoord] ["collection name"=collection_AEDZNlIh] [msgID=433841495717714272]

jaime0815 avatar Jun 15 '22 03:06 jaime0815

CreateCollection too slowly caused by broadcast message to Kafka, but it is no more log to track.

the following is creating the collection log within rootcoord

./mdk-587-n-milvus-proxy-77f98865dc-xvfcv_milvus-ci_proxy-94b9fe0959a992696162b19ce3f74d1d3db72b7c168959df43508d1cba82898e.log:4953:[2022/06/11 19:01:18.410 +00:00] [DEBUG] [impl.go:923] ["CreatePartition received"] [traceID=744fa506d989afb2] [role=proxy] [db=] [collection=coll__y[2022/06/11 19:01:22.408 +00:00] [DEBUG] [task.go:139] ["collection name -> id"] ["collection name"=collection_AEDZNlIh] [collection_id=433841525851160577] ["default partition id"=433841525851160578]
[2022/06/11 19:01:22.483 +00:00] [INFO] [timeticksync.go:233] ["ddl not finished"] ["source id"=11] ["curr ts"=433841525851160577] [ddlMinTs=433841525510635524] [reason=gRPC]
[2022/06/11 19:01:24.394 +00:00] [DEBUG] [timeticksync.go:179] ["add ddl timetick"] [minTs=433841525851160582] [ts=433841525851160582] [len(ddlTsSet)=1] [reason="create collection 433841525851160577"]
[2022/06/11 19:03:47.132 +00:00] [DEBUG] [timeticksync.go:190] ["remove ddl timetick"] [ts=433841525851160582] [len(ddlTsSet)=0] [reason="create collection 433841525851160577"]
[2022/06/11 19:03:47.132 +00:00] [DEBUG] [timeticksync.go:190] ["remove ddl timetick"] [ts=433841525851160582] [len(ddlTsSet)=0] [reason="create collection 433841525851160577"]

jaime0815 avatar Jun 15 '22 03:06 jaime0815

Same issue occurs in the following nightly again:

https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/615/pipeline/190/

milvus logs: artifacts-milvus-distributed-kafka-master-615-pymilvus-e2e-logs.tar.gz

collection name: compact_n6v8Gg6j

binbinlv avatar Jun 27 '22 04:06 binbinlv

It occurs again in the nightly for 2.1.0 branch: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-release-nightly/detail/milvus-release-nightly/134/pipeline

  1. milvus log: milvus-distributed-kafka-134-pymilvus-e2e-logs.tar.gz

  2. collection name: collection_6fHdt4bH

  3. FAILED time [2022-07-04T23:48:09.895Z] [gw3] [ 5%] FAILED testcases/test_collection.py::TestCollectionParams::test_collection_shards_num_with_not_default_value[10]

binbinlv avatar Jul 05 '22 02:07 binbinlv

The same reason that CreateCollection is too slowly caused by broadcast message to Kafka.

23:45:08.253 start drop collection collection_tUV3lzup 23:45:08.265 start create collection collection_6fHdt4bH 23:45:08.281 start create collection collection_YvL6gOsQ 23:45:08.295 ListCollection enqueued 23:45:18.048 create collection collection_6fHdt4bH finish (cost 10s) 23:45:28.253 drop collection collection_tUV3lzup fail 23:45:28.295 ListCollection fail 23:47:58.227 create collection collection_YvL6gOsQ finish(cost 50s)

[2022/07/04 23:45:08.295 +00:00] [DEBUG] [impl.go:857] ["ShowCollections enqueued"] [role=proxy] [MsgID=434366898372084180] [DbName=] [TimeStamp=0] [ShowType=All] [CollectionNames="[]"]
[2022/07/04 23:45:28.295 +00:00] [WARN] [impl.go:868] ["ShowCollections failed to WaitToFinish"] [error="proxy TaskCondition context Done"] [role=proxy] [MsgID=434366898372084180] [DbName=] [TimeStamp=0] [ShowType=All] [CollectionNames="[]"]


[2022/07/04 23:45:08.265 +00:00] [DEBUG] [task.go:153] ["collection name -> id"] ["collection name"=collection_6fHdt4bH] [collection_id=434366921637363713] ["default partition id"=434366921637363714]
[2022/07/04 23:45:08.265 +00:00] [DEBUG] [timeticksync.go:179] ["add ddl timetick"] [minTs=434366921637363716] [ts=434366921637363716] [len(ddlTsSet)=1] [reason="create collection 434366921637363713"]
[2022/07/04 23:45:18.048 +00:00] [DEBUG] [timeticksync.go:190] ["remove ddl timetick"] [ts=434366921637363716] [len(ddlTsSet)=0] [reason="create collection 434366921637363713"]


[2022/07/04 23:45:08.281 +00:00] [DEBUG] [task.go:153] ["collection name -> id"] ["collection name"=collection_YvL6gOsQ] [collection_id=434366921637363715] ["default partition id"=434366921637363716]
[2022/07/04 23:45:18.048 +00:00] [DEBUG] [timeticksync.go:179] ["add ddl timetick"] [minTs=434366921637363722] [ts=434366921637363722] [len(ddlTsSet)=1] [reason="create collection 434366921637363715"]
[2022/07/04 23:47:58.227 +00:00] [DEBUG] [timeticksync.go:190] ["remove ddl timetick"] [ts=434366921637363722] [len(ddlTsSet)=0] [reason="create collection 434366921637363715"]

2022/07/04 23:45:08.253 +00:00] [DEBUG] [impl.go:304] ["DropCollection enqueued"] [traceID=5de64e3da0a8b4b] [role=proxy] [MsgID=434366901504707748] [BeginTs=434366921637363717] [EndTs=434366921637363717] [db=] [collection=collection_tUV3lzup]
[2022/07/04 23:45:18.064 +00:00] [DEBUG] [root_coord.go:1543] [DropCollection] [role=rootcoord] ["collection name"=collection_tUV3lzup] [msgID=434366901504707748]
[2022/07/04 23:45:18.052 +00:00] [INFO] [impl.go:296] ["releaseCollectionRequest received"] [role=querycoord] [collectionID=434366919579533315] [msgID=0]
[2022/07/04 23:45:28.253 +00:00] [WARN] [impl.go:314] ["DropCollection failed to WaitToFinish"] [error="proxy TaskCondition context Done"] [traceID=5de64e3da0a8b4b] [role=proxy] [MsgID=434366901504707748] [BeginTs=434366921637363717] [EndTs=434366921637363717] [db=] [collection=collection_tUV3lzup]

jaime0815 avatar Jul 08 '22 07:07 jaime0815

It occurs again in the nightly for 2.1.0 branch: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-release-nightly/detail/milvus-release-nightly/150/pipeline/160

  1. milvus log: milvus-distributed-kafka-150-pymilvus-e2e-logs.tar.gz
  2. collection name: collection_QYeDe3cX
  3. Failed time: [2022-07-12T23:47:53.670Z] [gw5] [ 4%] FAILED testcases/test_collection.py::TestCollectionParams::test_collection_binary

binbinlv avatar Jul 13 '22 02:07 binbinlv

It occurs again in the nightly for 2.1.0 branch: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-release-nightly/detail/milvus-release-nightly/157/pipeline/179

  1. milvus log: milvus-distributed-kafka-157-pymilvus-e2e-logs.tar.gz
  2. collection name: collection_48bBHm3t
  3. Failed time: [2022-07-16T20:48:10.124Z] [gw3] [ 5%] FAILED testcases/test_collection.py::TestCollectionParams::test_collection_shards_num_with_not_default_value[0]

binbinlv avatar Jul 18 '22 02:07 binbinlv

It occurs again in the nightly for 2.1.0 branch: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-release-nightly/detail/milvus-release-nightly/158/pipeline/179

  1. milvus log: milvus-distributed-kafka-158-pymilvus-e2e-logs.tar.gz
  2. collection name: collection_143B5LuA
  3. Failed time: [2022-07-16T23:46:21.736Z] [gw5] [ 4%] FAILED testcases/test_collection.py::TestCollectionParams::test_collection_binary

binbinlv avatar Jul 18 '22 02:07 binbinlv

RPC timeout for "create_alias". Considering this is collection related interface rpc timeout too, so could you help to have a look first to check whether the root cause is the same with this issue? If not, I will open another issue, thanks. @jaime0815

https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/668/pipeline/157

  1. milvus log:artifacts-milvus-distributed-kafka-master-668-pymilvus-e2e-logs.tar.gz
  2. collection name: collection_p4yoS4ks
  3. Failed time:[2022-07-21T19:01:00.230Z] [gw2] [ 2%] FAILED testcases/test_alias.py::TestAliasOperationInvalid::test_alias_create_duplication_alias

binbinlv avatar Jul 25 '22 02:07 binbinlv

It occurs again in the nightly for 2.1.0 branch:

  1. milvus log: milvus-distributed-kafka-178-pymilvus-e2e-logs.tar.gz
  2. collection name: collection_nG72GkTk
  3. Failed time: [2022-07-25T23:48:30.014Z] [gw5] [ 5%] FAILED testcases/test_collection.py::TestCollectionParams::test_collection_binary

binbinlv avatar Jul 26 '22 02:07 binbinlv

Same issue in the master nightly: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/12/pipeline/155

(1) milvus logs: artifacts-milvus-distributed-kafka-nightly-12-pymilvus-e2e-logs.tar.gz (2) collection name: collection_z5qNI4cp (3) FAILED time: [2022-08-02T19:05:45.141Z] [gw4] [ 7%] FAILED testcases/test_collection.py::TestCollectionParams::test_collection_shards_num_with_not_default_value[10]

binbinlv avatar Aug 03 '22 01:08 binbinlv

Same issue in the master nightly: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/15/pipeline/171 (1) milvus log: artifacts-milvus-distributed-kafka-nightly-15-pymilvus-e2e-logs.tar.gz (2) collection name: collection_hfpO0kIJ (3) FAILED time: [2022-08-04T15:05:25.199Z] [gw2] [ 6%] FAILED testcases/test_collection.py::TestCollectionParams::test_collection_shards_num_with_not_default_value[0]

binbinlv avatar Aug 05 '22 03:08 binbinlv

@jaime0815 do we have reason or fix about broadcast too slow? if we concurrent do the broad case it shouldn't be too slow right?

xiaofan-luan avatar Aug 10 '22 15:08 xiaofan-luan

Same issue in the master nightly: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/38/pipeline/155

  1. milvus log: artifacts-milvus-distributed-kafka-nightly-38-pymilvus-e2e-logs.tar.gz
  2. collection name: collection_luYTyndk
  3. Failed time: [2022-08-15T15:04:16.740Z] [gw0] [ 3%] FAILED testcases/test_collection.py::TestCollectionParams::test_collection_binary

binbinlv avatar Aug 16 '22 02:08 binbinlv

I think this is the same with #18900, so close this one.

binbinlv avatar Sep 01 '22 07:09 binbinlv

@jaime0815 do we have reason or fix about broadcast too slow? if we concurrent do the broad case it shouldn't be too slow right?

It is possible to broadcast too slow at any time if mq service has GC、network issues etc, we should cover this situation.

The reason why this issue occurs frequently is all request tasks will be executed in one task queue, one request task costs a long time and it will cause a timeout for the next tasks easily.

The solution is following:

  1. add concurrency
  2. add a fast fail mechanism to avoid failing for other request tasks, that is add a limitation time for each request.
  3. ensure mq has a stable performance as far as possible.

cc @weiliu1031

jaime0815 avatar Sep 06 '22 08:09 jaime0815