milvus icon indicating copy to clipboard operation
milvus copied to clipboard

[Bug]: [Nightly]Plenty of cases failed with an UnexpectedError, reason: DataCoord 13 is not ready

Open NicoYuan1986 opened this issue 1 year ago • 4 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 72485c9
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    kafka
- SDK version(e.g. pymilvus v2.0.0rc2):2.4.0.dev12
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Plenty of cases failed with an UnexpectedError, reason: DataCoord 13 is not ready.

[2023-04-24T20:18:20.532Z] [2023-04-24 20:17:17 - ERROR - pymilvus.decorators]: RPC error: [create_collection], <MilvusException: (code=1, message=failed to watch channels, code: UnexpectedError, reason: DataCoord 13 is not ready)>, <Time:{'RPC start': '2023-04-24 20:17:17.937854', 'RPC error': '2023-04-24 20:17:17.958027'}> (decorators.py:108)
[2023-04-24T20:18:20.532Z] [2023-04-24 20:17:17 - ERROR - ci_test]: Traceback (most recent call last):
[2023-04-24T20:18:20.532Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper
[2023-04-24T20:18:20.532Z]     res = func(*args, **_kwargs)
[2023-04-24T20:18:20.532Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request
[2023-04-24T20:18:20.532Z]     return func(*arg, **kwargs)
[2023-04-24T20:18:20.532Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 120, in __init__
[2023-04-24T20:18:20.532Z]     conn.create_collection(self._name, schema, shards_num=self._shards_num, **kwargs)
[2023-04-24T20:18:20.532Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 109, in handler
[2023-04-24T20:18:20.532Z]     raise e
[2023-04-24T20:18:20.532Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 105, in handler
[2023-04-24T20:18:20.532Z]     return func(*args, **kwargs)
[2023-04-24T20:18:20.532Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 136, in handler
[2023-04-24T20:18:20.532Z]     ret = func(self, *args, **kwargs)
[2023-04-24T20:18:20.532Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 85, in handler
[2023-04-24T20:18:20.532Z]     raise e
[2023-04-24T20:18:20.532Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 50, in handler
[2023-04-24T20:18:20.532Z]     return func(self, *args, **kwargs)
[2023-04-24T20:18:20.532Z]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 212, in create_collection
[2023-04-24T20:18:20.532Z]     raise MilvusException(status.error_code, status.reason)
[2023-04-24T20:18:20.532Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=failed to watch channels, code: UnexpectedError, reason: DataCoord 13 is not ready)>
[2023-04-24T20:18:20.532Z]  (api_request.py:39)
[2023-04-24T20:18:20.532Z] [2023-04-24 20:17:17 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=failed to watch channels, code: UnexpectedError, reason: DataCoord 13 is not ready)> (api_request.py:40)

Expected Behavior

pass

Steps To Reproduce

No response

Milvus Log

  1. link: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/352/pipeline/155/
  2. log: artifacts-milvus-distributed-kafka-nightly-352-pymilvus-e2e-logs.tar.gz
  3. The error started at this time '[2023-04-24T19:48:53.683Z] [gw0] [ 60%] FAILED testcases/test_search.py::TestSearchPagination::test_search_pagination_with_expression[100-False-float in [+36, 210/2]] '

Anything else?

No response

NicoYuan1986 avatar Apr 25 '23 02:04 NicoYuan1986

milvus log: panic info

2023-04-25T04:16:28.999309229+08:00 stdout F [2023/04/24 20:16:28.999 +00:00] [WARN] [grpcclient/client.go:320] ["ClientBase ReCall grpc first call get error "] [role=rootcoord] [error="err: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n, /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:319 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/rootcoord/client/client.go:237 github.com/milvus-io/milvus/internal/distributed/rootcoord/client.(*Client).DescribeCollection\n/go/src/github.com/milvus-io/milvus/internal/datacoord/server.go:1021 github.com/milvus-io/milvus/internal/datacoord.(*Server).hasCollection\n/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:334 github.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).HasCollection.func1\n/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:40 github.com/milvus-io/milvus/pkg/util/retry.Do\n/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:333 github.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).HasCollection\n/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:369 github.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).CheckShouldDropChannel\n/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:882 github.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).isMarkedDrop\n/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:776 github.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).Reassign\n"]

2023-04-25T04:16:28.999368758+08:00 stdout F [2023/04/24 20:16:28.999 +00:00] [ERROR] [datacoord/handler.go:342] ["datacoord ServerHandler HasCollection finally failed"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).HasCollection\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:342\ngithub.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).CheckShouldDropChannel\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:369\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).isMarkedDrop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:882\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).Reassign\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:776\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).processAck\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:655\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).watchChannelStatesLoop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:739"]

2023-04-25T04:16:28.999430064+08:00 stdout F [2023/04/24 20:16:28.999 +00:00] [FATAL] [logutil/logutil.go:132] [panic] [recover=""datacoord ServerHandler HasCollection finally failed""] [stack="github.com/milvus-io/milvus/pkg/util/logutil.LogPanic\n\t/go/src/github.com/milvus-io/milvus/pkg/util/logutil/logutil.go:132\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:838\ngithub.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).HasCollection\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:343\ngithub.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).CheckShouldDropChannel\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:369\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).isMarkedDrop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:882\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).Reassign\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:776\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).processAck\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:655\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).watchChannelStatesLoop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:739"]

NicoYuan1986 avatar Apr 25 '23 03:04 NicoYuan1986

/assign @jiaoew1991 /unassign

yanliang567 avatar Apr 25 '23 06:04 yanliang567

/assign @smellthemoon /unassign

jiaoew1991 avatar Apr 26 '23 07:04 jiaoew1991

DescribeCollection returned CollectionNotExisted 2 which cause HasCollection retry many times and finally panic. 1 the expected behavior is if CollectionNotExisted, return false, which exactly fix by #23710 in master and #23709 in 2.2.0. @NicoYuan1986 plz help check it.

smellthemoon avatar Apr 27 '23 03:04 smellthemoon

Fixed.

NicoYuan1986 avatar May 09 '23 03:05 NicoYuan1986