milvus
milvus copied to clipboard
[Bug]: [Nightly]Plenty of cases failed with an UnexpectedError, reason: DataCoord 13 is not ready
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: 72485c9
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2):2.4.0.dev12
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
Plenty of cases failed with an UnexpectedError, reason: DataCoord 13 is not ready.
[2023-04-24T20:18:20.532Z] [2023-04-24 20:17:17 - ERROR - pymilvus.decorators]: RPC error: [create_collection], <MilvusException: (code=1, message=failed to watch channels, code: UnexpectedError, reason: DataCoord 13 is not ready)>, <Time:{'RPC start': '2023-04-24 20:17:17.937854', 'RPC error': '2023-04-24 20:17:17.958027'}> (decorators.py:108)
[2023-04-24T20:18:20.532Z] [2023-04-24 20:17:17 - ERROR - ci_test]: Traceback (most recent call last):
[2023-04-24T20:18:20.532Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper
[2023-04-24T20:18:20.532Z] res = func(*args, **_kwargs)
[2023-04-24T20:18:20.532Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request
[2023-04-24T20:18:20.532Z] return func(*arg, **kwargs)
[2023-04-24T20:18:20.532Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 120, in __init__
[2023-04-24T20:18:20.532Z] conn.create_collection(self._name, schema, shards_num=self._shards_num, **kwargs)
[2023-04-24T20:18:20.532Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 109, in handler
[2023-04-24T20:18:20.532Z] raise e
[2023-04-24T20:18:20.532Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 105, in handler
[2023-04-24T20:18:20.532Z] return func(*args, **kwargs)
[2023-04-24T20:18:20.532Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 136, in handler
[2023-04-24T20:18:20.532Z] ret = func(self, *args, **kwargs)
[2023-04-24T20:18:20.532Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 85, in handler
[2023-04-24T20:18:20.532Z] raise e
[2023-04-24T20:18:20.532Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 50, in handler
[2023-04-24T20:18:20.532Z] return func(self, *args, **kwargs)
[2023-04-24T20:18:20.532Z] File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 212, in create_collection
[2023-04-24T20:18:20.532Z] raise MilvusException(status.error_code, status.reason)
[2023-04-24T20:18:20.532Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=failed to watch channels, code: UnexpectedError, reason: DataCoord 13 is not ready)>
[2023-04-24T20:18:20.532Z] (api_request.py:39)
[2023-04-24T20:18:20.532Z] [2023-04-24 20:17:17 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=failed to watch channels, code: UnexpectedError, reason: DataCoord 13 is not ready)> (api_request.py:40)
Expected Behavior
pass
Steps To Reproduce
No response
Milvus Log
- link: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/352/pipeline/155/
- log: artifacts-milvus-distributed-kafka-nightly-352-pymilvus-e2e-logs.tar.gz
- The error started at this time '[2023-04-24T19:48:53.683Z] [gw0] [ 60%] FAILED testcases/test_search.py::TestSearchPagination::test_search_pagination_with_expression[100-False-float in [+36, 210/2]] '
Anything else?
No response
milvus log: panic info
2023-04-25T04:16:28.999309229+08:00 stdout F [2023/04/24 20:16:28.999 +00:00] [WARN] [grpcclient/client.go:320] ["ClientBase ReCall grpc first call get error "] [role=rootcoord] [error="err: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n, /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:319 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/rootcoord/client/client.go:237 github.com/milvus-io/milvus/internal/distributed/rootcoord/client.(*Client).DescribeCollection\n/go/src/github.com/milvus-io/milvus/internal/datacoord/server.go:1021 github.com/milvus-io/milvus/internal/datacoord.(*Server).hasCollection\n/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:334 github.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).HasCollection.func1\n/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:40 github.com/milvus-io/milvus/pkg/util/retry.Do\n/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:333 github.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).HasCollection\n/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:369 github.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).CheckShouldDropChannel\n/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:882 github.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).isMarkedDrop\n/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:776 github.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).Reassign\n"]
2023-04-25T04:16:28.999368758+08:00 stdout F [2023/04/24 20:16:28.999 +00:00] [ERROR] [datacoord/handler.go:342] ["datacoord ServerHandler HasCollection finally failed"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).HasCollection\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:342\ngithub.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).CheckShouldDropChannel\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:369\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).isMarkedDrop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:882\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).Reassign\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:776\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).processAck\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:655\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).watchChannelStatesLoop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:739"]
2023-04-25T04:16:28.999430064+08:00 stdout F [2023/04/24 20:16:28.999 +00:00] [FATAL] [logutil/logutil.go:132] [panic] [recover=""datacoord ServerHandler HasCollection finally failed""] [stack="github.com/milvus-io/milvus/pkg/util/logutil.LogPanic\n\t/go/src/github.com/milvus-io/milvus/pkg/util/logutil/logutil.go:132\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:838\ngithub.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).HasCollection\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:343\ngithub.com/milvus-io/milvus/internal/datacoord.(*ServerHandler).CheckShouldDropChannel\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/handler.go:369\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).isMarkedDrop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:882\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).Reassign\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:776\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).processAck\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:655\ngithub.com/milvus-io/milvus/internal/datacoord.(*ChannelManager).watchChannelStatesLoop\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/channel_manager.go:739"]
/assign @jiaoew1991 /unassign
/assign @smellthemoon /unassign
DescribeCollection returned CollectionNotExisted
which cause HasCollection retry many times and finally panic.
the expected behavior is if CollectionNotExisted, return false, which exactly fix by #23710 in master and #23709 in 2.2.0.
@NicoYuan1986 plz help check it.
Fixed.