[Bug]: [benchmark][cluster] grpc RpcError: [load_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: master-20220505-333b439a
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): 2.1.0.dev39
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others: queryNode.replicas: 2
Current Behavior
argo task: benchmark-backup-jn2b5 (https://argo-test.milvus.io/workflows/qa/benchmark-backup-jn2b5?tab=workflow)
test yaml: server-configmap: server-cluster-8c16m-querynode-2, client-configmap: testing-load-release
client pod: benchmark-backup-jn2b5-4256292640, client log file: locust_report_2022-05-05_424.log
A few occasional errors appear in the client log during concurrent interface requests:
[2022-05-05 07:48:01,249] [ ERROR] - RPC error: [load_collection], <MilvusException: (code=1, message=collection 432992571087064385 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2022-05-05 07:46:57.077613', 'RPC error': '2022-05-05 07:48:01.249801'}> (pymilvus.decorators:73)
[2022-05-05 07:48:01,260] [ DEBUG] - params of load collection: {'timeout': 30, 'replica_number': 2} (milvus_benchmark.client:490)
[2022-05-05 07:48:01,261] [ ERROR] - RPC error: [wait_for_loading_collection], <MilvusException: (code=1, message=collection 432992571087064385 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2022-05-05 07:48:01.235907', 'RPC error': '2022-05-05 07:48:01.261527'}> (pymilvus.decorators:73)
server:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
benchmark-backup-jn2b5-1-etcd-0 1/1 Running 0 3m40s 10.97.16.100 qa-node013.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-etcd-1 1/1 Running 0 3m40s 10.97.17.204 qa-node014.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-etcd-2 1/1 Running 0 3m39s 10.97.16.102 qa-node013.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-milvus-datacoord-5bf8bb69f6-l8gd6 1/1 Running 0 3m40s 10.97.4.16 qa-node002.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-milvus-datanode-d84d5db9c-44l6f 1/1 Running 0 3m40s 10.97.20.97 qa-node018.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-milvus-indexcoord-698fbd6497-knhsx 1/1 Running 0 3m40s 10.97.4.12 qa-node002.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-milvus-indexnode-68cc8cf588-55jkv 1/1 Running 0 3m40s 10.97.20.96 qa-node018.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-milvus-proxy-6c6b4fb8dc-p772f 1/1 Running 0 3m40s 10.97.4.14 qa-node002.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-milvus-querycoord-84b85f7574-fmsps 1/1 Running 0 3m40s 10.97.4.15 qa-node002.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-milvus-querynode-97f6f7ff8-6vn8n 1/1 Running 0 3m40s 10.97.16.98 qa-node013.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-milvus-querynode-97f6f7ff8-9tbx2 1/1 Running 0 3m40s 10.97.17.202 qa-node014.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-milvus-rootcoord-6cd9b68f9b-cslk5 1/1 Running 0 3m40s 10.97.4.13 qa-node002.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-minio-0 1/1 Running 0 3m39s 10.97.19.151 qa-node016.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-minio-1 1/1 Running 0 3m39s 10.97.12.25 qa-node015.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-minio-2 1/1 Running 0 3m39s 10.97.12.27 qa-node015.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-minio-3 1/1 Running 0 3m39s 10.97.12.29 qa-node015.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-pulsar-bookie-0 1/1 Running 0 3m40s 10.97.9.153 qa-node007.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-pulsar-bookie-1 1/1 Running 0 3m39s 10.97.19.153 qa-node016.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-pulsar-bookie-2 1/1 Running 0 3m39s 10.97.3.252 qa-node001.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-pulsar-bookie-init-9z6z8 0/1 Completed 0 3m40s 10.97.9.150 qa-node007.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-pulsar-broker-0 1/1 Running 0 3m40s 10.97.11.156 qa-node009.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-pulsar-proxy-0 1/1 Running 0 3m40s 10.97.5.197 qa-node003.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-pulsar-pulsar-init-ph25v 0/1 Completed 0 3m40s 10.97.3.247 qa-node001.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-pulsar-recovery-0 1/1 Running 0 3m40s 10.97.12.22 qa-node015.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-pulsar-zookeeper-0 1/1 Running 0 3m40s 10.97.3.249 qa-node001.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-pulsar-zookeeper-1 1/1 Running 0 3m18s 10.97.18.208 qa-node017.zilliz.local <none> <none>
benchmark-backup-jn2b5-1-pulsar-zookeeper-2 1/1 Running 0 2m44s 10.97.12.32 qa-node015.zilliz.local <none> <none>
Expected Behavior
Load collection does not throw an exception.
Steps To Reproduce
1. Create a collection
2. Create an index of ivf_sq8
3. Insert 10 million vectors
4. Flush the collection
5. Build the index with the same params
6. Load the collection with replica number 2
7. Locust concurrent: load 2 replicas, release <- raises error (see the sketch below)
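For reference, a minimal pymilvus 2.1.x sketch approximating steps 6-7 (not the benchmark harness itself); the host/port and the assumption that sift_10m_128_l2 already exists with data inserted and an IVF_SQ8 index built are illustrative only:

import threading

from pymilvus import connections, Collection

connections.connect(host="127.0.0.1", port="19530")  # assumed local deployment
collection = Collection("sift_10m_128_l2")           # assumed to exist with index built

def do_load():
    # step 6: load the collection into 2 replicas with the benchmark's 30s client timeout
    collection.load(replica_number=2, timeout=30)

def do_release():
    # step 7: release while the load above may still be in progress
    collection.release(timeout=30)

# fire load and release concurrently, roughly as the locust tasks do
threads = [threading.Thread(target=do_load), threading.Thread(target=do_release)]
for t in threads:
    t.start()
for t in threads:
    t.join()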
Milvus Log
No response
Anything else?
locust_random_performance:
  collections:
    -
      # collection_name: sift_100m_128_l2
      collection_name: sift_10m_128_l2
      other_fields: float
      ni_per: 50000
      build_index: true
      index_type: ivf_sq8
      index_param:
        nlist: 1024
      load_param:
        replica_number: 2
      task:
        types:
          -
            type: release
            weight: 1
          -
            type: load
            weight: 1
            params:
              replica_number: 2
        connection_num: 1
        clients_num: 20
        spawn_rate: 2
        during_time: 1h
def load(self):
    op = "load"
    params = {"timeout": 30}
    if self.params[op]:
        replica_number = self.params[op]["replica_number"] if "replica_number" in self.params[op] else 1
        params.update({"replica_number": replica_number})
    self.client.load_collection(**params)

def release(self):
    self.client.release_collection()
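For context, a hedged sketch of what self.client presumably does underneath: the benchmark wrapper binds the collection name at construction time, so load_collection(**params) ends up as Collection.load(timeout=30, replica_number=2), matching the "params of load collection" line in the client log. The class and collection names below are illustrative, not the actual milvus_benchmark implementation:

from pymilvus import Collection

class BenchmarkClient:
    def __init__(self, collection_name):
        # the wrapper keeps the collection handle, so the task methods take no name
        self._collection = Collection(collection_name)

    def load_collection(self, timeout=None, replica_number=1):
        self._collection.load(replica_number=replica_number, timeout=timeout)

    def release_collection(self, timeout=None):
        self._collection.release(timeout=timeout)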
/assign @congqixia /unassign
@ThreadDao according to the log, there was a release request before the first load operation is done. May I ask what is the expect behavior of this situation?
We expect that when the release request arrives while the load has not yet completed, the release request times out and the collection is eventually loaded. But can we modify the release exception message in this scenario to distinguish it from timeouts caused by internal server errors?
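As a client-side illustration of the distinction being asked for, a hedged sketch that separates a gRPC deadline (client timeout, the load may still finish on the server) from a MilvusException reported by the server; whether pymilvus surfaces the raw grpc.RpcError or wraps it can depend on the SDK version, and the collection name is illustrative:

import grpc
from pymilvus import Collection, MilvusException

collection = Collection("sift_10m_128_l2")  # illustrative collection name
try:
    collection.load(replica_number=2, timeout=30)
except grpc.RpcError as e:
    if e.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
        # client-side deadline: the server may still complete the load later
        print("load timed out on the client; retry or wait before concluding failure")
except MilvusException as e:
    # server-side error, e.g. "has not been loaded to memory or load failed"
    print(f"server reported load failure: code={e.code}, message={e.message}")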
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
Are you going to fix it soon? @congqixia
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
Leave it alone
/assign @SimFG pls help on it
@xiaofan-luan: GitHub didn't allow me to assign the following users: pls, help, on, it.
Note that only milvus-io members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide
In response to this:
/assign @SimFG pls help on it
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Confirmed the problem with @congqixia and reached the following conclusion. The querycoord processes tasks in a serial queue. The problem occurs because the load has not yet been executed when the release enters the queue; since the load takes a long time to execute, the release operation times out. What needs to be defined here is the behavior of concurrent load and release operations. Because the refactoring of the querycoord has already begun, this will be resolved when the new version of the querycoord arrives. cc @ThreadDao @xiaofan-luan
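A toy sketch (not Milvus code) of the serial-queue behavior described above: the release is queued behind a long-running load, so its client-side timeout expires before the task even starts executing. The durations and timeout are illustrative:

import queue
import threading
import time

tasks = queue.Queue()

def worker():
    # querycoord-like worker: executes queued tasks strictly one at a time
    while True:
        name, duration, done = tasks.get()
        time.sleep(duration)  # simulate task execution time
        done.set()
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()

load_done, release_done = threading.Event(), threading.Event()
tasks.put(("load", 5.0, load_done))        # long-running load (e.g. 10M vectors, 2 replicas)
tasks.put(("release", 0.1, release_done))  # queued behind the load

# the client only waits 2s for the release; the load alone takes 5s, so the
# client sees a timeout even though the server will eventually run the release
if not release_done.wait(timeout=2.0):
    print("release timed out while waiting behind the load in the serial queue")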
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
I will test with the new version of the querycoord.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.