[Bug]: [benchmark][cluster] grpc RpcError: [load_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>

ThreadDao opened this issue 2 years ago • 10 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master-20220505-333b439a
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): 2.1.0.dev39
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others: queryNode.replicas: 2

Current Behavior

argo task: benchmark-backup-jn2b5 (https://argo-test.milvus.io/workflows/qa/benchmark-backup-jn2b5?tab=workflow)

test yaml:
- server-configmap: server-cluster-8c16m-querynode-2
- client-configmap: testing-load-release

client pod: benchmark-backup-jn2b5-4256292640
client log file: locust_report_2022-05-05_424.log

A few occasional errors appear in the client log during concurrent interface requests:

[2022-05-05 07:48:01,249] [   ERROR] - RPC error: [load_collection], <MilvusException: (code=1, message=collection 432992571087064385 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2022-05-05 07:46:57.077613', 'RPC error': '2022-05-05 07:48:01.249801'}> (pymilvus.decorators:73)
[2022-05-05 07:48:01,260] [   DEBUG] - params of load collection: {'timeout': 30, 'replica_number': 2} (milvus_benchmark.client:490)
[2022-05-05 07:48:01,261] [   ERROR] - RPC error: [wait_for_loading_collection], <MilvusException: (code=1, message=collection 432992571087064385 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2022-05-05 07:48:01.235907', 'RPC error': '2022-05-05 07:48:01.261527'}> (pymilvus.decorators:73)

server:

NAME                                                          READY   STATUS      RESTARTS   AGE     IP             NODE                      NOMINATED NODE   READINESS GATES
benchmark-backup-jn2b5-1-etcd-0                               1/1     Running     0          3m40s   10.97.16.100   qa-node013.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-etcd-1                               1/1     Running     0          3m40s   10.97.17.204   qa-node014.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-etcd-2                               1/1     Running     0          3m39s   10.97.16.102   qa-node013.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-milvus-datacoord-5bf8bb69f6-l8gd6    1/1     Running     0          3m40s   10.97.4.16     qa-node002.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-milvus-datanode-d84d5db9c-44l6f      1/1     Running     0          3m40s   10.97.20.97    qa-node018.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-milvus-indexcoord-698fbd6497-knhsx   1/1     Running     0          3m40s   10.97.4.12     qa-node002.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-milvus-indexnode-68cc8cf588-55jkv    1/1     Running     0          3m40s   10.97.20.96    qa-node018.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-milvus-proxy-6c6b4fb8dc-p772f        1/1     Running     0          3m40s   10.97.4.14     qa-node002.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-milvus-querycoord-84b85f7574-fmsps   1/1     Running     0          3m40s   10.97.4.15     qa-node002.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-milvus-querynode-97f6f7ff8-6vn8n     1/1     Running     0          3m40s   10.97.16.98    qa-node013.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-milvus-querynode-97f6f7ff8-9tbx2     1/1     Running     0          3m40s   10.97.17.202   qa-node014.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-milvus-rootcoord-6cd9b68f9b-cslk5    1/1     Running     0          3m40s   10.97.4.13     qa-node002.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-minio-0                              1/1     Running     0          3m39s   10.97.19.151   qa-node016.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-minio-1                              1/1     Running     0          3m39s   10.97.12.25    qa-node015.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-minio-2                              1/1     Running     0          3m39s   10.97.12.27    qa-node015.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-minio-3                              1/1     Running     0          3m39s   10.97.12.29    qa-node015.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-pulsar-bookie-0                      1/1     Running     0          3m40s   10.97.9.153    qa-node007.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-pulsar-bookie-1                      1/1     Running     0          3m39s   10.97.19.153   qa-node016.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-pulsar-bookie-2                      1/1     Running     0          3m39s   10.97.3.252    qa-node001.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-pulsar-bookie-init-9z6z8             0/1     Completed   0          3m40s   10.97.9.150    qa-node007.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-pulsar-broker-0                      1/1     Running     0          3m40s   10.97.11.156   qa-node009.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-pulsar-proxy-0                       1/1     Running     0          3m40s   10.97.5.197    qa-node003.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-pulsar-pulsar-init-ph25v             0/1     Completed   0          3m40s   10.97.3.247    qa-node001.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-pulsar-recovery-0                    1/1     Running     0          3m40s   10.97.12.22    qa-node015.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-pulsar-zookeeper-0                   1/1     Running     0          3m40s   10.97.3.249    qa-node001.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-pulsar-zookeeper-1                   1/1     Running     0          3m18s   10.97.18.208   qa-node017.zilliz.local   <none>           <none>
benchmark-backup-jn2b5-1-pulsar-zookeeper-2                   1/1     Running     0          2m44s   10.97.12.32    qa-node015.zilliz.local   <none>           <none>

Expected Behavior

Load collection does not throw an exception.

Steps To Reproduce

1. create collection
2. create index of ivf_sq8
3. insert 10 million vectors
4. flush collection
5. build index with the same params
6. load collection with replica number 2
7. locust concurrent: load with 2 replicas, release <- raises the error (see the minimal sketch after this list)
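
For reference, a minimal sketch of the concurrent pattern in step 7, outside the locust harness. This is not the benchmark code; it assumes pymilvus >= 2.1, a Milvus instance reachable at localhost:19530, and an already-built collection named sift_10m_128_l2:

    # Hedged reproduction sketch: issue load (2 replicas) and release concurrently.
    import threading

    from pymilvus import Collection, connections

    connections.connect(host="localhost", port="19530")
    collection = Collection("sift_10m_128_l2")

    def load():
        # Same parameters as the benchmark: 2 replicas, 30 s client-side timeout.
        collection.load(replica_number=2, timeout=30)

    def release():
        # Release fired while the load above may still be in progress.
        collection.release(timeout=30)

    threads = [threading.Thread(target=load), threading.Thread(target=release)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

With clients_num: 20 in the benchmark config, many such pairs run in parallel, so a release regularly arrives while a load is still in progress.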

Milvus Log

No response

Anything else?

locust_random_performance:
  collections:
    -
      # collection_name: sift_100m_128_l2
      collection_name: sift_10m_128_l2
      other_fields: float
      ni_per: 50000
      build_index: true
      index_type: ivf_sq8
      index_param:
        nlist: 1024
      load_param:
        replica_number: 2
      task:
        types:
          -
            type: release
            weight: 1
          -
            type: load
            weight: 1
            params:
              replica_number: 2
        connection_num: 1
        clients_num: 20
        spawn_rate: 2
        during_time: 1h

    def load(self):
        op = "load"
        params = {"timeout": 30}
        # Pick up replica_number from the task params; default to 1 replica if absent.
        if self.params.get(op):
            replica_number = self.params[op].get("replica_number", 1)
            params.update({"replica_number": replica_number})
        self.client.load_collection(**params)

    def release(self):
        # Release is issued with no explicit timeout.
        self.client.release_collection()
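
For context, a rough sketch of how the equal task weights in the config above could drive concurrent load/release tasks in locust. This is not the actual milvus_benchmark code: the MilvusWrapper class is a hypothetical stand-in for the benchmark's client wrapper, and the host/port values are assumptions.

    from locust import User, constant, task
    from pymilvus import Collection, connections

    class MilvusWrapper:
        # Hypothetical stand-in for milvus_benchmark.client; assumes localhost:19530.
        def __init__(self, name):
            connections.connect(host="localhost", port="19530")
            self.collection = Collection(name)

        def load_collection(self, timeout=30, replica_number=1):
            self.collection.load(replica_number=replica_number, timeout=timeout)

        def release_collection(self):
            self.collection.release()

    class LoadReleaseUser(User):
        wait_time = constant(0)

        def on_start(self):
            self.client = MilvusWrapper("sift_10m_128_l2")

        @task(1)  # mirrors "type: load, weight: 1" in the config above
        def load(self):
            self.client.load_collection(timeout=30, replica_number=2)

        @task(1)  # mirrors "type: release, weight: 1" in the config above
        def release(self):
            self.client.release_collection()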

ThreadDao · May 05 '22 09:05

/assign @congqixia /unassign

yanliang567 · May 05 '22 13:05

@ThreadDao according to the log, there was a release request before the first load operation was done. May I ask what the expected behavior is in this situation?

congqixia · May 06 '22 06:05

@ThreadDao according to the log, there was a release request before the first load operation was done. May I ask what the expected behavior is in this situation?

What we expect is: if the release request arrives while the load has not yet completed, the release request times out and the collection is eventually loaded. But can we modify the release exception message in this scenario to distinguish it from timeouts caused by internal server errors?

ThreadDao · May 07 '22 07:05

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] · Jun 06 '22 07:06

Are you going to fix it soon? @congqixia

ThreadDao · Jun 06 '22 08:06

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] · Jul 22 '22 01:07

Leave it alone

ThreadDao · Jul 22 '22 07:07

/assign @SimFG pls help on it

xiaofan-luan · Aug 10 '22 15:08

@xiaofan-luan: GitHub didn't allow me to assign the following users: pls, help, on, it.

Note that only milvus-io members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

/assign @SimFG pls help on it

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sre-ci-robot · Aug 10 '22 15:08

Confirmed the problem with @congqixia and reached the following conclusions. The querycoord processes tasks in a serial queue. The cause of the problem is that the release entered the queue while the load had not yet been executed; because the load takes a long time to execute, the release operation timed out. What needs to be defined here is the behavior of concurrent load and release operations. Since the querycoord refactor has already begun, this will be resolved when the new version of querycoord arrives. cc @ThreadDao @xiaofan-luan
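
For illustration only, a toy model of the serial-queue behavior described above (not the actual querycoord code): a release task enqueued behind a slow load exceeds its own deadline before it even starts.

    import queue
    import threading
    import time

    tasks = queue.Queue()

    def worker():
        # Serial executor: one task at a time, like the pre-refactor querycoord queue.
        while True:
            name, duration, deadline = tasks.get()
            if time.monotonic() > deadline:
                print(f"{name}: deadline exceeded before the task even started")
            else:
                time.sleep(duration)  # simulate the task doing its work
                print(f"{name}: done")
            tasks.task_done()

    threading.Thread(target=worker, daemon=True).start()

    now = time.monotonic()
    tasks.put(("load", 5.0, now + 30.0))    # long-running load with a generous deadline
    tasks.put(("release", 0.1, now + 3.0))  # release queued behind it with a short deadline
    tasks.join()

Running this prints that the load finishes while the release reports an exceeded deadline, which matches the error pattern seen in the client log.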

SimFG · Aug 11 '22 11:08

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] · Sep 10 '22 14:09

I will test with the new version of the querycoord

ThreadDao · Sep 14 '22 09:09

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] · Oct 31 '22 02:10