[Bug]: [benchmark][cluster] Milvus reinstall with same PVC: search and query raise "<_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.DEADLINE_EXCEEDED>"
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: 2.2.0-20221118-b494b564
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): 2.2.0dev72
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
After a restart of server-instance fouram-tag-no-clean-p4l24-1 (server-configmap server-cluster-8c64m-querynode2, client-configmap client-random-locust-100m-hnsw-ddl-r8-w2-12h-con, image 2.2.0-20221117-c65306bc, pymilvus 2.2.0dev70),
Milvus was reinstalled with image 2.2.0-20221118-b494b564 as server-instance fouram-tag-no-clean-z2mmt-1 (server-configmap server-cluster-8c64m-querynode2-kafka, client-configmap client-random-locust-100m-hnsw-ddl-r8-w2-60h-con).
server:
fouram-tag-no-clean-z2mmt-1-etcd-0 1/1 Running 0 2d13h 10.104.5.118 4am-node12 <none> <none>
fouram-tag-no-clean-z2mmt-1-etcd-1 1/1 Running 0 2d13h 10.104.9.5 4am-node14 <none> <none>
fouram-tag-no-clean-z2mmt-1-etcd-2 1/1 Running 0 2d13h 10.104.1.8 4am-node10 <none> <none>
fouram-tag-no-clean-z2mmt-1-kafka-0 2/2 Running 0 2d13h 10.104.5.120 4am-node12 <none> <none>
fouram-tag-no-clean-z2mmt-1-kafka-1 2/2 Running 0 2d13h 10.104.9.7 4am-node14 <none> <none>
fouram-tag-no-clean-z2mmt-1-kafka-2 2/2 Running 0 2d13h 10.104.6.47 4am-node13 <none> <none>
fouram-tag-no-clean-z2mmt-1-kafka-exporter-85d6df8b68-gqj7t 1/1 Running 4 (2d13h ago) 2d13h 10.104.6.42 4am-node13 <none> <none>
fouram-tag-no-clean-z2mmt-1-milvus-datacoord-78f6c94c78-nq2dz 1/1 Running 0 2d13h 10.104.4.232 4am-node11 <none> <none>
fouram-tag-no-clean-z2mmt-1-milvus-datanode-79dc5fd57c-drnbm 1/1 Running 0 2d13h 10.104.4.233 4am-node11 <none> <none>
fouram-tag-no-clean-z2mmt-1-milvus-indexcoord-75587c7897-5qwxn 1/1 Running 0 2d13h 10.104.6.43 4am-node13 <none> <none>
fouram-tag-no-clean-z2mmt-1-milvus-indexnode-86b944456b-kmwlp 1/1 Running 0 2d13h 10.104.6.45 4am-node13 <none> <none>
fouram-tag-no-clean-z2mmt-1-milvus-proxy-8649b79c89-7tmbv 1/1 Running 0 2d13h 10.104.6.39 4am-node13 <none> <none>
fouram-tag-no-clean-z2mmt-1-milvus-querycoord-77d97b478-cw7rz 1/1 Running 0 2d13h 10.104.6.41 4am-node13 <none> <none>
fouram-tag-no-clean-z2mmt-1-milvus-querynode-678545b976-9pk7n 1/1 Running 0 2d13h 10.104.4.230 4am-node11 <none> <none>
fouram-tag-no-clean-z2mmt-1-milvus-querynode-678545b976-h8djj 1/1 Running 0 2d13h 10.104.6.40 4am-node13 <none> <none>
fouram-tag-no-clean-z2mmt-1-milvus-rootcoord-56bb644b5b-n2k4c 1/1 Running 0 2d13h 10.104.4.231 4am-node11 <none> <none>
fouram-tag-no-clean-z2mmt-1-minio-0 1/1 Running 0 2d13h 10.104.5.117 4am-node12 <none> <none>
fouram-tag-no-clean-z2mmt-1-minio-1 1/1 Running 0 2d13h 10.104.9.6 4am-node14 <none> <none>
fouram-tag-no-clean-z2mmt-1-minio-2 1/1 Running 0 2d13h 10.104.1.9 4am-node10 <none> <none>
fouram-tag-no-clean-z2mmt-1-minio-3 1/1 Running 0 2d13h 10.104.6.44 4am-node13 <none> <none>
fouram-tag-no-clean-z2mmt-1-zookeeper-0 1/1 Running 0 2d13h 10.104.6.46 4am-node13 <none> <none>
fouram-tag-no-clean-z2mmt-1-zookeeper-1 1/1 Running 0 2d13h 10.104.4.234 4am-node11 <none> <none>
fouram-tag-no-clean-z2mmt-1-zookeeper-2 1/1 Running 0 2d13h 10.104.5.119 4am-node12 <none> <none>
client log:
Expected Behavior
No response
Steps To Reproduce
1. create a collection
2. build hnsw index
3. insert 100m data
4. build index again
5. load collection
6. run search, load, query, and scene_test; search works normally
7. uninstall Milvus, reinstall Milvus with 2.2.0-20221118-b494b564, wait 1 hour
8. search and query raise the error above
Milvus Log
No response
Anything else?
data:
  config.yaml: |
    locust_random_concurrent_performance:
      collections:
        -
          collection_name: sift_100m_128_l2
          ni_per: 50000
          build_index: true
          index_type: hnsw
          index_param:
            M: 8
            efConstruction: 200
          task:
            types:
              -
                type: query
                weight: 8
                params:
                  top_k: 10
                  nq: 10
                  search_param:
                    ef: 16
              -
                type: load
                weight: 1
              -
                type: get
                weight: 8
                params:
                  ids_length: 10
              -
                type: scene_test
                weight: 2
            connection_num: 1
            clients_num: 20
            spawn_rate: 2
            during_time: 60h
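For reference, the HNSW build and search parameters in the config above correspond to pymilvus-style param dicts roughly as follows (a sketch; the variable names are illustrative, the values come from the config):

```python
# HNSW parameters from the locust config, in the dict shapes that
# pymilvus create_index()/search() expect.
index_params = {
    "index_type": "HNSW",
    "metric_type": "L2",   # collection is sift_100m_128_l2
    "params": {"M": 8, "efConstruction": 200},
}

search_params = {
    "metric_type": "L2",
    "params": {"ef": 16},  # ef must be >= top_k (10 here)
}

top_k = 10
nq = 10
assert search_params["params"]["ef"] >= top_k
```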
/assign @jiaoew1991 /unassign
/assign @weiliu1031 /unassign please take a look at this issue
- The query coord side sees a minute-level timestamp lag, so when a search/query task is submitted, it has to wait until the timestamp becomes serviceable.
- So far, we found that the timestamp consumed on the data node is already delayed.
Next, I will look into the reason for the timestamp delay.
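The "wait until the timestamp is serviceable" behavior can be sketched as follows (illustrative logic only, not the actual query node code; all names here are hypothetical):

```python
import time

def wait_serviceable(get_tsafe, guarantee_ts, timeout_s=60.0, poll_s=0.2):
    """Block until the channel's tsafe reaches guarantee_ts, or time out.

    get_tsafe: callable returning the channel's current tsafe.
    Mirrors the behavior described above: a search/query task submitted
    while tsafe lags its guarantee timestamp waits, then fails with a
    DEADLINE_EXCEEDED-style timeout after 60 s.
    """
    deadline = time.monotonic() + timeout_s
    while get_tsafe() < guarantee_ts:
        if time.monotonic() >= deadline:
            raise TimeoutError("tsafe never became serviceable within 60s")
        time.sleep(poll_s)
    return True

# Simulated tsafe that advances on each poll:
state = {"tsafe": 95}
def fake_tsafe():
    state["tsafe"] += 2
    return state["tsafe"]

assert wait_serviceable(fake_tsafe, guarantee_ts=100, poll_s=0.0)
```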
So far, we found that root coord sends time ticks (tt) to Kafka with a higher latency than expected (200 ms).
Conclusion:
The data node consumes from Kafka much more slowly than the query node, so the data node sends older delete messages to the query node. As a result, the query node's delta channel tsafe is minutes behind the DML channel, and a submitted task waits for the timestamp to become serviceable until it times out (60 s).
Root cause: Kafka's read and write performance appears to be unstable.
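Milvus hybrid timestamps carry the physical time in milliseconds in the high bits, with an 18-bit logical counter in the low bits, so the tsafe lag between channels can be estimated like this (a minimal sketch assuming the standard 18-bit logical part):

```python
LOGICAL_BITS = 18  # Milvus TSO: ts = (physical_ms << 18) | logical

def physical_ms(ts: int) -> int:
    """Extract the physical-time component (ms) from a hybrid timestamp."""
    return ts >> LOGICAL_BITS

def lag_seconds(dml_ts: int, delta_ts: int) -> float:
    """Lag of the delta channel's tsafe behind the DML channel, in seconds."""
    return (physical_ms(dml_ts) - physical_ms(delta_ts)) / 1000.0

# Example: delta tsafe 90 s behind DML, enough to exceed the 60 s
# task timeout described above.
dml = 1_668_700_000_000 << LOGICAL_BITS
delta = (1_668_700_000_000 - 90_000) << LOGICAL_BITS
assert lag_seconds(dml, delta) == 90.0
```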
Question: why does the Kafka produce take so long? If this happens in a production environment, how do we catch up? Maybe we can skip some of the tts while we are catching up?
For instance, if the delta tt lags behind (say, by 10 minutes), the resend logic could run every 1 min rather than every 200 ms.
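The suggestion above might look roughly like this (a sketch; the function name and thresholds are made up for illustration, not taken from the Milvus code):

```python
def tt_resend_interval_ms(lag_ms: int,
                          normal_ms: int = 200,
                          catchup_ms: int = 60_000,
                          lag_threshold_ms: int = 60_000) -> int:
    """Widen the time-tick resend interval while the delta tt is far
    behind, instead of resending every 200 ms. Thresholds illustrative.
    """
    return catchup_ms if lag_ms >= lag_threshold_ms else normal_ms

assert tt_resend_interval_ms(0) == 200               # healthy: 200 ms
assert tt_resend_interval_ms(10 * 60_000) == 60_000  # 10 min behind: 1 min
```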
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen
.