[Bug]: [benchmark] Milvus search failed and report error:"fail to search on all shard leaders"
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version:2.2.0-20230308-69f4afe4
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.3.0.dev45
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
argo task: fouramf-stable-1678302000, id: 10
case_name: test_concurrent_locust_diskann_compaction_cluster (querynode reboots multiple times), server:
fouramf-stable-1678302000-10-etcd-0 1/1 Running 0 5h7m 10.104.5.94 4am-node12 <none> <none>
fouramf-stable-1678302000-10-etcd-1 1/1 Running 0 5h7m 10.104.9.172 4am-node14 <none> <none>
fouramf-stable-1678302000-10-etcd-2 1/1 Running 0 5h7m 10.104.4.209 4am-node11 <none> <none>
fouramf-stable-1678302000-10-milvus-datacoord-6955d76b7d-hqfx6 1/1 Running 1 (5h3m ago) 5h7m 10.104.4.193 4am-node11 <none> <none>
fouramf-stable-1678302000-10-milvus-datanode-d5fcc6fd5-hlnlz 1/1 Running 1 (5h3m ago) 5h7m 10.104.1.175 4am-node10 <none> <none>
fouramf-stable-1678302000-10-milvus-indexcoord-7c45c746b9-m7zj8 1/1 Running 1 (5h3m ago) 5h7m 10.104.9.147 4am-node14 <none> <none>
fouramf-stable-1678302000-10-milvus-indexnode-89ff9dd98-p8fwf 1/1 Running 0 5h7m 10.104.5.83 4am-node12 <none> <none>
fouramf-stable-1678302000-10-milvus-proxy-6958fdc6cc-k6gbz 1/1 Running 1 (5h3m ago) 5h7m 10.104.9.148 4am-node14 <none> <none>
fouramf-stable-1678302000-10-milvus-querycoord-74489bf886-ncnph 1/1 Running 1 (5h3m ago) 5h7m 10.104.1.174 4am-node10 <none> <none>
fouramf-stable-1678302000-10-milvus-querynode-669c768656-27sn4 1/1 Running 3 (4h58m ago) 5h7m 10.104.9.149 4am-node14 <none> <none>
fouramf-stable-1678302000-10-milvus-rootcoord-695dd989f4-9z6x9 1/1 Running 1 (5h3m ago) 5h7m 10.104.9.151 4am-node14 <none> <none>
fouramf-stable-1678302000-10-minio-0 1/1 Running 0 5h7m 10.104.9.171 4am-node14 <none> <none>
fouramf-stable-1678302000-10-minio-1 1/1 Running 0 5h7m 10.104.4.198 4am-node11 <none> <none>
fouramf-stable-1678302000-10-minio-2 1/1 Running 0 5h7m 10.104.1.183 4am-node10 <none> <none>
fouramf-stable-1678302000-10-minio-3 1/1 Running 0 5h7m 10.104.5.96 4am-node12 <none> <none>
fouramf-stable-1678302000-10-pulsar-bookie-0 1/1 Running 0 5h7m 10.104.4.208 4am-node11 <none> <none>
fouramf-stable-1678302000-10-pulsar-bookie-1 1/1 Running 0 5h7m 10.104.1.186 4am-node10 <none> <none>
fouramf-stable-1678302000-10-pulsar-bookie-2 1/1 Running 0 5h7m 10.104.9.177 4am-node14 <none> <none>
fouramf-stable-1678302000-10-pulsar-bookie-init-j6jkq 0/1 Completed 0 5h7m 10.104.9.150 4am-node14 <none> <none>
fouramf-stable-1678302000-10-pulsar-broker-0 1/1 Running 0 5h7m 10.104.1.177 4am-node10 <none> <none>
fouramf-stable-1678302000-10-pulsar-proxy-0 1/1 Running 0 5h7m 10.104.4.191 4am-node11 <none> <none>
fouramf-stable-1678302000-10-pulsar-pulsar-init-4fzfj 0/1 Completed 0 5h7m 10.104.4.192 4am-node11 <none> <none>
fouramf-stable-1678302000-10-pulsar-recovery-0 1/1 Running 0 5h7m 10.104.9.152 4am-node14 <none> <none>
fouramf-stable-1678302000-10-pulsar-zookeeper-0 1/1 Running 0 5h7m 10.104.9.170 4am-node14 <none> <none>
fouramf-stable-1678302000-10-pulsar-zookeeper-1 1/1 Running 0 5h6m 10.104.1.203 4am-node10 <none> <none>
fouramf-stable-1678302000-10-pulsar-zookeeper-2 1/1 Running 0 5h5m 10.104.4.223 4am-node11 <none> <none>
client pod: fouramf-stable-1678302000-2776414840
client error:
Expected Behavior
No response
Steps To Reproduce
1. create a collection or use an existing collection
2. build index on vector column
3. insert a certain number of vectors
4. flush collection
5. build index on vector column with the same parameters
6. build index on scalar columns (optional)
7. count the total number of rows
8. load collection
9. perform concurrent operations (a minimal sketch of these steps follows this list)
10. clean up all collections (optional)
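A minimal pymilvus sketch of steps 1-8, for orientation only. The collection name, dimension, row count, and index parameters below are assumptions, not the benchmark's exact configuration.

```python
import random
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="127.0.0.1", port="19530")

dim = 128  # assumed dimension
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema("float_1", DataType.FLOAT),                       # scalar column for filtering
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=dim),
]
collection = Collection("fouram_demo", CollectionSchema(fields))  # step 1

index_params = {"index_type": "DISKANN", "metric_type": "L2", "params": {}}
collection.create_index("float_vector", index_params)             # step 2

n = 10000  # assumed row count
rows = [
    list(range(n)),
    [random.random() for _ in range(n)],
    [[random.random() for _ in range(dim)] for _ in range(n)],
]
collection.insert(rows)                                           # step 3
collection.flush()                                                # step 4
collection.create_index("float_vector", index_params)             # step 5, same params
collection.create_index("float_1", index_name="idx_float_1")      # step 6, optional scalar index
print(collection.num_entities)                                    # step 7
collection.load()                                                 # step 8
```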
Milvus Log
No response
Anything else?
No response
argo task: fouramf-stable-1678302000, id: 6
case name: test_concurrent_locust_diskann_dml_dql_filter_cluster, server:
fouramf-stable-1678302000-6-etcd-0 1/1 Running 0 5h7m 10.104.5.109 4am-node12 <none> <none>
fouramf-stable-1678302000-6-etcd-1 1/1 Running 0 5h7m 10.104.6.225 4am-node13 <none> <none>
fouramf-stable-1678302000-6-etcd-2 1/1 Running 0 5h7m 10.104.4.217 4am-node11 <none> <none>
fouramf-stable-1678302000-6-milvus-datacoord-76bc98b585-whwg2 1/1 Running 1 (5h3m ago) 5h7m 10.104.5.87 4am-node12 <none> <none>
fouramf-stable-1678302000-6-milvus-datanode-748bc4489c-bz7wk 1/1 Running 1 (5h3m ago) 5h7m 10.104.4.195 4am-node11 <none> <none>
fouramf-stable-1678302000-6-milvus-indexcoord-7f59d66b7f-6chkt 1/1 Running 1 (5h3m ago) 5h7m 10.104.6.212 4am-node13 <none> <none>
fouramf-stable-1678302000-6-milvus-indexnode-8645d599b-6q87p 1/1 Running 0 5h7m 10.104.9.161 4am-node14 <none> <none>
fouramf-stable-1678302000-6-milvus-proxy-5f45674595-4w8p2 1/1 Running 1 (5h3m ago) 5h7m 10.104.6.213 4am-node13 <none> <none>
fouramf-stable-1678302000-6-milvus-querycoord-79556d7988-7r2zz 1/1 Running 1 (5h3m ago) 5h7m 10.104.5.88 4am-node12 <none> <none>
fouramf-stable-1678302000-6-milvus-querynode-58c996d994-f8f4n 1/1 Running 1 (5h ago) 5h7m 10.104.6.214 4am-node13 <none> <none>
fouramf-stable-1678302000-6-milvus-rootcoord-65449775d6-q9zpj 1/1 Running 1 (5h3m ago) 5h7m 10.104.5.84 4am-node12 <none> <none>
fouramf-stable-1678302000-6-minio-0 1/1 Running 0 5h7m 10.104.5.104 4am-node12 <none> <none>
fouramf-stable-1678302000-6-minio-1 1/1 Running 0 5h7m 10.104.1.193 4am-node10 <none> <none>
fouramf-stable-1678302000-6-minio-2 1/1 Running 0 5h7m 10.104.6.223 4am-node13 <none> <none>
fouramf-stable-1678302000-6-minio-3 1/1 Running 0 5h7m 10.104.4.216 4am-node11 <none> <none>
fouramf-stable-1678302000-6-pulsar-bookie-0 1/1 Running 0 5h7m 10.104.1.200 4am-node10 <none> <none>
fouramf-stable-1678302000-6-pulsar-bookie-1 1/1 Running 0 5h7m 10.104.5.113 4am-node12 <none> <none>
fouramf-stable-1678302000-6-pulsar-bookie-2 1/1 Running 0 5h7m 10.104.6.228 4am-node13 <none> <none>
fouramf-stable-1678302000-6-pulsar-bookie-init-cb6c8 0/1 Completed 0 5h7m 10.104.1.178 4am-node10 <none> <none>
fouramf-stable-1678302000-6-pulsar-broker-0 1/1 Running 0 5h7m 10.104.1.181 4am-node10 <none> <none>
fouramf-stable-1678302000-6-pulsar-proxy-0 1/1 Running 0 5h7m 10.104.5.89 4am-node12 <none> <none>
fouramf-stable-1678302000-6-pulsar-pulsar-init-lhxlx 0/1 Completed 0 5h7m 10.104.9.162 4am-node14 <none> <none>
fouramf-stable-1678302000-6-pulsar-recovery-0 1/1 Running 0 5h7m 10.104.5.90 4am-node12 <none> <none>
fouramf-stable-1678302000-6-pulsar-zookeeper-0 1/1 Running 0 5h7m 10.104.5.108 4am-node12 <none> <none>
fouramf-stable-1678302000-6-pulsar-zookeeper-1 1/1 Running 0 5h5m 10.104.4.221 4am-node11 <none> <none>
fouramf-stable-1678302000-6-pulsar-zookeeper-2 1/1 Running 0 5h4m 10.104.1.207 4am-node10 <none> <none>
client error:
Steps To Reproduce
1. create a collection or use an existing collection
2. build index on vector column
3. insert a certain number of vectors
4. flush collection
5. build index on vector column with the same parameters
6. build index on scalar columns (optional)
7. count the total number of rows
8. load collection
9. perform concurrent operations (a sketch of this step follows the list)
10. clean up all collections (optional)
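A hedged sketch of step 9 only: concurrent searches against the loaded collection, which is the call that returns "fail to search on all shard leaders". The thread count, nq, and DISKANN search_list below are assumptions; the benchmark itself drives this step with locust, while a plain thread pool keeps the example self-contained.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from pymilvus import connections, Collection

connections.connect(host="127.0.0.1", port="19530")
collection = Collection("fouram_demo")  # assumed name from the setup sketch above

def one_search(_):
    vectors = [[random.random() for _ in range(128)]]  # nq = 1, assumed dim = 128
    # "fail to search on all shard leaders" is reported from this call when the
    # querynode serving the shard has restarted or has not finished reloading.
    return collection.search(
        data=vectors,
        anns_field="float_vector",
        param={"metric_type": "L2", "params": {"search_list": 30}},
        limit=10,
    )

with ThreadPoolExecutor(max_workers=20) as pool:
    list(pool.map(one_search, range(1000)))
```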
/assign @jiaoew1991 /unassign
/assign @aoiasd /unassign
Case 1: QueryCoord updated the current target, and two segments are to be dropped in the next target. Before the next target was reached, the QueryNode restarted and did not reload the dropped segments, so searches failed because those two segments could not be found until the target advanced to the next target. Case 2: the QueryNode simply had not finished reloading after its restart.
The remaining question is why the QueryNode crashed for no apparent reason; we could not find any panic or C++ error in the logs.
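An illustrative-only model of the Case 1 race, to make the explanation concrete. All names here are hypothetical and not Milvus internals: the current target still lists two segments that the next target drops; after a QueryNode restart only the next target's segments are reloaded, so a search validated against the current target cannot find those two segments until the target advances.

```python
current_target = {"seg1", "seg2", "seg3", "seg4"}   # seg3/seg4 will be dropped in the next target
next_target    = {"seg1", "seg2"}

loaded_after_restart = set(next_target)             # restarted QueryNode reloads only the next target

def search(required_segments, loaded_segments):
    missing = required_segments - loaded_segments
    if missing:
        raise RuntimeError(f"fail to search on all shard leaders, missing {sorted(missing)}")

try:
    search(current_target, loaded_after_restart)    # fails while the current target is still in effect
except RuntimeError as e:
    print(e)

current_target = next_target                        # target update resolves the mismatch
search(current_target, loaded_after_restart)        # succeeds
```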
@aoiasd @xige-16 Maybe you can look into this together.
This issue still exists.
fouramf-stable-1680548400-6-etcd-0 1/1 Running 0 5h8m 10.104.9.131 4am-node14 <none> <none>
fouramf-stable-1680548400-6-etcd-1 1/1 Running 0 5h8m 10.104.4.233 4am-node11 <none> <none>
fouramf-stable-1680548400-6-etcd-2 1/1 Running 0 5h8m 10.104.1.124 4am-node10 <none> <none>
fouramf-stable-1680548400-6-milvus-datacoord-7d8685984c-vlbb8 1/1 Running 1 (5h4m ago) 5h8m 10.104.1.104 4am-node10 <none> <none>
fouramf-stable-1680548400-6-milvus-datanode-8b754c95b-sxztr 1/1 Running 1 (5h4m ago) 5h8m 10.104.1.105 4am-node10 <none> <none>
fouramf-stable-1680548400-6-milvus-indexcoord-855777b5c4-xkvwr 1/1 Running 1 (5h4m ago) 5h8m 10.104.6.83 4am-node13 <none> <none>
fouramf-stable-1680548400-6-milvus-indexnode-874d4684c-gljpx 1/1 Running 0 5h8m 10.104.5.36 4am-node12 <none> <none>
fouramf-stable-1680548400-6-milvus-proxy-54fdf6d8c6-8z6m2 1/1 Running 1 (5h4m ago) 5h8m 10.104.9.126 4am-node14 <none> <none>
fouramf-stable-1680548400-6-milvus-querycoord-669899bd89-4nzhv 1/1 Running 1 (5h4m ago) 5h8m 10.104.9.124 4am-node14 <none> <none>
fouramf-stable-1680548400-6-milvus-querynode-7f5f8548c7-mxz5b 1/1 Running 1 (5h ago) 5h8m 10.104.6.85 4am-node13 <none> <none>
fouramf-stable-1680548400-6-milvus-rootcoord-5d75c4fb55-nsxdb 1/1 Running 1 (5h4m ago) 5h8m 10.104.6.84 4am-node13 <none> <none>
fouramf-stable-1680548400-6-minio-0 1/1 Running 0 5h8m 10.104.4.235 4am-node11 <none> <none>
fouramf-stable-1680548400-6-minio-1 1/1 Running 0 5h8m 10.104.9.136 4am-node14 <none> <none>
fouramf-stable-1680548400-6-minio-2 1/1 Running 0 5h8m 10.104.1.128 4am-node10 <none> <none>
fouramf-stable-1680548400-6-minio-3 1/1 Running 0 5h8m 10.104.5.50 4am-node12 <none> <none>
fouramf-stable-1680548400-6-pulsar-bookie-0 1/1 Running 0 5h8m 10.104.4.231 4am-node11 <none> <none>
fouramf-stable-1680548400-6-pulsar-bookie-1 1/1 Running 0 5h8m 10.104.9.134 4am-node14 <none> <none>
fouramf-stable-1680548400-6-pulsar-bookie-2 1/1 Running 0 5h8m 10.104.1.127 4am-node10 <none> <none>
fouramf-stable-1680548400-6-pulsar-bookie-init-q86vw 0/1 Completed 0 5h8m 10.104.4.208 4am-node11 <none> <none>
fouramf-stable-1680548400-6-pulsar-broker-0 1/1 Running 0 5h8m 10.104.4.206 4am-node11 <none> <none>
fouramf-stable-1680548400-6-pulsar-proxy-0 1/1 Running 0 5h8m 10.104.4.207 4am-node11 <none> <none>
fouramf-stable-1680548400-6-pulsar-pulsar-init-g2s8v 0/1 Completed 0 5h8m 10.104.4.205 4am-node11 <none> <none>
fouramf-stable-1680548400-6-pulsar-recovery-0 1/1 Running 0 5h8m 10.104.9.123 4am-node14 <none> <none>
fouramf-stable-1680548400-6-pulsar-zookeeper-0 1/1 Running 0 5h8m 10.104.4.229 4am-node11 <none> <none>
fouramf-stable-1680548400-6-pulsar-zookeeper-1 1/1 Running 0 5h6m 10.104.9.154 4am-node14 <none> <none>
fouramf-stable-1680548400-6-pulsar-zookeeper-2 1/1 Running 0 5h4m 10.104.4.3 4am-node11 <none> <none>
full client log: fouram_log (1).log.zip
This issue still exists.
image: 2.2.6-20230413-d0e87113 (expected 2.2.6 release version); case_name: test_concurrent_locust_diskann_compaction_cluster; argo task: fouramf-stable-2gdbz, id: 10
server:
fouramf-stable-2gdbz-10-etcd-0 1/1 Running 0 5h8m 10.104.6.73 4am-node13 <none> <none>
fouramf-stable-2gdbz-10-etcd-1 1/1 Running 0 5h7m 10.104.9.15 4am-node14 <none> <none>
fouramf-stable-2gdbz-10-etcd-2 1/1 Running 0 5h7m 10.104.4.158 4am-node11 <none> <none>
fouramf-stable-2gdbz-10-milvus-datacoord-5f6c7db7db-gw228 1/1 Running 1 (5h4m ago) 5h8m 10.104.5.235 4am-node12 <none> <none>
fouramf-stable-2gdbz-10-milvus-datanode-57db6fc569-bzlcc 1/1 Running 1 (5h4m ago) 5h8m 10.104.4.130 4am-node11 <none> <none>
fouramf-stable-2gdbz-10-milvus-indexcoord-6df4586695-jgsj6 1/1 Running 1 (5h3m ago) 5h8m 10.104.9.245 4am-node14 <none> <none>
fouramf-stable-2gdbz-10-milvus-indexnode-6fbd8bd696-z9vc7 1/1 Running 0 5h8m 10.104.5.236 4am-node12 <none> <none>
fouramf-stable-2gdbz-10-milvus-proxy-bd6d746c5-hg7hr 1/1 Running 1 (5h3m ago) 5h8m 10.104.9.250 4am-node14 <none> <none>
fouramf-stable-2gdbz-10-milvus-querycoord-76cdb79456-mwnqg 1/1 Running 1 (5h4m ago) 5h8m 10.104.4.131 4am-node11 <none> <none>
fouramf-stable-2gdbz-10-milvus-querynode-85b54bdf5d-2694r 1/1 Running 1 (5h ago) 5h8m 10.104.1.87 4am-node10 <none> <none>
fouramf-stable-2gdbz-10-milvus-rootcoord-6986664fbb-mm72v 1/1 Running 1 (5h3m ago) 5h8m 10.104.5.237 4am-node12 <none> <none>
fouramf-stable-2gdbz-10-minio-0 1/1 Running 0 5h8m 10.104.4.154 4am-node11 <none> <none>
fouramf-stable-2gdbz-10-minio-1 1/1 Running 0 5h8m 10.104.1.113 4am-node10 <none> <none>
fouramf-stable-2gdbz-10-minio-2 1/1 Running 0 5h8m 10.104.5.8 4am-node12 <none> <none>
fouramf-stable-2gdbz-10-minio-3 1/1 Running 0 5h7m 10.104.6.76 4am-node13 <none> <none>
fouramf-stable-2gdbz-10-pulsar-bookie-0 1/1 Running 0 5h8m 10.104.9.13 4am-node14 <none> <none>
fouramf-stable-2gdbz-10-pulsar-bookie-1 1/1 Running 0 5h7m 10.104.6.75 4am-node13 <none> <none>
fouramf-stable-2gdbz-10-pulsar-bookie-2 1/1 Running 0 5h7m 10.104.4.159 4am-node11 <none> <none>
fouramf-stable-2gdbz-10-pulsar-bookie-init-qwtj4 0/1 Completed 0 5h8m 10.104.9.249 4am-node14 <none> <none>
fouramf-stable-2gdbz-10-pulsar-broker-0 1/1 Running 0 5h8m 10.104.6.45 4am-node13 <none> <none>
fouramf-stable-2gdbz-10-pulsar-proxy-0 1/1 Running 0 5h8m 10.104.4.133 4am-node11 <none> <none>
fouramf-stable-2gdbz-10-pulsar-pulsar-init-cmxz4 0/1 Completed 0 5h8m 10.104.9.247 4am-node14 <none> <none>
fouramf-stable-2gdbz-10-pulsar-recovery-0 1/1 Running 0 5h8m 10.104.9.248 4am-node14 <none> <none>
fouramf-stable-2gdbz-10-pulsar-zookeeper-0 1/1 Running 0 5h8m 10.104.9.12 4am-node14 <none> <none>
fouramf-stable-2gdbz-10-pulsar-zookeeper-1 1/1 Running 0 5h5m 10.104.1.122 4am-node10 <none> <none>
fouramf-stable-2gdbz-10-pulsar-zookeeper-2 1/1 Running 0 5h4m 10.104.4.178 4am-node11 <none> <none>
client error log:
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen
The issue hasn't come up again. Verified with image 2.2.0-20230803-6a20862c.