
[Bug]: [benchmark] milvus panic reported an error `query coordinator id allocator initialize failed`

Open · elstic opened this issue 10 months ago

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.4-20240424-e1b4ef74
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka):
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

argo task: new-stable-master-1713981600, id: 2

With a SCANN index built on 1 million inserted rows and the collection loaded, continued concurrent reads and writes caused Milvus to panic and disconnect from etcd, which made Milvus restart.

server

new-stable-mast81600-2-66-2426-etcd-0                             1/1     Running                           0                 5m40s   10.104.27.44    4am-node31   <none>           <none>
new-stable-mast81600-2-66-2426-milvus-standalone-7d6f45c9fhtmtn   1/1     Running                           1 (117s ago)      5m40s   10.104.20.17    4am-node22   <none>           <none>
new-stable-mast81600-2-66-2426-minio-58c486f9c8-ksxv7             1/1     Running                           0                 5m40s   10.104.27.52    4am-node31   <none>           <none> (base.py:257)
[2024-04-24 23:12:51,184 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'NAME|new-stable-mast81600-2-66-2426-milvus|new-stable-mast81600-2-66-2426-minio|new-stable-mast81600-2-66-2426-etcd|new-stable-mast81600-2-66-2426-pulsar|new-stable-mast81600-2-66-2426-zookeeper|new-stable-mast81600-2-66-2426-kafka|new-stable-mast81600-2-66-2426-log|new-stable-mast81600-2-66-2426-tikv'  (util_cmd.py:14)
[2024-04-24 23:13:01,232 -  INFO - fouram]: [CliClient] pod details of release(new-stable-mast81600-2-66-2426): 
 I0424 23:12:52.480523     451 request.go:665] Waited for 1.154773944s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/storage.k8s.io/v1beta1?timeout=32s
NAME                                                              READY   STATUS                            RESTARTS          AGE     IP              NODE         NOMINATED NODE   READINESS GATES
new-stable-mast81600-2-66-2426-etcd-0                             1/1     Running                           0                 5h10m   10.104.27.44    4am-node31   <none>           <none>
new-stable-mast81600-2-66-2426-milvus-standalone-7d6f45c9fhtmtn   1/1     Running                           3 (3h32m ago)     5h10m   10.104.20.17    4am-node22   <none>           <none>
new-stable-mast81600-2-66-2426-minio-58c486f9c8-ksxv7             1/1     Running                           0                 5h10m   10.104.27.52    4am-node31   <none>           <none> 

client pod: new-stable-master-1713981600-1576406165 [client log screenshot]

panic: [screenshot]

disconnected from etcd: [screenshot]

Expected Behavior

No response

Steps To Reproduce

1. create a collection
2. build a SCANN index on the vector column
3. insert 1m vectors
4. flush the collection
5. build the index on the vector column again with the same parameters
6. count the total number of rows
7. load the collection
8. execute concurrent search, query, flush, insert, delete
9. step 8 lasts 5h
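
For reference, a minimal pymilvus sketch of these steps (assumptions: a 128-dim float vector field, a local standalone endpoint, and illustrative collection/field names and index parameters; the benchmark's actual schema and its 5h concurrency harness are not in this report):

```python
# Repro sketch only; names, dimensions, and index params are illustrative.
import random
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="localhost", port="19530")

DIM = 128
schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=DIM),
])
coll = Collection("scann_repro", schema)                      # step 1

index_params = {"index_type": "SCANN", "metric_type": "L2",
                "params": {"nlist": 1024}}
coll.create_index("vec", index_params)                        # step 2

BATCH = 10_000
for start in range(0, 1_000_000, BATCH):                      # step 3: 1m vectors
    ids = list(range(start, start + BATCH))
    vecs = [[random.random() for _ in range(DIM)] for _ in ids]
    coll.insert([ids, vecs])

coll.flush()                                                  # step 4
coll.create_index("vec", index_params)                        # step 5: same params
print("row count:", coll.num_entities)                        # step 6
coll.load()                                                   # step 7

# steps 8-9: run search/query/flush/insert/delete concurrently for 5h;
# the benchmark's threading harness is omitted, one search shown for shape.
res = coll.search(
    data=[[random.random() for _ in range(DIM)]],
    anns_field="vec",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,
)
```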

Milvus Log

No response

Anything else?

test env: 4am cluster

elstic · Apr 25 '24 04:04

/assign @congqixia /unassign

yanliang567 · Apr 26 '24 01:04

[log screenshots]

From the logs above, some components failed to connect to etcd. Most likely it's an environment problem.
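
For what it's worth, one quick way to separate an etcd outage from a Milvus-side bug is to probe etcd's standard /health endpoint. A minimal sketch, assuming in-cluster network access to the etcd pod IP from the listing above on the default client port 2379:

```python
# Probe etcd health to check whether etcd itself was reachable at the time.
import json
from urllib.request import urlopen

ETCD_HEALTH_URL = "http://10.104.27.44:2379/health"  # etcd pod IP from the pod listing

try:
    with urlopen(ETCD_HEALTH_URL, timeout=5) as resp:
        status = json.load(resp)
    print("etcd reports health:", status.get("health"))  # expected: "true"
except OSError as exc:
    print("etcd unreachable:", exc)  # matches the 'disconnected from etcd' symptom
```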

congqixia · Apr 26 '24 06:04

The issue is fixed; I'll close it.

elstic · Jun 26 '24 07:06