
[Bug]: [benchmark] Milvus reboots during continuous concurrent querying.

Open elstic opened this issue 1 year ago • 10 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.3-20240103-6bf46c6f
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

After inserting 1 million vectors and querying continuously, Milvus restarted.

server

perf-standalone95800-5-80-8957-etcd-0                             1/1     Running     0                14m     10.104.19.67    4am-node28   <none>           <none>
perf-standalone95800-5-80-8957-milvus-standalone-6fc69f7bckqdsm   1/1     Running     0                14m     10.104.28.156   4am-node33   <none>           <none>
perf-standalone95800-5-80-8957-minio-7cfb98ff54-vnggq             1/1     Running     0                14m     10.104.27.10    4am-node31   <none>           <none> (base.py:257)
[2024-01-03 17:54:26,915 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'STATUS|perf-standalone95800-5-80-8957-milvus|perf-standalone95800-5-80-8957-minio|perf-standalone95800-5-80-8957-etcd|perf-standalone95800-5-80-8957-pulsar|perf-standalone95800-5-80-8957-kafka'  (util_cmd.py:14)
[2024-01-03 17:54:36,225 -  INFO - fouram]: [CliClient] pod details of release(perf-standalone95800-5-80-8957): 
 I0103 17:54:28.167476     515 request.go:665] Waited for 1.173618705s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/milvus.io/v1beta1?timeout=32s
NAME                                                              READY   STATUS      RESTARTS           AGE     IP              NODE         NOMINATED NODE   READINESS GATES
perf-standalone95800-5-80-8957-etcd-0                             1/1     Running     0                  141m    10.104.19.67    4am-node28   <none>           <none>
perf-standalone95800-5-80-8957-milvus-standalone-6fc69f7bckqdsm   1/1     Running     1 (113m ago)       141m    10.104.28.156   4am-node33   <none>           <none>
perf-standalone95800-5-80-8957-minio-7cfb98ff54-vnggq             1/1     Running     0                  141m    10.104.27.10    4am-node31   <none>           <none>

argo task :

https://argo-workflows.zilliz.cc/workflows/qa/perf-standalone-1-1704295800?tab=workflow&nodeId=perf-standalone-1-1704295800-671032243&nodePanelView=inputs-outputs

client report: "code=1, message=service not ready[standalone=1]: Abnormal" image

Expected Behavior

Milvus does not reboot.

Steps To Reproduce

1. create a collection or use an existing collection
2. build a DISKANN index on the vector column
3. insert 1m vectors
4. flush the collection
5. build the index on the vector column with the same parameters
6. count the total number of rows
7. load the collection
8. execute concurrent queries
   query expr: id in [1, 100, 1000]
9. step 8 lasts 5h (see the pymilvus sketch below)
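
For reference, a minimal pymilvus sketch of these steps (this is not the fouram/locust harness used in the benchmark; the collection name, dimension, batch size, host/port and concurrency level are placeholders):

```python
# Rough repro sketch of steps 1-9, assuming a local Milvus standalone on 19530.
import random
import time
from concurrent.futures import ThreadPoolExecutor

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="127.0.0.1", port="19530")  # adjust for your deployment

dim = 128  # placeholder dimension
schema = CollectionSchema([
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=dim),
])
collection = Collection(name="repro_diskann", schema=schema)        # step 1

index_params = {"index_type": "DISKANN", "metric_type": "L2", "params": {}}
collection.create_index(field_name="vector", index_params=index_params)  # step 2

batch = 10_000
for start in range(0, 1_000_000, batch):                            # step 3: insert 1m vectors
    ids = list(range(start, start + batch))
    vectors = [[random.random() for _ in range(dim)] for _ in ids]
    collection.insert([ids, vectors])

collection.flush()                                                  # step 4
collection.create_index(field_name="vector", index_params=index_params)  # step 5: same parameters
print("row count:", collection.num_entities)                        # step 6
collection.load()                                                   # step 7

def query_worker(stop_at):                                          # step 8: concurrent query
    while time.time() < stop_at:
        collection.query(expr="id in [1, 100, 1000]")

stop_at = time.time() + 5 * 3600                                    # step 9: run for 5h
with ThreadPoolExecutor(max_workers=80) as pool:                    # concurrency is a placeholder
    for _ in range(80):
        pool.submit(query_worker, stop_at)
```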

Milvus Log

No response

Anything else?

No response

elstic avatar Jan 04 '24 04:01 elstic

/assign @congqixia /unassign

yanliang567 avatar Jan 04 '24 11:01 yanliang567

do we have any coredump file for this run? @elstic

congqixia avatar Jan 04 '24 11:01 congqixia

The pod was killed by SIGTERM due to a health check failure. The corresponding Kubernetes event:

{"metadata":{"name":"perf-standalone95800-5-80-8957-milvus-standalone-6fc69f7bckqdsm.17a6e1a22dbbdd13","namespace":"qa-milvus","uid":"c0e649ac-930e-4ffc-ab15-4cc00e34a83b","resourceVersion":"555389320","creationTimestamp":"2024-01-03T15:59:55Z","managedFields":[{"manager":"kubelet","operation":"Update","apiVersion":"v1","time":"2024-01-03T15:59:55Z","fieldsType":"FieldsV1","fieldsV1":{"f:count":{},"f:firstTimestamp":{},"f:involvedObject":{},"f:lastTimestamp":{},"f:message":{},"f:reason":{},"f:source":{"f:component":{},"f:host":{}},"f:type":{}}}]},"involvedObject":{"kind":"Pod","namespace":"qa-milvus","name":"perf-standalone95800-5-80-8957-milvus-standalone-6fc69f7bckqdsm","uid":"f62b8a09-bf84-49bd-bf72-25f8e671fc9b","apiVersion":"v1","resourceVersion":"555373580","fieldPath":"spec.containers{standalone}"},"reason":"Unhealthy","message":"Liveness probe failed: HTTP probe failed with statuscode: 500","source":{"component":"kubelet","host":"4am-node33"},"firstTimestamp":"2024-01-03T15:59:55Z","lastTimestamp":"2024-01-03T15:59:55Z","count":1,"type":"Warning","eventTime":null,"reportingComponent":"","reportingInstance":""}

It looks like the known issue where the healthz check times out when pod CPU usage is too high.
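
For context, the liveness probe here is an HTTP GET against Milvus's healthz endpoint (served on the metrics port, 9091 by default). A rough sketch of polling that endpoint from outside the cluster, to watch for slow responses or 500s while the node is CPU-saturated (the port-forwarded URL is an assumption):

```python
# Sketch: poll the same /healthz endpoint the liveness probe hits, with a
# kubelet-like short timeout. Assumes the endpoint is reachable locally,
# e.g. via `kubectl port-forward <standalone-pod> 9091:9091`.
import time
import requests

HEALTHZ_URL = "http://127.0.0.1:9091/healthz"  # adjust for your deployment

while True:
    start = time.monotonic()
    try:
        resp = requests.get(HEALTHZ_URL, timeout=1)
        print(f"{resp.status_code} in {time.monotonic() - start:.3f}s: {resp.text.strip()}")
    except requests.RequestException as exc:
        print(f"probe failed after {time.monotonic() - start:.3f}s: {exc}")
    time.sleep(5)
```

If the endpoint starts returning 500s or exceeding the probe timeout while the index build saturates the CPU, the kubelet restart recorded in the event above is the expected outcome.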

congqixia avatar Jan 04 '24 11:01 congqixia

BTW, the standalone was building an index during the healthz check failure period.

congqixia avatar Jan 04 '24 11:01 congqixia

do we have any coredump file for this run? @elstic

No, it's not on by default. Do you need a coredump?

elstic avatar Jan 04 '24 12:01 elstic

@congqixia It may be an occasional problem; it didn't show up in yesterday's nightly.

elstic avatar Jan 05 '24 02:01 elstic

do we have any coredump file for this run? @elstic

No, it's not on by default. Do you need a coredump?

@elstic Since it's caused by SIGTERM, no coredump is needed. Actually, it's a known issue that the healthz check fails when CPU usage is high.

congqixia avatar Jan 05 '24 03:01 congqixia

how long is the health check timeout for now?

xiaofan-luan avatar Jan 05 '24 06:01 xiaofan-luan

Should we change it to require 2 consecutive health check failures?

xiaofan-luan avatar Jan 05 '24 06:01 xiaofan-luan
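
For reference, the current probe settings can be read off the pod spec with the Kubernetes Python client; a minimal sketch, where the namespace and label selector are assumptions based on the pod listings above:

```python
# Sketch: print the current liveness probe timeout / period / failure threshold
# for the Milvus standalone pod. Adjust namespace and label selector for your
# deployment; requires kubeconfig (or in-cluster) access.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="qa-milvus",
    label_selector="app.kubernetes.io/name=milvus",
)
for pod in pods.items:
    for c in pod.spec.containers:
        probe = c.liveness_probe
        if probe is None:
            continue
        print(
            f"{pod.metadata.name}/{c.name}: "
            f"timeoutSeconds={probe.timeout_seconds}, "
            f"periodSeconds={probe.period_seconds}, "
            f"failureThreshold={probe.failure_threshold}"
        )
```

If the answer is a single failure with a short timeoutSeconds, then raising failureThreshold to 2 (or lengthening timeoutSeconds) on the liveness probe is how the "2 consecutive failures" idea would be expressed at the Kubernetes level.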

The issue has come up again.

case: test_concurrent_locust_diskann_dml_dql_filter_standalone

server:

fouram-disk-sta52400-6-77-9926-etcd-0                             1/1     Running                  0                 11m     10.104.23.98    4am-node27   <none>           <none>
fouram-disk-sta52400-6-77-9926-milvus-standalone-87598bd947x9zr   1/1     Running                  2 (108s ago)      11m     10.104.33.145   4am-node36   <none>           <none>
fouram-disk-sta52400-6-77-9926-minio-5d9d7f6448-9zggt             1/1     Running                  0                 11m     10.104.18.79    4am-node25   <none>           <none> (base.py:257)
[2024-01-16 02:15:13,208 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'STATUS|fouram-disk-sta52400-6-77-9926-milvus|fouram-disk-sta52400-6-77-9926-minio|fouram-disk-sta52400-6-77-9926-etcd|fouram-disk-sta52400-6-77-9926-pulsar|fouram-disk-sta52400-6-77-9926-kafka'  (util_cmd.py:14)
[2024-01-16 02:15:22,459 -  INFO - fouram]: [CliClient] pod details of release(fouram-disk-sta52400-6-77-9926): 
 I0116 02:15:14.465335    4077 request.go:665] Waited for 1.154730399s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/events.k8s.io/v1beta1?timeout=32s
NAME                                                              READY   STATUS                   RESTARTS          AGE     IP              NODE         NOMINATED NODE   READINESS GATES
fouram-disk-sta52400-6-77-9926-etcd-0                             1/1     Running                  0                 4h12m   10.104.23.209   4am-node27   <none>           <none>
fouram-disk-sta52400-6-77-9926-milvus-standalone-87598bd947x9zr   1/1     Running                  4 (4h5m ago)      5h12m   10.104.33.145   4am-node36   <none>           <none>
fouram-disk-sta52400-6-77-9926-minio-5d9d7f6448-9zggt             1/1     Running                  0                 5h12m   10.104.18.79    4am-node25   <none>           <none> (cli_client.py:132)

CPU overuse caused Milvus to reboot.

elstic avatar Jan 16 '24 03:01 elstic

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Jun 15 '24 16:06 stale[bot]

This issue has been fixed.

elstic avatar Jun 17 '24 06:06 elstic