[Bug]: [benchmark] Milvus reboots during continuous concurrent querying.
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version:2.3-20240103-6bf46c6f
- Deployment mode(standalone or cluster):standalone
- MQ type(rocksmq, pulsar or kafka):
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
After inserting 1 million vectors and running continuous concurrent queries, Milvus restarted.
server:
perf-standalone95800-5-80-8957-etcd-0 1/1 Running 0 14m 10.104.19.67 4am-node28 <none> <none>
perf-standalone95800-5-80-8957-milvus-standalone-6fc69f7bckqdsm 1/1 Running 0 14m 10.104.28.156 4am-node33 <none> <none>
perf-standalone95800-5-80-8957-minio-7cfb98ff54-vnggq 1/1 Running 0 14m 10.104.27.10 4am-node31 <none> <none> (base.py:257)
[2024-01-03 17:54:26,915 - INFO - fouram]: [Cmd Exe] kubectl get pods -n qa-milvus -o wide | grep -E 'STATUS|perf-standalone95800-5-80-8957-milvus|perf-standalone95800-5-80-8957-minio|perf-standalone95800-5-80-8957-etcd|perf-standalone95800-5-80-8957-pulsar|perf-standalone95800-5-80-8957-kafka' (util_cmd.py:14)
[2024-01-03 17:54:36,225 - INFO - fouram]: [CliClient] pod details of release(perf-standalone95800-5-80-8957):
I0103 17:54:28.167476 515 request.go:665] Waited for 1.173618705s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/milvus.io/v1beta1?timeout=32s
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
perf-standalone95800-5-80-8957-etcd-0 1/1 Running 0 141m 10.104.19.67 4am-node28 <none> <none>
perf-standalone95800-5-80-8957-milvus-standalone-6fc69f7bckqdsm 1/1 Running 1 (113m ago) 141m 10.104.28.156 4am-node33 <none> <none>
perf-standalone95800-5-80-8957-minio-7cfb98ff54-vnggq 1/1 Running 0 141m 10.104.27.10 4am-node31 <none> <none>
argo task:
https://argo-workflows.zilliz.cc/workflows/qa/perf-standalone-1-1704295800?tab=workflow&nodeId=perf-standalone-1-1704295800-671032243&nodePanelView=inputs-outputs
client report: "code=1, message=service not ready[standalone=1]: Abnormal"
Expected Behavior
Milvus does not reboot.
Steps To Reproduce
1. create a collection or use an existing collection
2. build a DISKANN index on the vector column
3. insert 1m vectors
4. flush collection
5. build the index on the vector column again with the same parameters
6. count the total number of rows
7. load collection
8. execute concurrent queries with expr: `id in [1, 100, 1000]`
9. run step 8 continuously for 5 hours (a pymilvus sketch of these steps follows the list)
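For reference, a minimal pymilvus sketch of steps 1-9. This is not the fouram benchmark code itself; the collection name, vector dimension, metric type, batch size, connection address, and the single query worker are all assumptions.

```python
import random
import time
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="127.0.0.1", port="19530")  # assumed local standalone

dim = 128  # assumed vector dimension
schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("vector", DataType.FLOAT_VECTOR, dim=dim),
])
collection = Collection("fouram_diskann", schema)  # hypothetical collection name

index_params = {"index_type": "DISKANN", "metric_type": "L2", "params": {}}
collection.create_index("vector", index_params)            # step 2

batch = 10_000                                             # step 3: insert 1m vectors
for start in range(0, 1_000_000, batch):
    ids = list(range(start, start + batch))
    vectors = [[random.random() for _ in range(dim)] for _ in ids]
    collection.insert([ids, vectors])
collection.flush()                                         # step 4

collection.create_index("vector", index_params)            # step 5: same parameters
print(collection.num_entities)                             # step 6: row count
collection.load()                                          # step 7

deadline = time.time() + 5 * 3600                          # steps 8-9: query load for 5h
while time.time() < deadline:
    collection.query(expr="id in [1, 100, 1000]", output_fields=["id"])
```

In the benchmark the query loop runs with many concurrent workers (locust), not the single loop shown here.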
Milvus Log
No response
Anything else?
No response
/assign @congqixia /unassign
do we have any coredump file for this run? @elstic
The pod was killed by SIGTERM due to a health check (liveness probe) failure:
{"metadata":{"name":"perf-standalone95800-5-80-8957-milvus-standalone-6fc69f7bckqdsm.17a6e1a22dbbdd13","namespace":"qa-milvus","uid":"c0e649ac-930e-4ffc-ab15-4cc00e34a83b","resourceVersion":"555389320","creationTimestamp":"2024-01-03T15:59:55Z","managedFields":[{"manager":"kubelet","operation":"Update","apiVersion":"v1","time":"2024-01-03T15:59:55Z","fieldsType":"FieldsV1","fieldsV1":{"f:count":{},"f:firstTimestamp":{},"f:involvedObject":{},"f:lastTimestamp":{},"f:message":{},"f:reason":{},"f:source":{"f:component":{},"f:host":{}},"f:type":{}}}]},"involvedObject":{"kind":"Pod","namespace":"qa-milvus","name":"perf-standalone95800-5-80-8957-milvus-standalone-6fc69f7bckqdsm","uid":"f62b8a09-bf84-49bd-bf72-25f8e671fc9b","apiVersion":"v1","resourceVersion":"555373580","fieldPath":"spec.containers{standalone}"},"reason":"Unhealthy","message":"Liveness probe failed: HTTP probe failed with statuscode: 500","source":{"component":"kubelet","host":"4am-node33"},"firstTimestamp":"2024-01-03T15:59:55Z","lastTimestamp":"2024-01-03T15:59:55Z","count":1,"type":"Warning","eventTime":null,"reportingComponent":"","reportingInstance":""}
It looks like the known issue where the healthz check times out when pod CPU usage is too high.
BTW, the standalone was building an index during the period when the healthz check failed.
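For context, the event above shows the kubelet's HTTP liveness probe getting a 500. A minimal sketch of the same check, assuming the standalone pod exposes /healthz on the default metrics port 9091 and using the pod IP from the listing above:

```python
# Sketch of the liveness check the kubelet performs; port 9091 and the probe
# timeout value are assumptions, the pod IP is taken from the listing above.
import requests

resp = requests.get("http://10.104.28.156:9091/healthz", timeout=1)
# 200 when healthy; a 500 here matches the "Liveness probe failed" event,
# and under heavy CPU load the request may simply time out instead.
print(resp.status_code, resp.text)
```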
> do we have any coredump file for this run? @elstic
No, it's not enabled by default. Do you need a coredump?
@congqixia It may be an occasional problem; it didn't show up in yesterday's nightly run.
@elstic Since it's caused by SIGTERM, no coredump is needed. It's actually a known issue: the healthz check fails when CPU usage is high.
How long is the health check timeout right now?
Should we require 2 consecutive health check failures before restarting?
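For reference, the probe values currently in effect can be read from the running pod spec, e.g. `kubectl get pod <standalone pod> -n qa-milvus -o jsonpath='{.spec.containers[0].livenessProbe}'` (pod name as in the listings above). Allowing 2 consecutive failures corresponds to the Kubernetes `failureThreshold` field on the liveness probe; the per-request timeout is `timeoutSeconds`.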
The issue has come up again.
case: test_concurrent_locust_diskann_dml_dql_filter_standalone
server:
fouram-disk-sta52400-6-77-9926-etcd-0 1/1 Running 0 11m 10.104.23.98 4am-node27 <none> <none>
fouram-disk-sta52400-6-77-9926-milvus-standalone-87598bd947x9zr 1/1 Running 2 (108s ago) 11m 10.104.33.145 4am-node36 <none> <none>
fouram-disk-sta52400-6-77-9926-minio-5d9d7f6448-9zggt 1/1 Running 0 11m 10.104.18.79 4am-node25 <none> <none> (base.py:257)
[2024-01-16 02:15:13,208 - INFO - fouram]: [Cmd Exe] kubectl get pods -n qa-milvus -o wide | grep -E 'STATUS|fouram-disk-sta52400-6-77-9926-milvus|fouram-disk-sta52400-6-77-9926-minio|fouram-disk-sta52400-6-77-9926-etcd|fouram-disk-sta52400-6-77-9926-pulsar|fouram-disk-sta52400-6-77-9926-kafka' (util_cmd.py:14)
[2024-01-16 02:15:22,459 - INFO - fouram]: [CliClient] pod details of release(fouram-disk-sta52400-6-77-9926):
I0116 02:15:14.465335 4077 request.go:665] Waited for 1.154730399s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/events.k8s.io/v1beta1?timeout=32s
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
fouram-disk-sta52400-6-77-9926-etcd-0 1/1 Running 0 4h12m 10.104.23.209 4am-node27 <none> <none>
fouram-disk-sta52400-6-77-9926-milvus-standalone-87598bd947x9zr 1/1 Running 4 (4h5m ago) 5h12m 10.104.33.145 4am-node36 <none> <none>
fouram-disk-sta52400-6-77-9926-minio-5d9d7f6448-9zggt 1/1 Running 0 5h12m 10.104.18.79 4am-node25 <none> <none> (cli_client.py:132)
CPU overuse caused Milvus to reboot.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen
This issue has been fixed.