
[Bug]: Milvus deployment may fail because MinIO is not ready

Open · zhuwenxing opened this issue 1 year ago · 4 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.2.0-20230309-130ab6da
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior


Milvus components keep restarting.

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

image tag: 2.2.0-20230309-130ab6da
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/2576/pipeline/
log: artifacts-querynode-pod-kill-2576-server-logs (1).tar.gz

Anything else?

No response

zhuwenxing · Mar 10 '23

/assign @LoveEachDay
Please take a look.

zhuwenxing · Mar 10 '23

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/2857/pipeline
log: artifacts-proxy-pod-failure-2857-server-logs (1).tar.gz

API: SYSTEM()
Time: 21:47:55 UTC 03/20/2023
Error: Marking http://proxy-pod-failure-2857-minio-3.proxy-pod-failure-2857-minio-svc.chaos-testing.svc.cluster.local:9000/minio/storage/export/v43 temporary offline; caused by Post "http://proxy-pod-failure-2857-minio-3.proxy-pod-failure-2857-minio-svc.chaos-testing.svc.cluster.local:9000/minio/storage/export/v43/readall?disk-id=&file-path=format.json&volume=.minio.sys": lookup proxy-pod-failure-2857-minio-3.proxy-pod-failure-2857-minio-svc.chaos-testing.svc.cluster.local on 10.101.0.10:53: no such host (*fmt.wrapError)
       6: internal/rest/client.go:151:rest.(*Client).Call()
       5: cmd/storage-rest-client.go:152:cmd.(*storageRESTClient).call()
       4: cmd/storage-rest-client.go:520:cmd.(*storageRESTClient).ReadAll()
       3: cmd/format-erasure.go:387:cmd.loadFormatErasure()
       2: cmd/format-erasure.go:326:cmd.loadFormatErasureAll.func1()
       1: internal/sync/errgroup/errgroup.go:123:errgroup.(*Group).Go.func1()
Waiting for all other servers to be online to format the disks (elapses 2m59s)


API: SYSTEM()
Time: 21:47:55 UTC 03/20/2023
Error: Marking http://proxy-pod-failure-2857-minio-3.proxy-pod-failure-2857-minio-svc.chaos-testing.svc.cluster.local:9000/minio/storage/export/v43 temporary offline; caused by Post "http://proxy-pod-failure-2857-minio-3.proxy-pod-failure-2857-minio-svc.chaos-testing.svc.cluster.local:9000/minio/storage/export/v43/readall?disk-id=&file-path=format.json&volume=.minio.sys": lookup proxy-pod-failure-2857-minio-3.proxy-pod-failure-2857-minio-svc.chaos-testing.svc.cluster.local on 10.101.0.10:53: no such host (*fmt.wrapError)
       6: internal/rest/client.go:151:rest.(*Client).Call()
       5: cmd/storage-rest-client.go:152:cmd.(*storageRESTClient).call()
       4: cmd/storage-rest-client.go:520:cmd.(*storageRESTClient).ReadAll()
       3: cmd/format-erasure.go:387:cmd.loadFormatErasure()
       2: cmd/format-erasure.go:326:cmd.loadFormatErasureAll.func1()
       1: internal/sync/errgroup/errgroup.go:123:errgroup.(*Group).Go.func1()
Waiting for all other servers to be online to format the disks (elapses 2m59s)

zhuwenxing · Mar 21 '23

In case 2857, MinIO started successfully after 47:55 according to the logs, but datacoord had already exhausted its retries at 46:43, which is why datacoord failed to start.
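
If that diagnosis holds, the startup dependency check simply needs to outlive MinIO's formatting wait: datacoord gave up at 46:43 while MinIO only became healthy after 47:55, so the retry window was roughly a minute too short. The sketch below shows the general retry-with-wait pattern; the attempt count, interval, and function names are made-up illustrations, not Milvus's actual defaults or its internal retry package.

package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// retryWithWait keeps calling op until it succeeds or attempts run out.
// With 60 attempts and a 5s interval the window is about 5 minutes,
// comfortably longer than the ~3 minutes MinIO spent waiting for its peers
// in this run.
func retryWithWait(attempts int, interval time.Duration, op func() error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = op(); lastErr == nil {
			return nil
		}
		log.Printf("attempt %d/%d failed: %v", i+1, attempts, lastErr)
		time.Sleep(interval)
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, lastErr)
}

func main() {
	// connectObjectStorage stands in for a component's startup dependency
	// check against MinIO; here it always fails, to show the retry path.
	connectObjectStorage := func() error { return errors.New("minio not ready yet") }
	if err := retryWithWait(60, 5*time.Second, connectObjectStorage); err != nil {
		log.Fatal(err)
	}
}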

locustbaby · Mar 22 '23

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] · Apr 21 '23