milvus/milvus
[Bug]: Flush performance degrades for collections created during chaos after the datacoord pod recovered from a pod kill
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: master-20220321-2078b24d
- Deployment mode(standalone or cluster):cluster
- SDK version(e.g. pymilvus v2.0.0rc2):2.0.2.dev5
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
Get collection entities cost 15.4309 seconds
check collection CreateChecker__Yfx24DUW
collection exists
Create collection...
Insert 3000 vectors cost 0.5203 seconds
Get collection entities...
3000
Get collection entities cost 15.4309 seconds
Expected Behavior
Get collection entities cost 3.7377 seconds
check collection Checker__m9FOw5zU
collection exists
Create collection...
Insert 3000 vectors cost 0.4555 seconds
Get collection entities...
8200
Get collection entities cost 3.7377 seconds
Steps To Reproduce
see https://github.com/milvus-io/milvus/runs/5632714245?check_suite_focus=true
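For reference, the slow step in the log above corresponds to a minimal pymilvus flow like the sketch below (assumptions only: a reachable Milvus at 127.0.0.1:19530, a hypothetical collection name, and a 128-dim schema; this is not the actual chaos checker code):

```python
import random
import time

from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    connections,
)

DIM = 128  # assumed vector dimension; the checker's real schema is not shown in the log

# Assumed local endpoint; the chaos tests run against an in-cluster Milvus.
connections.connect(host="127.0.0.1", port="19530")

schema = CollectionSchema([
    FieldSchema("pk", DataType.INT64, is_primary=True),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=DIM),
])
collection = Collection("CreateChecker__repro", schema)  # hypothetical name

# Mirror "Insert 3000 vectors cost ..." from the log.
rows = 3000
entities = [
    list(range(rows)),
    [[random.random() for _ in range(DIM)] for _ in range(rows)],
]
t0 = time.time()
collection.insert(entities)
print(f"Insert {rows} vectors cost {time.time() - t0:.4f} seconds")

# "Get collection entities" -- num_entities flushes the collection before
# returning the row count, so a slow flush shows up in this timing.
t0 = time.time()
print(collection.num_entities)
print(f"Get collection entities cost {time.time() - t0:.4f} seconds")
```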
Anything else?
failed job: https://github.com/milvus-io/milvus/runs/5632714245?check_suite_focus=true logs: https://github.com/milvus-io/milvus/suites/5742511778/artifacts/190470457
@zhuwenxing does this performance degradation persist? I am asking because I want to make sure it is not because of the first time it flushes.
Not sure, more tests are needed. These collections were all created during the chaos and are empty. But why does it matter if it was the first time it flushes?
https://github.com/milvus-io/milvus/runs/5863529511?check_suite_focus=true
This issue still exists. From the observation, the flush performance degradation mainly happened in collections with the prefix CreateChecker. Collections with the prefix CreateChecker were created during the chaos; they were empty collections and had never been flushed before.
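One way to separate a one-time first-flush cost from a persistent degradation is to flush the same collection repeatedly and compare the timings. A minimal sketch (the collection name is taken from the log above; the endpoint is an assumption):

```python
import time

from pymilvus import Collection, connections

connections.connect(host="127.0.0.1", port="19530")  # assumed endpoint

# Name taken from the "Current Behavior" log; any CreateChecker__* collection works.
collection = Collection("CreateChecker__Yfx24DUW")

for i in range(3):
    t0 = time.time()
    # flush() is available in pymilvus >= 2.1; on 2.0.x, reading
    # collection.num_entities triggers the flush instead.
    collection.flush()
    print(f"flush #{i + 1} cost {time.time() - t0:.4f} seconds")
```

If only the first call is slow, the cost is a one-time first-flush effect; if every call stays slow, the degradation persists.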
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
keep it open
/assign @soothing-rain since it is all about flush~
I tried with the latest version master-20220518-b9568177; this issue is still here.
pipeline: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/5103/pipeline/209
log: artifacts-datacoord-pod-kill-5103-server-logs.tar.gz
@soothing-rain any updates
Nope. It is not a P0 issue, should it be escalated?
It also happened for minio pod kill. The job failed due to a timeout, because flushing the collections whose prefix is CreateChecker cost a lot of time. https://github.com/zhuwenxing/milvus/runs/6885879889?check_suite_focus=true
/unassign /assign @wayblink
@wayblink shall we close this issue?
Let's keep it; we are still exploring it.
failed job: https://github.com/zhuwenxing/milvus/runs/6978232289?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/7017209278/artifacts/275695453
This issue also happens when Milvus uses kafka as MQ, and the performance degradation is more significant.
failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test-kafka/detail/chaos-test-kafka/75/pipeline
log: artifacts-datacoord-pod-kill-75-server-logs.tar.gz
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
/reopen
@zhuwenxing: Reopened this issue.
In response to this:
/reopen
@zhuwenxing any new test cases?
For datacoord pod kill, this issue still exists.
Kafka as MQ, chaos type: pod-kill
image tag: 2.1.0-20220913-3c3ba55
target pod: datacoord
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-for-release/detail/chaos-test-kafka-for-release/426/pipeline
log: artifacts-datacoord-pod-kill-426-server-logs.tar.gz
artifacts-datacoord-pod-kill-426-pytest-logs.tar.gz
For the collections with the prefix CreateChecker, the flush time is much longer than for other collections.
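To quantify that difference after datacoord recovers, one could time a flush on every collection and group the results by prefix. A sketch only, not the chaos test code; the endpoint is an assumption:

```python
import time

from pymilvus import Collection, connections, utility

connections.connect(host="127.0.0.1", port="19530")  # assumed endpoint

# Time a flush per collection and group by prefix to compare
# CreateChecker__* (created during chaos, never flushed) with the rest.
for name in utility.list_collections():
    t0 = time.time()
    Collection(name).flush()
    cost = time.time() - t0
    tag = "CreateChecker" if name.startswith("CreateChecker") else "other"
    print(f"[{tag}] {name}: flush cost {cost:.4f} seconds")
```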
Same for the Pulsar version.
Pulsar, chaos type: pod-kill
image tag: 2.1.0-20220913-3c3ba55
target pod: datacoord
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release/detail/chaos-test-for-release/540/pipeline
As far as I know, this issue has not been fixed, so it is not about new test cases. I reopened it just because it was closed by the stale bot; however, the issue itself is not fixed yet.
chaos type: pod-failure
image tag: 2.1.0-20220921-a0ab90ea
target pod: datacoord
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-for-release/detail/chaos-test-kafka-for-release/596/pipeline
log: artifacts-datacoord-pod-failure-596-pytest-logs.tar.gz artifacts-datacoord-pod-failure-596-server-logs.tar.gz
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
keep it open
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
@zhuwenxing is it still a valid issue?
It was not reproduced in master-20221219-856bceec.
https://github.com/zhuwenxing/milvus/actions/runs/3737035097/jobs/6341904578