
[Bug]: Flush performance degrades for collections created during chaos after the datacoord pod recovers from a pod kill

Open · zhuwenxing opened this issue 2 years ago · 22 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master-20220321-2078b24d
- Deployment mode (standalone or cluster): cluster
- SDK version (e.g. pymilvus v2.0.0rc2): 2.0.2.dev5
- OS (Ubuntu or CentOS):
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Get collection entities cost 15.4309 seconds

check collection CreateChecker__Yfx24DUW
collection exists

Create collection...

Insert 3000 vectors cost 0.5203 seconds

Get collection entities...
3000

Get collection entities cost 15.4309 seconds

Expected Behavior

Get collection entities cost 3.7377 seconds

check collection Checker__m9FOw5zU
collection exists

Create collection...
Insert 3000 vectors cost 0.4555 seconds

Get collection entities...
8200

Get collection entities cost 3.7377 seconds

Steps To Reproduce

see https://github.com/milvus-io/milvus/runs/5632714245?check_suite_focus=true
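
For reference, the checker in that job boils down to timing an insert and then the entity count on a collection created during the chaos. Below is a minimal pymilvus sketch of that measurement, assuming a local Milvus instance; the collection name, schema, and dimension are illustrative, not the actual chaos-test checker code.

```python
import random
import time

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections, utility,
)

connections.connect(host="127.0.0.1", port="19530")

dim = 128
name = "CreateChecker_demo"  # hypothetical name mimicking the CreateChecker prefix

print(f"check collection {name}")
if utility.has_collection(name):
    print("collection exists")

print("Create collection...")
schema = CollectionSchema([
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="vec", dtype=DataType.FLOAT_VECTOR, dim=dim),
])
collection = Collection(name, schema)  # gets the existing collection or creates it

# Insert 3000 vectors and report the cost, mirroring the log lines above.
entities = [
    list(range(3000)),                                             # primary keys
    [[random.random() for _ in range(dim)] for _ in range(3000)],  # vectors
]
t0 = time.time()
collection.insert(entities)
print(f"Insert 3000 vectors cost {time.time() - t0:.4f} seconds")

# "Get collection entities" times num_entities; in pymilvus 2.0.x this property
# flushes the collection before counting, so its latency is effectively the
# flush latency compared in Current vs Expected Behavior.
print("Get collection entities...")
t0 = time.time()
print(collection.num_entities)
print(f"Get collection entities cost {time.time() - t0:.4f} seconds")
```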

Anything else?

failed job: https://github.com/milvus-io/milvus/runs/5632714245?check_suite_focus=true logs: https://github.com/milvus-io/milvus/suites/5742511778/artifacts/190470457

zhuwenxing avatar Mar 22 '22 02:03 zhuwenxing

@zhuwenxing does this performance degrade persist? I am asking because I want to make sure it is not because of the first time it flushes

yanliang567 avatar Mar 22 '22 02:03 yanliang567

@zhuwenxing does this performance degrade persist? I am asking because I want to make sure it is not because of the first time it flushes

Not sure; more tests are needed. These collections were all created during the chaos and are empty. But why does it matter whether it is the first time they flush?

zhuwenxing avatar Mar 22 '22 06:03 zhuwenxing

https://github.com/milvus-io/milvus/runs/5863529511?check_suite_focus=true This issue still exists. From the observation, the flush performance degradation mainly happened in collections with the prefix CreateChecker. Those collections were created during the chaos, were empty, and had never been flushed before.

zhuwenxing avatar Apr 11 '22 09:04 zhuwenxing

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar May 11 '22 11:05 stale[bot]

keep it open

zhuwenxing avatar May 19 '22 00:05 zhuwenxing

/assign @soothing-rain since it is all about flush~

xiaofan-luan avatar May 19 '22 01:05 xiaofan-luan

I tried with the latest version master-20220518-b9568177, and this issue is still here. pipeline: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/5103/pipeline/209 log: artifacts-datacoord-pod-kill-5103-server-logs.tar.gz

zhuwenxing avatar May 19 '22 02:05 zhuwenxing

@soothing-rain any updates?

yanliang567 avatar Jun 02 '22 08:06 yanliang567

@soothing-rain any updates?

Nope. It is not a P0 issue; should it be escalated?

soothing-rain avatar Jun 02 '22 08:06 soothing-rain

It also happened for minio pod kill. This job failed due to a timeout, because flushing the collections whose prefix is CreateChecker cost a lot of time. https://github.com/zhuwenxing/milvus/runs/6885879889?check_suite_focus=true


zhuwenxing avatar Jun 15 '22 06:06 zhuwenxing

/unassign /assign @wayblink

soothing-rain avatar Jun 15 '22 07:06 soothing-rain

@wayblink shall we close this issue?

xiaofan-luan avatar Jun 20 '22 03:06 xiaofan-luan

@wayblink shall we close this issue?

Let's keep it; we are still exploring it.

wayblink avatar Jun 20 '22 07:06 wayblink

failed job: https://github.com/zhuwenxing/milvus/runs/6978232289?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/7017209278/artifacts/275695453

zhuwenxing avatar Jun 21 '22 06:06 zhuwenxing

This issue also happens when Milvus uses Kafka as the MQ, and the performance degradation is more significant.

failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test-kafka/detail/chaos-test-kafka/75/pipeline log: artifacts-datacoord-pod-kill-75-server-logs.tar.gz

zhuwenxing avatar Jun 22 '22 06:06 zhuwenxing

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Jul 22 '22 14:07 stale[bot]

/reopen

zhuwenxing avatar Sep 15 '22 06:09 zhuwenxing

@zhuwenxing: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sre-ci-robot avatar Sep 15 '22 06:09 sre-ci-robot

@zhuwenxing any new test cases?

wayblink avatar Sep 15 '22 06:09 wayblink

For datacoord pod kill, this issue still exists.

Kafka as MQ, chaos type: pod-kill, image tag: 2.1.0-20220913-3c3ba55, target pod: datacoord
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-for-release/detail/chaos-test-kafka-for-release/426/pipeline
log: artifacts-datacoord-pod-kill-426-server-logs.tar.gz

artifacts-datacoord-pod-kill-426-pytest-logs.tar.gz


For the collections with the prefix CreateChecker, the flush time is much longer than for other collections.

Same for the Pulsar version.
Pulsar, chaos type: pod-kill, image tag: 2.1.0-20220913-3c3ba55, target pod: datacoord
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release/detail/chaos-test-for-release/540/pipeline

zhuwenxing avatar Sep 15 '22 06:09 zhuwenxing

@zhuwenxing any new test cases?

As far as I know, this issue has not been fixed, so it is not about new test cases. I reopened it just because it was closed by the stale bot even though the issue is not fixed yet.

zhuwenxing avatar Sep 15 '22 06:09 zhuwenxing

chaos type: pod-failure, image tag: 2.1.0-20220921-a0ab90ea, target pod: datacoord
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-for-release/detail/chaos-test-kafka-for-release/596/pipeline
log: artifacts-datacoord-pod-failure-596-pytest-logs.tar.gz artifacts-datacoord-pod-failure-596-server-logs.tar.gz

zhuwenxing avatar Sep 22 '22 02:09 zhuwenxing

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Oct 22 '22 04:10 stale[bot]

keep it open

xiaofan-luan avatar Nov 02 '22 01:11 xiaofan-luan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Dec 02 '22 04:12 stale[bot]

@zhuwenxing is it still a valid issue?

yanliang567 avatar Dec 04 '22 09:12 yanliang567

It was not reproduced in master-20221219-856bceec: https://github.com/zhuwenxing/milvus/actions/runs/3737035097/jobs/6341904578

zhuwenxing avatar Dec 20 '22 03:12 zhuwenxing