
[Bug]: Flush performance degrades for collections created during chaos after the datacoord pod recovers from a pod kill

Open · zhuwenxing opened this issue 2 years ago · 22 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master-20220321-2078b24d
- Deployment mode (standalone or cluster): cluster
- SDK version (e.g. pymilvus v2.0.0rc2): 2.0.2.dev5
- OS (Ubuntu or CentOS):
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Get collection entities cost 15.4309 seconds

check collection CreateChecker__Yfx24DUW
collection exists

Create collection...

Insert 3000 vectors cost 0.5203 seconds

Get collection entities...
3000

Get collection entities cost 15.4309 seconds

Expected Behavior

Get collection entities cost 3.7377 seconds

check collection Checker__m9FOw5zU
collection exists

Create collection...
Insert 3000 vectors cost 0.4555 seconds

Get collection entities...
8200

Get collection entities cost 3.7377 seconds

Steps To Reproduce

see https://github.com/milvus-io/milvus/runs/5632714245?check_suite_focus=true
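
For reference, the checker in that job boils down to timing an insert and then the entity count on a collection created during the chaos. Below is a minimal pymilvus sketch of that measurement, assuming a local Milvus instance; the collection name, schema, and dimension are illustrative, not the actual chaos-test checker code.

```python
import random
import time

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections, utility,
)

connections.connect(host="127.0.0.1", port="19530")

dim = 128
name = "CreateChecker_demo"  # hypothetical name mimicking the CreateChecker prefix

print(f"check collection {name}")
if utility.has_collection(name):
    print("collection exists")

print("Create collection...")
schema = CollectionSchema([
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="vec", dtype=DataType.FLOAT_VECTOR, dim=dim),
])
collection = Collection(name, schema)  # gets the existing collection or creates it

# Insert 3000 vectors and report the cost, mirroring the log lines above.
entities = [
    list(range(3000)),                                             # primary keys
    [[random.random() for _ in range(dim)] for _ in range(3000)],  # vectors
]
t0 = time.time()
collection.insert(entities)
print(f"Insert 3000 vectors cost {time.time() - t0:.4f} seconds")

# "Get collection entities" times num_entities; in pymilvus 2.0.x this property
# flushes the collection before counting, so its latency is effectively the
# flush latency compared in Current vs Expected Behavior.
print("Get collection entities...")
t0 = time.time()
print(collection.num_entities)
print(f"Get collection entities cost {time.time() - t0:.4f} seconds")
```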

Anything else?

failed job: https://github.com/milvus-io/milvus/runs/5632714245?check_suite_focus=true logs: https://github.com/milvus-io/milvus/suites/5742511778/artifacts/190470457

zhuwenxing avatar Mar 22 '22 02:03 zhuwenxing

@zhuwenxing does this performance degrade persist? I am asking because I want to make sure it is not because of the first time it flushes

yanliang567 avatar Mar 22 '22 02:03 yanliang567

@zhuwenxing does this performance degrade persist? I am asking because I want to make sure it is not because of the first time it flushes

Not sure; more tests are needed. These collections were all created during the chaos and are empty. But why does it matter whether it is the first time they flush?

zhuwenxing avatar Mar 22 '22 06:03 zhuwenxing

https://github.com/milvus-io/milvus/runs/5863529511?check_suite_focus=true This issue still exists. From the observation, the flush performance degradation mainly happened in collections with the prefix CreateChecker. Those collections were created during the chaos, were empty, and had never been flushed before.

zhuwenxing avatar Apr 11 '22 09:04 zhuwenxing

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar May 11 '22 11:05 stale[bot]

keep it open

zhuwenxing avatar May 19 '22 00:05 zhuwenxing

/assign @soothing-rain since it is all about flush~

xiaofan-luan avatar May 19 '22 01:05 xiaofan-luan

I tried with the latest version master-20220518-b9568177, and this issue is still here. pipeline: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/5103/pipeline/209 log: artifacts-datacoord-pod-kill-5103-server-logs.tar.gz

zhuwenxing avatar May 19 '22 02:05 zhuwenxing

@soothing-rain any updates?

yanliang567 avatar Jun 02 '22 08:06 yanliang567

@soothing-rain any updates?

Nope. It is not a P0 issue; should it be escalated?

soothing-rain avatar Jun 02 '22 08:06 soothing-rain

It also happened for minio pod kill. This job failed due to a timeout, because flushing the collections whose prefix is CreateChecker cost a lot of time. https://github.com/zhuwenxing/milvus/runs/6885879889?check_suite_focus=true


zhuwenxing avatar Jun 15 '22 06:06 zhuwenxing

/unassign /assign @wayblink

soothing-rain avatar Jun 15 '22 07:06 soothing-rain

@wayblink shall we close this issue?

xiaofan-luan avatar Jun 20 '22 03:06 xiaofan-luan

@wayblink shall we close this issue?

Let's keep it; we are still exploring it.

wayblink avatar Jun 20 '22 07:06 wayblink

failed job: https://github.com/zhuwenxing/milvus/runs/6978232289?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/7017209278/artifacts/275695453

zhuwenxing avatar Jun 21 '22 06:06 zhuwenxing

This issue also happens when Milvus uses Kafka as the MQ, and the performance degradation is more significant.

failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test-kafka/detail/chaos-test-kafka/75/pipeline log: artifacts-datacoord-pod-kill-75-server-logs.tar.gz

zhuwenxing avatar Jun 22 '22 06:06 zhuwenxing

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Jul 22 '22 14:07 stale[bot]

/reopen

zhuwenxing avatar Sep 15 '22 06:09 zhuwenxing

@zhuwenxing: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sre-ci-robot avatar Sep 15 '22 06:09 sre-ci-robot

@zhuwenxing any new test cases?

wayblink avatar Sep 15 '22 06:09 wayblink

For datacoord pod kill, this issue still exists.

Kafka as MQ, chaos type: pod-kill, image tag: 2.1.0-20220913-3c3ba55, target pod: datacoord
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-for-release/detail/chaos-test-kafka-for-release/426/pipeline
log: artifacts-datacoord-pod-kill-426-server-logs.tar.gz

artifacts-datacoord-pod-kill-426-pytest-logs.tar.gz


For the collections with the prefix CreateChecker, the flush time is much longer than for other collections.

Same for the Pulsar version.
Pulsar, chaos type: pod-kill, image tag: 2.1.0-20220913-3c3ba55, target pod: datacoord
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release/detail/chaos-test-for-release/540/pipeline

zhuwenxing avatar Sep 15 '22 06:09 zhuwenxing

@zhuwenxing any new test cases?

As far as I know, this issue has not been fixed, so it is not about new test cases. I reopened it just because it was closed by the stale bot even though the issue is not fixed yet.

zhuwenxing avatar Sep 15 '22 06:09 zhuwenxing

chaos type: pod-failure, image tag: 2.1.0-20220921-a0ab90ea, target pod: datacoord
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-for-release/detail/chaos-test-kafka-for-release/596/pipeline
log: artifacts-datacoord-pod-failure-596-pytest-logs.tar.gz artifacts-datacoord-pod-failure-596-server-logs.tar.gz

zhuwenxing avatar Sep 22 '22 02:09 zhuwenxing

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Oct 22 '22 04:10 stale[bot]

keep it open

xiaofan-luan avatar Nov 02 '22 01:11 xiaofan-luan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Dec 02 '22 04:12 stale[bot]

@zhuwenxing is it still a valid issue?

yanliang567 avatar Dec 04 '22 09:12 yanliang567

It was not reproduced in master-20221219-856bceec: https://github.com/zhuwenxing/milvus/actions/runs/3737035097/jobs/6341904578

zhuwenxing avatar Dec 20 '22 03:12 zhuwenxing