
[Bug]: The p99 search latency is very high (more than one second), while the average search latency is only 40 milliseconds

linggong2011 opened this issue Jan 23 '23 · 17 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.2.0
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The p99 search latency is very high (more than one second), while the average search latency is only 40 milliseconds.

Expected Behavior

The p99 search latency is less than 1 second.

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

linggong2011 · Jan 23 '23 05:01
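
No reproduction steps were given, so here is a minimal client-side sketch, assuming pymilvus and a pre-built collection, of how the average-vs-p99 gap could be measured; the collection name, vector field, dimension, and search parameters are placeholders, not taken from this issue.

```python
# Hedged sketch: measure client-side avg vs. p99 search latency against an
# existing collection. All names/params below are illustrative assumptions.
import time
import numpy as np
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
coll = Collection("demo_collection")  # assumed to exist with an IVF_FLAT index
coll.load()

latencies_ms = []
for _ in range(1000):
    query = np.random.random((1, 128)).tolist()
    t0 = time.perf_counter()
    coll.search(
        data=query,
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"nprobe": 16}},
        limit=10,
    )
    latencies_ms.append((time.perf_counter() - t0) * 1000)

print(f"avg = {np.mean(latencies_ms):.1f} ms, p99 = {np.percentile(latencies_ms, 99):.1f} ms")
```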

Hi @linggong2011, thanks for reporting this problem. Could you provide the Milvus version you were using? And if there are any metrics (say, from Prometheus), it would be a great help if you could share them.

congqixia · Jan 23 '23 05:01
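
As a companion to the metrics request above, here is one hedged way to pull a server-side p99 out of Prometheus with a `histogram_quantile` query; the metric name `milvus_proxy_sq_latency_bucket` and the Prometheus address are assumptions and should be checked against the metrics your Milvus version actually exposes.

```python
# Hedged sketch: query Prometheus for the p99 of a Milvus latency histogram.
# The metric name below is an assumption; list the exposed metrics to confirm it.
import requests

PROMETHEUS = "http://prometheus:9090"  # placeholder address
promql = (
    "histogram_quantile(0.99, "
    "sum(rate(milvus_proxy_sq_latency_bucket[5m])) by (le))"
)

resp = requests.get(
    f"{PROMETHEUS}/api/v1/query", params={"query": promql}, timeout=10
)
resp.raise_for_status()
print(resp.json()["data"]["result"])
```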

The logs from the period of high P99 latency might help as well.

congqixia · Jan 23 '23 05:01

@congqixia Talking to @linggong2011 offline, he is using Milvus 2.2.0. Here are the logs: logs.tar(1).gz, bw_etcd_ALL.230123-055815.bak.gz

yanliang567 · Jan 23 '23 07:01

[3 attached screenshots]

yanliang567 · Jan 23 '23 07:01

/assign @liliu-z

yanliang567 · Jan 23 '23 07:01

[2 attached screenshots]

yanliang567 · Jan 23 '23 11:01

Looks like one of the pods has high latency. Since the pod is not a shard leader and the latency from segcore is small, we suspect the problem comes from queueing. We checked the task-scheduler-related metrics (Ready Read Task Length, Unsolved Read Task Length, Parallel Read Task Num, Estimate CPU Usage); all are normal for that pod.

liliu-z · Jan 23 '23 12:01

The bad pod recovered, but another pod went bad with high latency; it sounds like a transfer to me (screenshot: 20230123-210257). We checked all the other metrics and found no big discrepancy for this specific pod.

liliu-z · Jan 23 '23 13:01

Might it be because certain segments lack an index?

It seems that the queueing latency is very high.

xiaofan-luan · Jan 25 '23 02:01

The issue occurs at around 5:00 am every day, so we suspect it is related to some hardware resource contention or business traffic. Today the user raised the CPU ratio to 50, and the performance improved a lot. (screenshot: img_v2_4cf461b9-ede0-42cb-8a60-a4b29daf846g)

yanliang567 · Jan 27 '23 02:01
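
For reference, a sketch of the kind of change described above, assuming the "CPU ratio" refers to the queryNode.scheduler.cpuRatio knob in milvus.yaml (used to estimate read-task CPU usage); verify the exact key and its default against your deployment's config.

```yaml
# Hedged sketch of milvus.yaml (or the corresponding Helm values), assuming
# "CPU ratio" means queryNode.scheduler.cpuRatio.
queryNode:
  scheduler:
    cpuRatio: 50   # raised as described in the comment above
```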

After internal discussion, we believe the root cause is that the current nq compaction policy does not work well for IVF_FLAT. We agreed to improve the policy for different index types. @liliu-z will work on this improvement. /unassign

yanliang567 · Jan 30 '23 02:01

/unassign

liliu-z · Feb 01 '23 13:02

/assign @hhy3

We can kick off a standalone instance with IVF_FLAT to see if this is reproducible in-house.

liliu-z · Feb 02 '23 02:02
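
A rough sketch of such an in-house reproduction against a standalone deployment, assuming pymilvus; the collection name, dimension, row counts, and index parameters are illustrative choices, not prescribed in this thread.

```python
# Hedged sketch: build a collection with an IVF_FLAT index on a standalone
# Milvus to try to reproduce the latency pattern. Names/params are assumptions.
import numpy as np
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=128),
]
coll = Collection("repro_ivf_flat", CollectionSchema(fields))

# Insert random vectors in batches, then build the index and load.
for _ in range(10):
    coll.insert([np.random.random((10_000, 128)).tolist()])
coll.flush()
coll.create_index(
    "embedding",
    {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 1024}},
)
coll.load()
```

Searches against this collection could then be driven by the latency-measuring sketch earlier in the thread.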

/assign @liliu-z

hhy3 · Feb 08 '23 11:02

Checked offline. We noticed that some CPU throttling occurred in the pod with high P99. The user deploys their pod on a 64-vCPU host, which leads to a NUMA problem. We now highly suspect this is the root cause of the unpredictable P99.

liliu-z · Feb 08 '23 11:02
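
One quick way to corroborate the CPU-throttling observation from inside the querynode container is to read the cgroup CPU statistics; the sketch below assumes cgroup v1 paths (under cgroup v2 the file is /sys/fs/cgroup/cpu.stat with nr_throttled / throttled_usec counters).

```python
# Hedged sketch: print CPU throttling counters from cgroup v1 (path is an assumption).
with open("/sys/fs/cgroup/cpu/cpu.stat") as f:
    stats = dict(line.split() for line in f)

print("nr_periods:    ", stats.get("nr_periods"))
print("nr_throttled:  ", stats.get("nr_throttled"))
print("throttled_time:", stats.get("throttled_time"), "ns")
```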

Solved by changing the host to 16 vCPUs. OK to close.

liliu-z · Feb 13 '23 05:02

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] · Mar 17 '23 06:03

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] · Aug 02 '23 05:08

Closing as per the comments above.

yanliang567 · Aug 14 '23 02:08