
[Bug]: The p99 search latency is very high (more than one second), while the average search latency is only 40 milliseconds

linggong2011 opened this issue Jan 23 '23 · 17 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.2.0
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The p99 search latency is very high (more than one second), while the average search latency is only 40 milliseconds.

Expected Behavior

The p99 search latency is less than 1 second.

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

linggong2011 · Jan 23 '23 05:01
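
No reproduction steps were given, so here is a minimal client-side sketch, assuming pymilvus and a pre-built collection, of how the average-vs-p99 gap could be measured; the collection name, vector field, dimension, and search parameters are placeholders, not taken from this issue.

```python
# Hedged sketch: measure client-side avg vs. p99 search latency against an
# existing collection. All names/params below are illustrative assumptions.
import time
import numpy as np
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
coll = Collection("demo_collection")  # assumed to exist with an IVF_FLAT index
coll.load()

latencies_ms = []
for _ in range(1000):
    query = np.random.random((1, 128)).tolist()
    t0 = time.perf_counter()
    coll.search(
        data=query,
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"nprobe": 16}},
        limit=10,
    )
    latencies_ms.append((time.perf_counter() - t0) * 1000)

print(f"avg = {np.mean(latencies_ms):.1f} ms, p99 = {np.percentile(latencies_ms, 99):.1f} ms")
```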

Hi @linggong2011, thanks for reporting this problem. Could you provide the Milvus version you were using? And if there are any metrics (say, from Prometheus), it would be a great help if you could share them.

congqixia · Jan 23 '23 05:01
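
As a companion to the metrics request above, here is one hedged way to pull a server-side p99 out of Prometheus with a `histogram_quantile` query; the metric name `milvus_proxy_sq_latency_bucket` and the Prometheus address are assumptions and should be checked against the metrics your Milvus version actually exposes.

```python
# Hedged sketch: query Prometheus for the p99 of a Milvus latency histogram.
# The metric name below is an assumption; list the exposed metrics to confirm it.
import requests

PROMETHEUS = "http://prometheus:9090"  # placeholder address
promql = (
    "histogram_quantile(0.99, "
    "sum(rate(milvus_proxy_sq_latency_bucket[5m])) by (le))"
)

resp = requests.get(
    f"{PROMETHEUS}/api/v1/query", params={"query": promql}, timeout=10
)
resp.raise_for_status()
print(resp.json()["data"]["result"])
```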

The logs from the period of high P99 latency might help as well.

congqixia · Jan 23 '23 05:01

@congqixia Talking to @linggong2011 offline, he is using Milvus 2.2.0. Here are the logs: logs.tar(1).gz, bw_etcd_ALL.230123-055815.bak.gz

yanliang567 · Jan 23 '23 07:01

[3 attached screenshots]

yanliang567 · Jan 23 '23 07:01

/assign @liliu-z

yanliang567 · Jan 23 '23 07:01

[2 attached screenshots]

yanliang567 · Jan 23 '23 11:01

Looks like one of the pods has high latency. Since the pod is not a shard leader and the latency from segcore is small, we suspect the problem comes from queueing. We checked the task-scheduler-related metrics (Ready Read Task Length, Unsolved Read Task Length, Parallel Read Task Num, Estimate CPU Usage); all are normal for that pod.

liliu-z · Jan 23 '23 12:01

The bad pod recovered, but another pod went bad with high latency; it sounds like a transfer to me (screenshot: 20230123-210257). We checked all the other metrics and found no big discrepancy for this specific pod.

liliu-z · Jan 23 '23 13:01

Might it be because certain segments lack an index?

It seems that the queueing latency is very high.

xiaofan-luan · Jan 25 '23 02:01

The issue occurs at around 5:00 am every day, so we suspect it is related to some hardware resource contention or business traffic. Today the user raised the CPU ratio to 50, and the performance improved a lot. (screenshot: img_v2_4cf461b9-ede0-42cb-8a60-a4b29daf846g)

yanliang567 · Jan 27 '23 02:01
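
For reference, a sketch of the kind of change described above, assuming the "CPU ratio" refers to the queryNode.scheduler.cpuRatio knob in milvus.yaml (used to estimate read-task CPU usage); verify the exact key and its default against your deployment's config.

```yaml
# Hedged sketch of milvus.yaml (or the corresponding Helm values), assuming
# "CPU ratio" means queryNode.scheduler.cpuRatio.
queryNode:
  scheduler:
    cpuRatio: 50   # raised as described in the comment above
```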

After internal discussion, we believe the root cause is that the current nq compaction policy does not work well for IVF_FLAT. We agreed to improve the policy for different index types. @liliu-z will work on this improvement. /unassign

yanliang567 · Jan 30 '23 02:01

/unassign

liliu-z · Feb 01 '23 13:02

/assign @hhy3

We can kick off a standalone instance with IVF_FLAT to see if this is reproducible in-house.

liliu-z · Feb 02 '23 02:02
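
A rough sketch of such an in-house reproduction against a standalone deployment, assuming pymilvus; the collection name, dimension, row counts, and index parameters are illustrative choices, not prescribed in this thread.

```python
# Hedged sketch: build a collection with an IVF_FLAT index on a standalone
# Milvus to try to reproduce the latency pattern. Names/params are assumptions.
import numpy as np
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=128),
]
coll = Collection("repro_ivf_flat", CollectionSchema(fields))

# Insert random vectors in batches, then build the index and load.
for _ in range(10):
    coll.insert([np.random.random((10_000, 128)).tolist()])
coll.flush()
coll.create_index(
    "embedding",
    {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 1024}},
)
coll.load()
```

Searches against this collection could then be driven by the latency-measuring sketch earlier in the thread.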

/assign @liliu-z

hhy3 · Feb 08 '23 11:02

Checked offline. We noticed that some CPU throttling occurred in the pod with high P99. The user deploys their pod on a 64-vCPU host, which leads to a NUMA problem. We now highly suspect this is the root cause of the unpredictable P99.

liliu-z · Feb 08 '23 11:02
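
One quick way to corroborate the CPU-throttling observation from inside the querynode container is to read the cgroup CPU statistics; the sketch below assumes cgroup v1 paths (under cgroup v2 the file is /sys/fs/cgroup/cpu.stat with nr_throttled / throttled_usec counters).

```python
# Hedged sketch: print CPU throttling counters from cgroup v1 (path is an assumption).
with open("/sys/fs/cgroup/cpu/cpu.stat") as f:
    stats = dict(line.split() for line in f)

print("nr_periods:    ", stats.get("nr_periods"))
print("nr_throttled:  ", stats.get("nr_throttled"))
print("throttled_time:", stats.get("throttled_time"), "ns")
```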

Solved by changing the host to 16 vCPUs. OK to close.

liliu-z · Feb 13 '23 05:02

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] · Mar 17 '23 06:03

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] · Aug 02 '23 05:08

Closing as per the comments above.

yanliang567 · Aug 14 '23 02:08