[Bug]: The p99 search latency is very high (more than one second), while the average search latency is only 40 milliseconds
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: 2.2.0
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
The p99 search latency is very high (more than one second), while the average search latency is only 40 milliseconds.
Expected Behavior
The p99 search latency is less than 1 second.
Steps To Reproduce
No response
Milvus Log
No response
Anything else?
No response
Hi @linggong2011, thanks for reporting this problem. Could you provide the Milvus version you were using? If there are any metrics (say, from Prometheus), sharing them would be a great help.
The logs from the period of high P99 latency might help as well.
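For reference, here is a minimal sketch of pulling p99 and average latency from a Prometheus histogram. The Prometheus address and the metric name are placeholders, not values confirmed in this thread; substitute whatever your deployment actually exposes.

```python
# Sketch: comparing p99 vs. average search latency from a Prometheus histogram.
import requests

PROM_URL = "http://localhost:9090"   # assumed Prometheus address
METRIC = "milvus_proxy_sq_latency"   # placeholder histogram name, replace as needed

def prom_query(expr: str) -> float:
    """Run an instant PromQL query and return the first scalar result."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

# p99 from the histogram buckets over the last 5 minutes.
p99 = prom_query(f"histogram_quantile(0.99, sum(rate({METRIC}_bucket[5m])) by (le))")
# Average = sum of observed latencies / number of observations.
avg = prom_query(f"rate({METRIC}_sum[5m]) / rate({METRIC}_count[5m])")

print(f"p99={p99:.3f}  avg={avg:.3f}  (same unit as the metric)")
```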
@congqixia talking to @linggong2011 offline, he is using milvus 2.2.0. here is the logs logs.tar(1).gz bw_etcd_ALL.230123-055815.bak.gz
/assign @liliu-z
Looks like one of the pods has high latency. Since the pod is not a shard leader and the latency from segcore is small, we suspect the problem is queueing.
We checked the task-scheduler-related metrics (Ready Read Task Length, Unsolved Read Task Length, Parallel Read Task Num, Estimate CPU Usage); all are normal for that pod.
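A small sketch of that reasoning with hypothetical samples: subtracting segcore execution time from end-to-end latency leaves the time spent queueing and scheduling.

```python
import numpy as np

# Hypothetical per-request samples (ms); in practice these would come from logs
# or metrics for the affected pod.
end_to_end_ms = np.array([40, 42, 38, 41, 39, 1200, 43, 40])
segcore_ms    = np.array([30, 31, 29, 30, 30,   35, 31, 30])

# Whatever is left after execution is queueing/scheduling overhead.
queueing_ms = end_to_end_ms - segcore_ms
print("avg queueing:", queueing_ms.mean(), "ms")
print("p99 queueing:", np.percentile(queueing_ms, 99), "ms")
```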
The bad pod recovered, but another pod went bad with high latency; it looks like the problem transfers from pod to pod.
Checked all other metrics; no big discrepancy was found for this specific pod.
Might it be because certain segments lack an index?
It seems that the queueing latency is very high.
The issue occurs every day around 5:00 AM, so we suspect it is caused by some hardware resource contention or business traffic pattern.
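One way to confirm the time-of-day pattern (a generic sketch with hypothetical samples, not data from this deployment) is to bucket latency observations by hour:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (timestamp, latency_ms) pairs; real ones would come from logs.
samples = [("2023-01-23T05:01:10", 1300.0), ("2023-01-23T12:00:05", 41.0)]

by_hour = defaultdict(list)
for ts, latency_ms in samples:
    by_hour[datetime.fromisoformat(ts).hour].append(latency_ms)

# Print per-hour sample count and worst latency to spot a recurring spike.
for hour in sorted(by_hour):
    vals = by_hour[hour]
    print(f"{hour:02d}:00  n={len(vals)}  max={max(vals):.1f} ms")
```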
Today the user updated the CPU ratio to 50, and the performance improved a lot.
After internal discussion, we believe the root cause is that the current nq compaction policy does not work well for IVF_FLAT. We agree to improve the policy for different index types. @liliu-z will work on this improvement.
/unassign
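To illustrate the idea (a purely hypothetical sketch, not the actual Milvus scheduler code): an nq merge policy could cap the combined nq per index type, so that brute-force-heavy indexes such as IVF_FLAT get a smaller merge window.

```python
# Hypothetical illustration of an index-type-aware nq merge cap; the numbers and
# names below are assumptions, not Milvus defaults.
MAX_MERGED_NQ = {
    "IVF_FLAT": 64,   # assumed: keep merged batches small for brute-force-heavy search
    "HNSW": 256,      # assumed: graph indexes tolerate larger merged batches
}
DEFAULT_CAP = 128

def can_merge(nq_a: int, nq_b: int, index_type: str) -> bool:
    """Merge two queued search tasks only if their combined nq stays under the
    cap configured for this index type."""
    return nq_a + nq_b <= MAX_MERGED_NQ.get(index_type, DEFAULT_CAP)

# Two nq=48 searches: not merged on IVF_FLAT (96 > 64), merged on HNSW (96 <= 256).
print(can_merge(48, 48, "IVF_FLAT"), can_merge(48, 48, "HNSW"))
```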
/unassign
/assign @hhy3
We can kick off a standalone with IVFFlat to see if this is reproducible in-house.
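A sketch of such a reproduction with pymilvus against a local standalone instance; the collection name, dataset size, and search parameters below are arbitrary choices, not taken from the user's workload.

```python
import time
import numpy as np
from pymilvus import (
    connections, utility, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="localhost", port="19530")  # assumed standalone address

dim, name = 128, "p99_repro"
if utility.has_collection(name):
    utility.drop_collection(name)

schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=dim),
])
coll = Collection(name, schema)

# Insert some random vectors, build IVF_FLAT, and load the collection.
for _ in range(10):
    coll.insert([np.random.rand(10000, dim).tolist()])
coll.flush()
coll.create_index("vec", {"index_type": "IVF_FLAT", "metric_type": "L2",
                          "params": {"nlist": 1024}})
coll.load()

# Measure per-request latency and compare the average with the tail.
latencies = []
for _ in range(200):
    q = np.random.rand(1, dim).tolist()
    start = time.perf_counter()
    coll.search(q, "vec", {"metric_type": "L2", "params": {"nprobe": 16}}, limit=10)
    latencies.append((time.perf_counter() - start) * 1000)

print(f"avg={np.mean(latencies):.1f} ms  p99={np.percentile(latencies, 99):.1f} ms")
```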
/assign @liliu-z
Checked offline. We noticed that some CPU throttling occurred in the pod with high P99. The user deploys their pod on a 64-vCPU host, which leads to NUMA problems. We now highly suspect this is the root cause of the unpredictable P99.
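For reference, throttling can be confirmed from inside the pod by reading the cgroup CPU statistics; this is a generic sketch (the path depends on whether the node uses cgroup v1 or v2), not output captured from this cluster.

```python
from pathlib import Path

# cgroup v1 and v2 expose CPU throttling counters in slightly different files.
CANDIDATES = [
    Path("/sys/fs/cgroup/cpu/cpu.stat"),  # cgroup v1: nr_throttled, throttled_time (ns)
    Path("/sys/fs/cgroup/cpu.stat"),      # cgroup v2: nr_throttled, throttled_usec
]

for path in CANDIDATES:
    if path.exists():
        stats = dict(line.split() for line in path.read_text().splitlines())
        for key in ("nr_periods", "nr_throttled", "throttled_time", "throttled_usec"):
            if key in stats:
                print(f"{key} = {stats[key]}")
        break
else:
    print("no cgroup cpu.stat found")
```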
Solved by changing the host to 16 vCPUs. OK to close.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
Closing as per the comments above.