
search timeout for a while

Open dzqoo opened this issue 1 year ago • 4 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.2.2
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS): CentOS
- CPU/Memory: 6C / 32 GB
- GPU:
- Others:

Current Behavior

Search requests time out for about 3 minutes when my cluster reaches its business peak period.

  1. Here is the Search Latency dashboard: (screenshot)
  2. Here are parts of the querynode metrics; the Search Segment Latency looks quite good: (screenshot)
  3. Here is the Search Group NQ dashboard, which spikes at the same time: (screenshot)

This cluster is my online environment, so I hope to get a solution and the root cause as soon as possible. Thanks a lot~
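For context, the searches are issued with pymilvus roughly like this (a simplified sketch; the endpoint, collection name, field name, and parameters below are placeholders, not the real ones):

```python
# Simplified sketch of the kind of search we run (placeholder names and params).
from pymilvus import connections, Collection

connections.connect(alias="default", host="milvus-proxy-host", port="19530")  # placeholder endpoint

collection = Collection("my_collection")  # placeholder collection name
collection.load()

results = collection.search(
    data=[[0.1] * 128],                                     # placeholder query vectors, dim 128
    anns_field="embedding",                                 # placeholder vector field name
    param={"metric_type": "L2", "params": {"nprobe": 16}},  # placeholder search params
    limit=10,
    timeout=180,                                            # explicit client-side timeout, in seconds
)
print(results)
```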

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

Uploading milvus.log.tar.gz…

Anything else?

No response

dzqoo avatar May 12 '23 04:05 dzqoo

  1. Here are parts of the runtime metrics: (screenshots)

dzqoo avatar May 12 '23 04:05 dzqoo

  1. Here are parts of the runtime metrics: (screenshots)

If you look at your metrics, it seems the queue time is very large, which triggered the search merge logic.

Did you have a chance to try 2.2.8? I remember we have some fixes to the search merge logic.
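Roughly speaking, when requests pile up in the queue, the querynode merges them into larger grouped searches so each segment is scanned once per group instead of once per request. Here is a minimal illustrative sketch of the idea only (the real querynode code is Go inside Milvus; the names and the group-size cap below are made up):

```python
# Illustrative sketch of queue-time-driven request grouping.
# NOT the actual Milvus querynode code; names and limits are invented.
from dataclasses import dataclass
from typing import List

@dataclass
class SearchRequest:
    nq: int               # number of query vectors in this request
    vectors: List[list]   # the query vectors themselves

MAX_GROUP_NQ = 1024       # hypothetical cap on the merged group size

def group_pending_requests(queue: List[SearchRequest]) -> List[List[SearchRequest]]:
    """Merge queued requests (same collection and search params assumed) into groups.
    The longer requests wait in the queue, the more of them end up in one group,
    so the grouped NQ, and the cost of the final reduce, grows during peaks."""
    groups, current, current_nq = [], [], 0
    for req in queue:
        if current and current_nq + req.nq > MAX_GROUP_NQ:
            groups.append(current)
            current, current_nq = [], 0
        current.append(req)
        current_nq += req.nq
    if current:
        groups.append(current)
    return groups
```

A long queue therefore means larger groups, a larger grouped NQ, and a heavier reduce step, which would line up with the Search Group NQ spikes in your dashboard.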

xiaofan-luan avatar May 12 '23 06:05 xiaofan-luan

  1. Here are parts of the runtime metrics: (screenshots)

If you look at your metrics, it seems the queue time is very large, which triggered the search merge logic.

Did you have a chance to try 2.2.8? I remember we have some fixes to the search merge logic.

Can I know which specific merge logic is causing it?

dzqoo avatar May 12 '23 06:05 dzqoo

/assign @liliu-z

yanliang567 avatar May 12 '23 06:05 yanliang567

I encountered this problem again today .><. Searches time out for 2 minutes, and at the same time the "search reduce latency" metric is quite high, as you can see in the following screenshot. I would like to know whether this problem has been resolved in the latest version (2.2.8 or 2.2.9). Looking forward to your reply~ (screenshot)

dzqoo avatar Jun 06 '23 14:06 dzqoo

I found that at the same time, the goroutine count of this querynode is pretty high, as you can see: (screenshot)

dzqoo avatar Jun 30 '23 07:06 dzqoo

I encountered this issue again after upgrading to version 2.2.9... (screenshot)

dzqoo avatar Jul 10 '23 02:07 dzqoo

Any insertion or delete requests at that time? Could you please share the CPU usage of the querynodes?
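If it is hard to read off the Grafana panels, a rough sketch like the following can pull per-pod CPU directly from Prometheus (the Prometheus URL and the pod-name pattern are assumptions that depend on how your cluster is deployed):

```python
# Rough sketch: query Prometheus for per-pod querynode CPU usage.
# The Prometheus endpoint and the pod-name regex are assumptions; adjust to your deployment.
import requests

PROMETHEUS = "http://prometheus.example:9090"  # placeholder Prometheus endpoint
QUERY = 'sum by (pod) (rate(container_cpu_usage_seconds_total{pod=~"milvus-querynode.*"}[5m]))'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for item in resp.json()["data"]["result"]:
    print(item["metric"].get("pod"), item["value"][1])
```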

yanliang567 avatar Jul 10 '23 02:07 yanliang567

I have no insertion or delete requests at that time. The CPU usage is shown in the attached screenshots. (screenshots)

dzqoo avatar Jul 10 '23 02:07 dzqoo

Here is the "Search Group NQ" dashboard at the same time: (screenshot)

dzqoo avatar Jul 10 '23 07:07 dzqoo

Seems that the NQ group size increased during the business peak period. Is this a known issue that we are trying to fix? @liliu-z

yanliang567 avatar Jul 10 '23 08:07 yanliang567

Hi @dzqoo, it looks like there is no QPS bump during this time, and from the info we have right now I don't have many ideas. Can you screenshot every graph that looks abnormal during that time? Also, can you check whether there were any growing segments at that time? Thanks
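To check for growing segments, something like this pymilvus sketch should work (the connection details and collection name are placeholders, and the field names follow the QuerySegmentInfo response, so they may differ slightly across client versions):

```python
# Sketch: list the query segments and their states for one collection,
# to see whether any growing segments are being searched.
from collections import Counter
from pymilvus import connections, utility

connections.connect(host="milvus-proxy-host", port="19530")  # placeholder endpoint

segments = utility.get_query_segment_info("my_collection")   # placeholder collection name
print("segments by state:", dict(Counter(str(seg.state) for seg in segments)))
for seg in segments:
    print(seg.segmentID, seg.state, seg.num_rows)
```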

liliu-z avatar Jul 11 '23 03:07 liliu-z

More info is needed, especially for the querynodes. Also, in the resource usage graphs you provided (CPU, memory), I didn't see the querynodes included. Can you provide some info about this? Thanks

liliu-z avatar Jul 11 '23 03:07 liliu-z

(screenshots of the full set of metrics dashboards)

dzqoo avatar Jul 11 '23 08:07 dzqoo

(screenshots of the full set of metrics dashboards)

These are all of the metrics. FYI~

dzqoo avatar Jul 11 '23 08:07 dzqoo

@dzqoo Can I get more info, like:

  1. How much data do you have, and at what dimension?
  2. How many querynodes do you have, and what type are they?
  3. How many collections do you have?
  4. How many segments are there for each collection and each querynode? On the Grafana page all the data is stacked together, which makes it hard to tell (a rough tallying sketch follows below). Thanks
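For item 4, a quick tally straight from the API may be clearer than the stacked Grafana panel. A rough sketch (connection details are placeholders; `nodeIds` follows the 2.2.x QuerySegmentInfo response, older clients expose a single `nodeID`):

```python
# Sketch: count loaded segments per (collection, querynode) via pymilvus.
from collections import Counter
from pymilvus import connections, utility

connections.connect(host="milvus-proxy-host", port="19530")  # placeholder endpoint

tally = Counter()
for name in utility.list_collections():
    try:
        segments = utility.get_query_segment_info(name)
    except Exception:
        continue  # skip collections that are not loaded
    for seg in segments:
        for node_id in seg.nodeIds:  # querynode(s) serving this segment
            tally[(name, node_id)] += 1

for (collection, node_id), count in sorted(tally.items()):
    print(f"{collection}\tquerynode {node_id}\t{count} segments")
```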

liliu-z avatar Jul 12 '23 06:07 liliu-z

@dzqoo Can I get more info, like:

  1. How much data do you have, and at what dimension?
  2. How many querynodes do you have, and what type are they?
  3. How many collections do you have?
  4. How many segments are there for each collection and each querynode? On the Grafana page all the data is stacked together, which makes it hard to tell. Thanks

  1. We have ~1 billion vectors across about 10 collections, with dimensions ranging from 128 to 768;
  2. We have 15 querynodes, each with 6 CPU cores and 32 GB of memory;
  3. The segments on each querynode are shown in the screenshot: (screenshot) @liliu-z

dzqoo avatar Jul 12 '23 07:07 dzqoo

@dzqoo Can I have the segment distribution of the specific collection you queried? Thanks! Also, what dim is that collection, and how many rows does it have?

liliu-z avatar Jul 13 '23 09:07 liliu-z

@dzqoo Can I have the segment distribution of the specific collection you queried? Thanks! Also, what dim is that collection, and how many rows does it have?

Searches on all of the collections have timed out... @liliu-z

dzqoo avatar Jul 14 '23 02:07 dzqoo

Could you try the latest version? I fixed many performance issues in the last 6 months. We'd like to set up a meeting if the newest version is still not stable, but we probably don't want to spend time investigating stale issues.

xiaofan-luan avatar Jul 15 '23 09:07 xiaofan-luan

2.2.11 should be at least 50% faster than 2.2.2, and we have solved many balancing and stability issues.

xiaofan-luan avatar Jul 15 '23 09:07 xiaofan-luan

Thank you for your reply~ And when will version 2.3 come out? I want to go directly to version 2.3.

dzqoo avatar Jul 17 '23 02:07 dzqoo

It will be released next week ~

xiaofan-luan avatar Jul 17 '23 09:07 xiaofan-luan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Sep 04 '23 06:09 stale[bot]