
search timeout for a while

Open dzqoo opened this issue 1 year ago • 4 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.2.2
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS): CentOS
- CPU/Memory: 6C / 32 GB
- GPU:
- Others:

Current Behavior

Search requests time out for about 3 minutes when my cluster reaches its business peak period.

  1. Here is the Search Latency dashboard: (screenshot)
  2. Here are parts of the querynode metrics; the Search Segment Latency looks quite good: (screenshot)
  3. Here is the Search Group NQ dashboard, which spikes at the same time: (screenshot)

This cluster is my online environment, so I hope to get a solution and the root cause as soon as possible. Thanks a lot~
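For context, the searches are issued with pymilvus roughly like this (a simplified sketch; the endpoint, collection name, field name, and parameters below are placeholders, not the real ones):

```python
# Simplified sketch of the kind of search we run (placeholder names and params).
from pymilvus import connections, Collection

connections.connect(alias="default", host="milvus-proxy-host", port="19530")  # placeholder endpoint

collection = Collection("my_collection")  # placeholder collection name
collection.load()

results = collection.search(
    data=[[0.1] * 128],                                     # placeholder query vectors, dim 128
    anns_field="embedding",                                 # placeholder vector field name
    param={"metric_type": "L2", "params": {"nprobe": 16}},  # placeholder search params
    limit=10,
    timeout=180,                                            # explicit client-side timeout, in seconds
)
print(results)
```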

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

Uploading milvus.log.tar.gz…

Anything else?

No response

dzqoo avatar May 12 '23 04:05 dzqoo

  1. Here are parts of the runtime metrics: (screenshots)

dzqoo avatar May 12 '23 04:05 dzqoo

  1. Here are parts of the runtime metrics: (screenshots)

If you look at your metrics, it seems the queue time is very large, which triggered the search merge logic.

Did you have a chance to try 2.2.8? I remember we have some fixes to the search merge logic.
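Roughly speaking, when requests pile up in the queue, the querynode merges them into larger grouped searches so each segment is scanned once per group instead of once per request. Here is a minimal illustrative sketch of the idea only (the real querynode code is Go inside Milvus; the names and the group-size cap below are made up):

```python
# Illustrative sketch of queue-time-driven request grouping.
# NOT the actual Milvus querynode code; names and limits are invented.
from dataclasses import dataclass
from typing import List

@dataclass
class SearchRequest:
    nq: int               # number of query vectors in this request
    vectors: List[list]   # the query vectors themselves

MAX_GROUP_NQ = 1024       # hypothetical cap on the merged group size

def group_pending_requests(queue: List[SearchRequest]) -> List[List[SearchRequest]]:
    """Merge queued requests (same collection and search params assumed) into groups.
    The longer requests wait in the queue, the more of them end up in one group,
    so the grouped NQ, and the cost of the final reduce, grows during peaks."""
    groups, current, current_nq = [], [], 0
    for req in queue:
        if current and current_nq + req.nq > MAX_GROUP_NQ:
            groups.append(current)
            current, current_nq = [], 0
        current.append(req)
        current_nq += req.nq
    if current:
        groups.append(current)
    return groups
```

A long queue therefore means larger groups, a larger grouped NQ, and a heavier reduce step, which would line up with the Search Group NQ spikes in your dashboard.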

xiaofan-luan avatar May 12 '23 06:05 xiaofan-luan

  1. Here are parts of the runtime metrics: (screenshots)

If you look at your metrics, it seems the queue time is very large, which triggered the search merge logic.

Did you have a chance to try 2.2.8? I remember we have some fixes to the search merge logic.

Can I know which specific merge logic is causing it?

dzqoo avatar May 12 '23 06:05 dzqoo

/assign @liliu-z

yanliang567 avatar May 12 '23 06:05 yanliang567

I encountered this problem again today .><. Searches time out for 2 minutes, and at the same time the "search reduce latency" metric is quite high, as you can see in the following screenshot. I would like to know whether this problem has been resolved in the latest version (2.2.8 or 2.2.9). Looking forward to your reply~ (screenshot)

dzqoo avatar Jun 06 '23 14:06 dzqoo

I found that at the same time, the goroutine count of this querynode is pretty high, as you can see: (screenshot)

dzqoo avatar Jun 30 '23 07:06 dzqoo

I encountered this issue again after upgrading to version 2.2.9... (screenshot)

dzqoo avatar Jul 10 '23 02:07 dzqoo

Any insertion or delete requests at that time? Could you please share the CPU usage of the querynodes?
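If it is hard to read off the Grafana panels, a rough sketch like the following can pull per-pod CPU directly from Prometheus (the Prometheus URL and the pod-name pattern are assumptions that depend on how your cluster is deployed):

```python
# Rough sketch: query Prometheus for per-pod querynode CPU usage.
# The Prometheus endpoint and the pod-name regex are assumptions; adjust to your deployment.
import requests

PROMETHEUS = "http://prometheus.example:9090"  # placeholder Prometheus endpoint
QUERY = 'sum by (pod) (rate(container_cpu_usage_seconds_total{pod=~"milvus-querynode.*"}[5m]))'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for item in resp.json()["data"]["result"]:
    print(item["metric"].get("pod"), item["value"][1])
```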

yanliang567 avatar Jul 10 '23 02:07 yanliang567

I have no insertion or delete requests at that time. The CPU usage is shown in the attached screenshots. (screenshots)

dzqoo avatar Jul 10 '23 02:07 dzqoo

Here is the "Search Group NQ" dashboard at the same time: (screenshot)

dzqoo avatar Jul 10 '23 07:07 dzqoo

Seems that the NQ group size increased during the business peak period. Is this a known issue that we are trying to fix? @liliu-z

yanliang567 avatar Jul 10 '23 08:07 yanliang567

Hi @dzqoo, it looks like there is no QPS bump during this time, and from the info we have right now I don't have many ideas. Can you screenshot every graph that looks abnormal during that time? Also, can you check whether there were any growing segments at that time? Thanks
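To check for growing segments, something like this pymilvus sketch should work (the connection details and collection name are placeholders, and the field names follow the QuerySegmentInfo response, so they may differ slightly across client versions):

```python
# Sketch: list the query segments and their states for one collection,
# to see whether any growing segments are being searched.
from collections import Counter
from pymilvus import connections, utility

connections.connect(host="milvus-proxy-host", port="19530")  # placeholder endpoint

segments = utility.get_query_segment_info("my_collection")   # placeholder collection name
print("segments by state:", dict(Counter(str(seg.state) for seg in segments)))
for seg in segments:
    print(seg.segmentID, seg.state, seg.num_rows)
```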

liliu-z avatar Jul 11 '23 03:07 liliu-z

More info is needed, especially for the querynodes. Also, in the resource usage graphs you provided (CPU, memory), I didn't see the querynodes included. Can you provide some info about this? Thanks

liliu-z avatar Jul 11 '23 03:07 liliu-z

(screenshots of the full set of metrics dashboards)

dzqoo avatar Jul 11 '23 08:07 dzqoo

(screenshots of the full set of metrics dashboards)

These are all of the metrics. FYI~

dzqoo avatar Jul 11 '23 08:07 dzqoo

@dzqoo Can I get more info, like:

  1. How much data do you have, and at what dimension?
  2. How many querynodes do you have, and what type are they?
  3. How many collections do you have?
  4. How many segments are there for each collection and each querynode? On the Grafana page all the data is stacked together, which makes it hard to tell (a rough tallying sketch follows below). Thanks
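For item 4, a quick tally straight from the API may be clearer than the stacked Grafana panel. A rough sketch (connection details are placeholders; `nodeIds` follows the 2.2.x QuerySegmentInfo response, older clients expose a single `nodeID`):

```python
# Sketch: count loaded segments per (collection, querynode) via pymilvus.
from collections import Counter
from pymilvus import connections, utility

connections.connect(host="milvus-proxy-host", port="19530")  # placeholder endpoint

tally = Counter()
for name in utility.list_collections():
    try:
        segments = utility.get_query_segment_info(name)
    except Exception:
        continue  # skip collections that are not loaded
    for seg in segments:
        for node_id in seg.nodeIds:  # querynode(s) serving this segment
            tally[(name, node_id)] += 1

for (collection, node_id), count in sorted(tally.items()):
    print(f"{collection}\tquerynode {node_id}\t{count} segments")
```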

liliu-z avatar Jul 12 '23 06:07 liliu-z

@dzqoo Can I get more info, like:

  1. How much data do you have, and at what dimension?
  2. How many querynodes do you have, and what type are they?
  3. How many collections do you have?
  4. How many segments are there for each collection and each querynode? On the Grafana page all the data is stacked together, which makes it hard to tell. Thanks

  1. We have ~1 billion vectors across about 10 collections, with dimensions ranging from 128 to 768;
  2. We have 15 querynodes, each with 6 CPU cores and 32 GB of memory;
  3. The segments on each querynode are shown in the screenshot: (screenshot) @liliu-z

dzqoo avatar Jul 12 '23 07:07 dzqoo

@dzqoo Can I have the segment distribution of the specific collection you queried? Thanks! Also, what dim is that collection, and how many rows does it have?

liliu-z avatar Jul 13 '23 09:07 liliu-z

@dzqoo Can I have the segment distribution of the specific collection you queried? Thanks! Also, what dim is that collection, and how many rows does it have?

Searches on all of the collections have timed out... @liliu-z

dzqoo avatar Jul 14 '23 02:07 dzqoo

Could you try the latest version? I fixed many performance issues in the last 6 months. We'd like to set up a meeting if the newest version is still not stable, but we probably don't want to spend time investigating stale issues.

xiaofan-luan avatar Jul 15 '23 09:07 xiaofan-luan

2.2.11 should be at least 50% faster than 2.2.2, and we have solved many balancing and stability issues.

xiaofan-luan avatar Jul 15 '23 09:07 xiaofan-luan

Thank you for your reply~ And when will version 2.3 come out? I want to go directly to version 2.3.

dzqoo avatar Jul 17 '23 02:07 dzqoo

It will be released next week ~

xiaofan-luan avatar Jul 17 '23 09:07 xiaofan-luan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Sep 04 '23 06:09 stale[bot]