
[Bug]: [benchmark][cluster][LRU] Search gets stuck during segment handoff when QueryNode memory usage is high.

Open chyezh opened this issue 9 months ago • 2 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: lru-dev-fabf2da87fd2310e7e2ee76992d7ebacbdffc93e
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0rc66
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

https://grafana-4am.zilliz.cc/d/uLf5cJ3Ga/milvus2-0?orgId=1&var-datasource=P1809F7CD0C75ACF3&var-cluster=&var-namespace=qa-milvus&var-instance=lru-verify-32513-yezhen&var-collection=All&var-app_name=milvus&from=1713949871696&to=1713953123466

Search gets stuck during segment handoff when QueryNode memory usage is high. [image]

Handoff gets stuck at the same time. [image]

Expected Behavior

Search and handoff should not get stuck.

Steps To Reproduce

See #32513

Milvus Log

No response

Anything else?

No response

chyezh avatar Apr 28 '24 06:04 chyezh

  • When handoff happens, a search on the new segment triggers a load operation in the LRU cache.
  • The resource request for that load fails because loading the segment could push the QueryNode into OOM:
[2024/04/24 09:34:21.414 +00:00] [WARN] [segments/segment_loader.go:819] ["no sufficient resource to load segments"] [traceID=e8995d1b669c5d25a412f289676bc9d1] [segmentIDs="[449296678400908743]"] [error="load segment failed, OOM if load, maxSegmentSize = 122.97617435455322 MB,  memUsage = 7274.746287345886 MB, predictMemUsage = 7397.722461700439 MB, totalMem = 8192 MB thresholdFactor = 0.900000"]
  • The searched segment is then registered with a notifier in a wait queue, to be checked again later.
  • All search tasks get stuck for the same reason, so the concurrency limit of the QueryNode scheduler is reached.
  • No more search tasks can be executed, and the segment reference counts held on the delegator by the stuck search tasks cannot be released, so the release step of handoff cannot be executed.
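
The numbers in the log line above can be reproduced with a small calculation. The following is only a sketch: it assumes the check is "predicted usage (current usage plus the segment size) must stay under totalMem × thresholdFactor", which matches the figures in the log but is not copied from segment_loader.go. The function name `wouldExceedMemory` is hypothetical.

```go
package main

import "fmt"

// wouldExceedMemory is a hypothetical helper that mirrors the arithmetic
// visible in the log line. It is an illustration, not the Milvus code.
// Assumption: the load is rejected when memUsage + maxSegmentSize is
// greater than totalMem * thresholdFactor.
func wouldExceedMemory(memUsageMB, maxSegmentSizeMB, totalMemMB, thresholdFactor float64) (predictMB, limitMB float64, exceeds bool) {
	predictMB = memUsageMB + maxSegmentSizeMB
	limitMB = totalMemMB * thresholdFactor
	return predictMB, limitMB, predictMB > limitMB
}

func main() {
	// Values taken from the log line in this issue.
	predict, limit, exceeds := wouldExceedMemory(7274.746287345886, 122.97617435455322, 8192, 0.9)
	fmt.Printf("predict=%.2f MB limit=%.2f MB exceeds=%v\n", predict, limit, exceeds)
}
```

With the logged values, the predicted usage (about 7397.72 MB) is above 90% of 8192 MB (7372.8 MB), so the load request is rejected even though the segment itself is only about 123 MB.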

chyezh avatar Apr 28 '24 08:04 chyezh

/assign @chyezh /unassign

yanliang567 avatar Apr 28 '24 09:04 yanliang567

Already fixed by adding fast-fail and a timeout to the lazy-load resource request. The timeout is controlled by queryNode.lazyLoadRequestResourceTimeout, 5s by default. If a search task is blocked on a resource request, it is terminated by this timeout and ErrServiceResourceInsufficient is returned from the search call, so handoff and other tasks can be executed.
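
For readers unfamiliar with the pattern, a bounded wait on a resource request can be sketched as below. This is a minimal illustration and not the actual fix: `requestResource`, `runDemo`, the `granted` channel, and the local `errResourceInsufficient` value are all hypothetical stand-ins, and the 50ms timeout is only there to keep the demo fast (the real default mentioned above is 5s).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errResourceInsufficient is a local stand-in for the
// ErrServiceResourceInsufficient error mentioned in this comment.
var errResourceInsufficient = errors.New("service resource insufficient")

// requestResource (hypothetical) waits until memory is granted or the
// timeout elapses, so a blocked search task fails instead of waiting forever.
func requestResource(ctx context.Context, granted <-chan struct{}, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	select {
	case <-granted:
		return nil
	case <-ctx.Done():
		return errResourceInsufficient
	}
}

// runDemo simulates one search task; memoryAvailable decides whether the
// grant is ever signalled.
func runDemo(memoryAvailable bool) error {
	granted := make(chan struct{})
	if memoryAvailable {
		close(granted) // a closed channel is immediately readable
	}
	// Generous timeout for the granted case, short one for the blocked case.
	timeout := 50 * time.Millisecond
	if memoryAvailable {
		timeout = 5 * time.Second
	}
	return requestResource(context.Background(), granted, timeout)
}

func main() {
	fmt.Println("memory available:", runDemo(true))
	fmt.Println("memory exhausted:", runDemo(false))
}
```

Because the blocked task returns an error after the timeout, it gives back its scheduler slot and its segment references, which is what lets handoff proceed in the scenario described earlier in this thread.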

Verified by lru-verify-32663-yezhen. [image] [image]

@wangting0128 fixed, please verify it on the latest lru-dev commit.

chyezh avatar May 06 '24 07:05 chyezh

verification passed

image: master-20240514-f48a7ff8 argo task: lru-fouramf-wx7vx

Screenshot 2024-05-15 10 46 19 Screenshot 2024-05-15 10 44 16

wangting0128 avatar May 15 '24 02:05 wangting0128