[Bug]: [benchmark][cluster][LRU] Search gets stuck during segment handoff when QueryNode memory usage is high
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: lru-dev-fabf2da87fd2310e7e2ee76992d7ebacbdffc93e
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0rc66
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
https://grafana-4am.zilliz.cc/d/uLf5cJ3Ga/milvus2-0?orgId=1&var-datasource=P1809F7CD0C75ACF3&var-cluster=&var-namespace=qa-milvus&var-instance=lru-verify-32513-yezhen&var-collection=All&var-app_name=milvus&from=1713949871696&to=1713953123466
Search gets stuck during segment handoff when the memory usage of the QueryNode is high.
Handoff gets stuck at the same time.
Expected Behavior
Search and handoff should not get stuck.
Steps To Reproduce
See #32513
Milvus Log
No response
Anything else?
No response
- When handoff happens, searching a new segment triggers a load operation in the LRU cache.
- The resource request for the load fails if loading the segment could push the QueryNode into OOM:
[2024/04/24 09:34:21.414 +00:00] [WARN] [segments/segment_loader.go:819] ["no sufficient resource to load segments"] [traceID=e8995d1b669c5d25a412f289676bc9d1] [segmentIDs="[449296678400908743]"] [error="load segment failed, OOM if load, maxSegmentSize = 122.97617435455322 MB, memUsage = 7274.746287345886 MB, predictMemUsage = 7397.722461700439 MB, totalMem = 8192 MB thresholdFactor = 0.900000"]
- The searched segment is then registered with a notifier in a waitQueue for the next check.
- All search tasks get stuck for the same reason, until the concurrency limit of the QueryNode scheduler is reached.
- No more search tasks can be executed, and the segment reference counts held on the delegator for the search tasks cannot be released, so the release step of handoff cannot be executed.
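The numbers in the log line above can be checked by hand: predictMemUsage (7397.72 MB) is memUsage plus maxSegmentSize, and it exceeds totalMem times thresholdFactor (8192 MB × 0.9 = 7372.8 MB). The following is a minimal sketch of such an admission check, assuming the comparison is "predicted usage > total × threshold" as the log fields suggest. The function name checkLoadResource is hypothetical and is not the actual code in segment_loader.go.

```go
package main

import "fmt"

// checkLoadResource is a hypothetical stand-in for the loader's admission
// check. It assumes the predicted usage is the current usage plus the size of
// the largest segment to load, compared against totalMem * thresholdFactor.
func checkLoadResource(memUsageMB, maxSegmentSizeMB, totalMemMB, thresholdFactor float64) error {
	predictMB := memUsageMB + maxSegmentSizeMB
	limitMB := totalMemMB * thresholdFactor
	if predictMB > limitMB {
		return fmt.Errorf("OOM if load: predictMemUsage = %.2f MB, limit = %.2f MB", predictMB, limitMB)
	}
	return nil
}

func main() {
	// Values copied from the WARN log line in this issue.
	err := checkLoadResource(7274.746287345886, 122.97617435455322, 8192, 0.9)
	fmt.Println("load rejected:", err != nil)
}
```

With the logged values the request is rejected, which is what sends the search task into the waitQueue.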
/assign @chyezh /unassign
Already fixed by adding fast-fail and a timeout to the resource request of lazy load.
The timeout is controlled by queryNode.lazyLoadRequestResourceTimeout, 5s by default.
If a search task is blocked on a resource request, it is terminated by this timeout and ErrServiceResourceInsufficient is returned from the search call, so handoff and other tasks can be executed.
verified by lru-verify-32663-yezhen
@wangting0128 fixed, please verify it on the latest lru-dev commit.
verification passed
image: master-20240514-f48a7ff8 argo task: lru-fouramf-wx7vx