[Bug]: queryNode is still OOMKilled even though load fails with "load segment failed, OOM if load"
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: PR-32529-20240423-896bc75cf
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
- Deploy Milvus with 4 queryNodes:
Limits:
cpu: 8
memory: 12Gi
Requests:
cpu: 4
memory: 12Gi
- Load the collection; the call returns an error:
c.load()
RPC error: [get_loading_progress], <MilvusException: (code=65535, message=show collection failed: load segment failed, OOM if load, maxSegmentSize = 601.2391881942749 MB, concurrency = 1, memUsage = 10549.75390625 MB, predictMemUsage = 11150.993094444275 MB, totalMem = 12288 MB thresholdFactor = 0.900000)>, <Time:{'RPC start': '2024-04-25 21:40:39.503093', 'RPC error': '2024-04-25 21:40:39.514397'}>
RPC error: [wait_for_loading_collection], <MilvusException: (code=65535, message=show collection failed: load segment failed, OOM if load, maxSegmentSize = 601.2391881942749 MB, concurrency = 1, memUsage = 10549.75390625 MB, predictMemUsage = 11150.993094444275 MB, totalMem = 12288 MB thresholdFactor = 0.900000)>, <Time:{'RPC start': '2024-04-25 21:29:55.900691', 'RPC error': '2024-04-25 21:40:39.514693'}>
RPC error: [load_collection], <MilvusException: (code=65535, message=show collection failed: load segment failed, OOM if load, maxSegmentSize = 601.2391881942749 MB, concurrency = 1, memUsage = 10549.75390625 MB, predictMemUsage = 11150.993094444275 MB, totalMem = 12288 MB thresholdFactor = 0.900000)>, <Time:{'RPC start': '2024-04-25 21:29:55.801436', 'RPC error': '2024-04-25 21:40:39.514858'}>
Traceback (most recent call last):
File "/home/zong/Downloads/pycharm-community-2023.2.5/plugins/python-ce/helpers/pydev/pydevconsole.py", line 364, in runcode
coro = func()
File "<input>", line 1, in <module>
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 419, in load
**kwargs,
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 135, in handler
@functools.wraps(func)
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 131, in handler
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 170, in handler
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 110, in handler
if (
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 74, in handler
"""
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 1143, in load_collection
if not _async:
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 135, in handler
@functools.wraps(func)
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 131, in handler
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 170, in handler
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 110, in handler
if (
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 74, in handler
"""
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 1163, in wait_for_loading_collection
while can_loop(time.time()):
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 135, in handler
@functools.wraps(func)
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 131, in handler
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 170, in handler
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 110, in handler
if (
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 74, in handler
"""
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 1262, in get_loading_progress
response = self._stub.GetLoadingProgress.future(request, timeout=timeout).result()
File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/client/utils.py", line 60, in check_status
raise MilvusException(status.code, status.reason, status.error_code)
pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=show collection failed: load segment failed, OOM if load, maxSegmentSize = 601.2391881942749 MB, concurrency = 1, memUsage = 10549.75390625 MB, predictMemUsage = 11150.993094444275 MB, totalMem = 12288 MB thresholdFactor = 0.900000)>
- Three queryNodes were OOMKilled:
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-88bjt 1/1 Running 152 (6h59m ago) 29h
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-9cgzt 1/1 Running 138 (11m ago) 29h
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-phfrq 1/1 Running 137 (11m ago) 29h
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-v4jhs 1/1 Running 121 (19m ago) 28h
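The numbers in the error message above can be checked by hand. The following is a minimal sketch, not Milvus code: it assumes (inferred only from the reported values, not from the Milvus source) that `predictMemUsage` is `memUsage + maxSegmentSize * concurrency` and that the load is rejected when this exceeds `totalMem * thresholdFactor`. The helper name `parse_fields` is hypothetical.

```python
import re

# One of the error messages quoted above, trimmed to its numeric fields.
ERROR = (
    "load segment failed, OOM if load, maxSegmentSize = 601.2391881942749 MB, "
    "concurrency = 1, memUsage = 10549.75390625 MB, "
    "predictMemUsage = 11150.993094444275 MB, totalMem = 12288 MB "
    "thresholdFactor = 0.900000"
)


def parse_fields(message: str) -> dict:
    """Extract every `name = number` pair from the message (hypothetical helper)."""
    return {name: float(value) for name, value in re.findall(r"(\w+) = ([0-9.]+)", message)}


fields = parse_fields(ERROR)

# Assumption: the guard compares the predicted usage with totalMem * thresholdFactor.
limit = fields["totalMem"] * fields["thresholdFactor"]

# Assumption: the prediction is current usage plus the segments loaded concurrently.
predicted = fields["memUsage"] + fields["maxSegmentSize"] * fields["concurrency"]

print(f"limit     = {limit:.2f} MB")
print(f"predicted = {predicted:.2f} MB")
print(f"rejected  = {predicted > limit}")
print(f"usage / totalMem = {fields['memUsage'] / fields['totalMem']:.1%}")
```

Under these assumptions the recomputed prediction matches the reported `predictMemUsage`, and it exceeds the limit, so the rejection itself is consistent. Note that `memUsage` is already above 85% of `totalMem` when the check fires, which leaves little room for any allocation that the check does not cover.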
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
- argo workflow before load: https://argo-workflows.zilliz.cc/archived-workflows/qa/a6d806bb-82e5-44ba-9772-6a8b2904668e?nodeId=laion1b-debug-issue-1
- partial logs: loki logs of laion-issue-debug
- pods:
laion-issue-debug-etcd-0 1/1 Running 0 2d1h 10.104.16.159 4am-node21 <none> <none>
laion-issue-debug-etcd-1 1/1 Running 0 2d1h 10.104.34.252 4am-node37 <none> <none>
laion-issue-debug-etcd-2 1/1 Running 0 2d1h 10.104.19.6 4am-node28 <none> <none>
laion-issue-debug-milvus-datanode-f9899bb4f-f75f4 1/1 Running 36 (32h ago) 45h 10.104.29.96 4am-node35 <none> <none>
laion-issue-debug-milvus-datanode-f9899bb4f-lwwlj 1/1 Running 0 46h 10.104.20.171 4am-node22 <none> <none>
laion-issue-debug-milvus-indexnode-65556ccb97-6th95 1/1 Running 6 (33h ago) 45h 10.104.26.103 4am-node32 <none> <none>
laion-issue-debug-milvus-indexnode-65556ccb97-k4ttg 1/1 Running 0 2d1h 10.104.30.115 4am-node38 <none> <none>
laion-issue-debug-milvus-mixcoord-94b949585-vxrvk 1/1 Running 0 46h 10.104.20.172 4am-node22 <none> <none>
laion-issue-debug-milvus-proxy-5bcbc8c9c7-zt76l 1/1 Running 0 2d1h 10.104.1.141 4am-node10 <none> <none>
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-88bjt 1/1 Running 152 (7h6m ago) 29h 10.104.13.47 4am-node16 <none> <none>
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-9cgzt 1/1 Running 138 (19m ago) 29h 10.104.5.202 4am-node12 <none> <none>
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-phfrq 1/1 Running 137 (18m ago) 29h 10.104.14.36 4am-node18 <none> <none>
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-v4jhs 1/1 Running 121 (26m ago) 28h 10.104.16.94 4am-node21 <none> <none>
laion-issue-debug-minio-0 1/1 Running 0 2d1h 10.104.16.162 4am-node21 <none> <none>
laion-issue-debug-minio-1 1/1 Running 0 2d1h 10.104.34.3 4am-node37 <none> <none>
laion-issue-debug-minio-2 1/1 Running 0 2d1h 10.104.19.7 4am-node28 <none> <none>
laion-issue-debug-minio-3 1/1 Running 1 (40h ago) 2d1h 10.104.28.53 4am-node33 <none> <none>
laion-issue-debug-pulsar-bookie-0 1/1 Running 0 2d1h 10.104.16.165 4am-node21 <none> <none>
laion-issue-debug-pulsar-bookie-1 1/1 Running 0 2d1h 10.104.34.5 4am-node37 <none> <none>
laion-issue-debug-pulsar-bookie-2 1/1 Running 0 2d1h 10.104.19.11 4am-node28 <none> <none>
laion-issue-debug-pulsar-bookie-init-t5gbg 0/1 Completed 0 2d1h 10.104.13.195 4am-node16 <none> <none>
laion-issue-debug-pulsar-broker-0 1/1 Running 1 (7h6m ago) 2d1h 10.104.13.55 4am-node16 <none> <none>
laion-issue-debug-pulsar-proxy-0 1/1 Running 0 2d1h 10.104.16.139 4am-node21 <none> <none>
laion-issue-debug-pulsar-pulsar-init-9sqzd 0/1 Completed 0 2d1h 10.104.13.196 4am-node16 <none> <none>
laion-issue-debug-pulsar-recovery-0 1/1 Running 0 45h 10.104.4.58 4am-node11 <none> <none>
laion-issue-debug-pulsar-zookeeper-0 1/1 Running 0 2d1h 10.104.16.166 4am-node21 <none> <none>
laion-issue-debug-pulsar-zookeeper-1 1/1 Running 0 2d1h 10.104.27.78 4am-node31 <none> <none>
laion-issue-debug-pulsar-zookeeper-2 1/1 Running 0 2d1h 10.104.28.60 4am-node33 <none> <none>
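To compare restart counts across components, the pod listing above can be parsed with a few lines of Python. This is a sketch under one assumption: the listing has no header row, so the column order is taken to be NAME, READY, STATUS, RESTARTS, followed by the remaining columns. The function name `restart_counts` is hypothetical, and only a subset of the rows is copied in.

```python
# A subset of the rows from the pod listing above, copied verbatim.
POD_ROWS = """\
laion-issue-debug-milvus-datanode-f9899bb4f-f75f4 1/1 Running 36 (32h ago) 45h 10.104.29.96 4am-node35 <none> <none>
laion-issue-debug-milvus-mixcoord-94b949585-vxrvk 1/1 Running 0 46h 10.104.20.172 4am-node22 <none> <none>
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-88bjt 1/1 Running 152 (7h6m ago) 29h 10.104.13.47 4am-node16 <none> <none>
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-9cgzt 1/1 Running 138 (19m ago) 29h 10.104.5.202 4am-node12 <none> <none>
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-phfrq 1/1 Running 137 (18m ago) 29h 10.104.14.36 4am-node18 <none> <none>
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-v4jhs 1/1 Running 121 (26m ago) 28h 10.104.16.94 4am-node21 <none> <none>
"""


def restart_counts(rows: str) -> dict:
    """Map pod name to restart count (assumes RESTARTS is the fourth column)."""
    counts = {}
    for line in rows.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        counts[parts[0]] = int(parts[3])
    return counts


counts = restart_counts(POD_ROWS)
querynodes = {name: n for name, n in counts.items() if "querynode" in name}

for name, n in sorted(querynodes.items(), key=lambda item: -item[1]):
    print(f"{n:4d}  {name}")
print("total queryNode restarts:", sum(querynodes.values()))
```

The restart count alone does not say why a container restarted; confirming OOMKilled as the reason requires the container's last termination state.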
Anything else?
No response
/assign @longjiquan /unassign
/assign @xiaocai2333 /unassign @longjiquan
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.