[Bug]: queryNode is still OOM-killed even though load fails with "load segment failed, OOM if load"

Open ThreadDao opened this issue 1 year ago • 2 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: PR-32529-20240423-896bc75cf
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. Deploy Milvus with 4 queryNodes, each with the following resources:
    Limits:
      cpu:     8
      memory:  12Gi
    Requests:
      cpu:      4
      memory:   12Gi
  2. Load the collection; the call returns an error:
c.load()
RPC error: [get_loading_progress], <MilvusException: (code=65535, message=show collection failed: load segment failed, OOM if load, maxSegmentSize = 601.2391881942749 MB, concurrency = 1, memUsage = 10549.75390625 MB, predictMemUsage = 11150.993094444275 MB, totalMem = 12288 MB thresholdFactor = 0.900000)>, <Time:{'RPC start': '2024-04-25 21:40:39.503093', 'RPC error': '2024-04-25 21:40:39.514397'}>
RPC error: [wait_for_loading_collection], <MilvusException: (code=65535, message=show collection failed: load segment failed, OOM if load, maxSegmentSize = 601.2391881942749 MB, concurrency = 1, memUsage = 10549.75390625 MB, predictMemUsage = 11150.993094444275 MB, totalMem = 12288 MB thresholdFactor = 0.900000)>, <Time:{'RPC start': '2024-04-25 21:29:55.900691', 'RPC error': '2024-04-25 21:40:39.514693'}>
RPC error: [load_collection], <MilvusException: (code=65535, message=show collection failed: load segment failed, OOM if load, maxSegmentSize = 601.2391881942749 MB, concurrency = 1, memUsage = 10549.75390625 MB, predictMemUsage = 11150.993094444275 MB, totalMem = 12288 MB thresholdFactor = 0.900000)>, <Time:{'RPC start': '2024-04-25 21:29:55.801436', 'RPC error': '2024-04-25 21:40:39.514858'}>
Traceback (most recent call last):
  File "/home/zong/Downloads/pycharm-community-2023.2.5/plugins/python-ce/helpers/pydev/pydevconsole.py", line 364, in runcode
    coro = func()
  File "<input>", line 1, in <module>
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 419, in load
    **kwargs,
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 135, in handler
    @functools.wraps(func)
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 131, in handler
    
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 170, in handler
    
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 110, in handler
    if (
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 74, in handler
    """
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 1143, in load_collection
    if not _async:
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 135, in handler
    @functools.wraps(func)
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 131, in handler
    
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 170, in handler
    
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 110, in handler
    if (
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 74, in handler
    """
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 1163, in wait_for_loading_collection
    while can_loop(time.time()):
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 135, in handler
    @functools.wraps(func)
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 131, in handler
    
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 170, in handler
    
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 110, in handler
    if (
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/decorators.py", line 74, in handler
    """
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 1262, in get_loading_progress
    response = self._stub.GetLoadingProgress.future(request, timeout=timeout).result()
  File "/home/zong/zong/projects/milvus/tests20/python_client/venv/lib/python3.8/site-packages/pymilvus/client/utils.py", line 60, in check_status
    raise MilvusException(status.code, status.reason, status.error_code)
pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=show collection failed: load segment failed, OOM if load, maxSegmentSize = 601.2391881942749 MB, concurrency = 1, memUsage = 10549.75390625 MB, predictMemUsage = 11150.993094444275 MB, totalMem = 12288 MB thresholdFactor = 0.900000)>
  3. Three queryNodes are OOM-killed:
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-88bjt             1/1     Running                       152 (6h59m ago)   29h
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-9cgzt             1/1     Running                       138 (11m ago)     29h
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-phfrq             1/1     Running                       137 (11m ago)     29h
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-v4jhs             1/1     Running                       121 (19m ago)     28h
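
For readers checking the numbers in the error above, here is a minimal sketch that parses the message and recomputes the memory check. The rule used here (`memUsage + maxSegmentSize * concurrency` compared against `totalMem * thresholdFactor`) is an assumption inferred from the reported values, not taken from the Milvus source. `parse_fields` and `would_exceed` are hypothetical helper names used only for illustration.

```python
import re

# Error text copied from the report above.
ERROR = (
    "load segment failed, OOM if load, maxSegmentSize = 601.2391881942749 MB, "
    "concurrency = 1, memUsage = 10549.75390625 MB, "
    "predictMemUsage = 11150.993094444275 MB, totalMem = 12288 MB "
    "thresholdFactor = 0.900000"
)


def parse_fields(message: str) -> dict:
    """Extract every `name = number` pair from the message as floats."""
    return {k: float(v) for k, v in re.findall(r"(\w+) = ([\d.]+)", message)}


def would_exceed(fields: dict) -> bool:
    """Assumed rule: refuse the load if predicted usage exceeds totalMem * thresholdFactor."""
    predicted = fields["memUsage"] + fields["maxSegmentSize"] * fields["concurrency"]
    return predicted > fields["totalMem"] * fields["thresholdFactor"]


fields = parse_fields(ERROR)
limit_mb = fields["totalMem"] * fields["thresholdFactor"]
print(f"limit = {limit_mb:.1f} MB, reported prediction = {fields['predictMemUsage']:.1f} MB")
print("load refused under the assumed rule:", would_exceed(fields))
```

With these values the limit is 12288 × 0.9 = 11059.2 MB, and the reported `memUsage` of about 10550 MB already sits within roughly 510 MB of it, less than one 601 MB segment. That is consistent with the load being refused, but it does not explain why the queryNodes are still OOM-killed afterwards, which is the subject of this report.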

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

  • argo workflow before load: https://argo-workflows.zilliz.cc/archived-workflows/qa/a6d806bb-82e5-44ba-9772-6a8b2904668e?nodeId=laion1b-debug-issue-1
  • partial logs: loki logs of laion-issue-debug
  • pods:
laion-issue-debug-etcd-0                                          1/1     Running                       0                 2d1h    10.104.16.159   4am-node21   <none>           <none>
laion-issue-debug-etcd-1                                          1/1     Running                       0                 2d1h    10.104.34.252   4am-node37   <none>           <none>
laion-issue-debug-etcd-2                                          1/1     Running                       0                 2d1h    10.104.19.6     4am-node28   <none>           <none>
laion-issue-debug-milvus-datanode-f9899bb4f-f75f4                 1/1     Running                       36 (32h ago)      45h     10.104.29.96    4am-node35   <none>           <none>
laion-issue-debug-milvus-datanode-f9899bb4f-lwwlj                 1/1     Running                       0                 46h     10.104.20.171   4am-node22   <none>           <none>
laion-issue-debug-milvus-indexnode-65556ccb97-6th95               1/1     Running                       6 (33h ago)       45h     10.104.26.103   4am-node32   <none>           <none>
laion-issue-debug-milvus-indexnode-65556ccb97-k4ttg               1/1     Running                       0                 2d1h    10.104.30.115   4am-node38   <none>           <none>
laion-issue-debug-milvus-mixcoord-94b949585-vxrvk                 1/1     Running                       0                 46h     10.104.20.172   4am-node22   <none>           <none>
laion-issue-debug-milvus-proxy-5bcbc8c9c7-zt76l                   1/1     Running                       0                 2d1h    10.104.1.141    4am-node10   <none>           <none>
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-88bjt             1/1     Running                       152 (7h6m ago)    29h     10.104.13.47    4am-node16   <none>           <none>
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-9cgzt             1/1     Running                       138 (19m ago)     29h     10.104.5.202    4am-node12   <none>           <none>
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-phfrq             1/1     Running                       137 (18m ago)     29h     10.104.14.36    4am-node18   <none>           <none>
laion-issue-debug-milvus-querynode-1-7cc9f6bb79-v4jhs             1/1     Running                       121 (26m ago)     28h     10.104.16.94    4am-node21   <none>           <none>
laion-issue-debug-minio-0                                         1/1     Running                       0                 2d1h    10.104.16.162   4am-node21   <none>           <none>
laion-issue-debug-minio-1                                         1/1     Running                       0                 2d1h    10.104.34.3     4am-node37   <none>           <none>
laion-issue-debug-minio-2                                         1/1     Running                       0                 2d1h    10.104.19.7     4am-node28   <none>           <none>
laion-issue-debug-minio-3                                         1/1     Running                       1 (40h ago)       2d1h    10.104.28.53    4am-node33   <none>           <none>
laion-issue-debug-pulsar-bookie-0                                 1/1     Running                       0                 2d1h    10.104.16.165   4am-node21   <none>           <none>
laion-issue-debug-pulsar-bookie-1                                 1/1     Running                       0                 2d1h    10.104.34.5     4am-node37   <none>           <none>
laion-issue-debug-pulsar-bookie-2                                 1/1     Running                       0                 2d1h    10.104.19.11    4am-node28   <none>           <none>
laion-issue-debug-pulsar-bookie-init-t5gbg                        0/1     Completed                     0                 2d1h    10.104.13.195   4am-node16   <none>           <none>
laion-issue-debug-pulsar-broker-0                                 1/1     Running                       1 (7h6m ago)      2d1h    10.104.13.55    4am-node16   <none>           <none>
laion-issue-debug-pulsar-proxy-0                                  1/1     Running                       0                 2d1h    10.104.16.139   4am-node21   <none>           <none>
laion-issue-debug-pulsar-pulsar-init-9sqzd                        0/1     Completed                     0                 2d1h    10.104.13.196   4am-node16   <none>           <none>
laion-issue-debug-pulsar-recovery-0                               1/1     Running                       0                 45h     10.104.4.58     4am-node11   <none>           <none>
laion-issue-debug-pulsar-zookeeper-0                              1/1     Running                       0                 2d1h    10.104.16.166   4am-node21   <none>           <none>
laion-issue-debug-pulsar-zookeeper-1                              1/1     Running                       0                 2d1h    10.104.27.78    4am-node31   <none>           <none>
laion-issue-debug-pulsar-zookeeper-2                              1/1     Running                       0                 2d1h    10.104.28.60    4am-node33   <none>           <none>

Anything else?

No response

ThreadDao • Apr 25 '24 13:04

/assign @longjiquan
/unassign

yanliang567 • Apr 26 '24 01:04

/assign @xiaocai2333
/unassign @longjiquan

yanliang567 • Apr 26 '24 01:04

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] • Jun 16 '24 04:06