
[Bug]: v2.4.0 datanode memory usage is too high

Open yesyue opened this issue 9 months ago • 9 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: v2.4.0
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    kafka 
- SDK version(e.g. pymilvus v2.0.0rc2): 2.7
- OS(Ubuntu or CentOS):  CentOS
- CPU/Memory: 544c /4291.6 G at least
- GPU:  0 
- Others: datanode

Current Behavior

Following the sizing tools, we allocated Data Nodes with 2 cores / 8 GB x 2 pods. In actual operation they hit OOM, and after scaling up, memory usage reached 40 GB.

Expected Behavior

Following the sizing tools, we allocated Data Nodes with 2 cores / 8 GB x 2 pods. In actual operation they hit OOM, and after scaling up, memory usage reached 40 GB.

Steps To Reproduce

Following the sizing tools, we allocated Data Nodes with 2 cores / 8 GB x 2 pods. In actual operation they hit OOM, and after scaling up, memory usage reached 40 GB.

Milvus Log

No response

Anything else?

No response

yesyue avatar Apr 29 '24 05:04 yesyue

The title and description of this issue contains Chinese. Please use English to describe your issue.

github-actions[bot] avatar Apr 29 '24 05:04 github-actions[bot]

Referring to the sizing tools, we allocated Data Nodes with 2 cores of 8 GB x 2 pods. However, during actual operation the Data Nodes hit OOM, and after scaling up, the memory usage reached 40 GB.

yesyue avatar Apr 29 '24 05:04 yesyue

datanode log:

datanode.log

yesyue avatar Apr 29 '24 05:04 yesyue

@yesyue please share more info about how you are using Milvus, e.g. what kinds of requests you sent, how many, and how frequently. Please also upload the logs of all the Milvus pods for investigation.

/assign @yesyue /unassign

yanliang567 avatar Apr 29 '24 06:04 yanliang567

100 million entities/day are written to Milvus.
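
For context, a back-of-envelope calculation of the sustained ingest rate implied by that figure (assuming a roughly even write distribution over the day):

```python
# Back-of-envelope ingest rate from the figure in this thread:
# 100 million entities per day, assumed evenly spread.
ENTITIES_PER_DAY = 100_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86_400

rate_per_second = ENTITIES_PER_DAY / SECONDS_PER_DAY
print(f"{rate_per_second:.0f} entities/s")  # ~1157 entities/s sustained
```

Bursty traffic would push the peak rate well above this average, which matters for sizing flush buffers.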

yesyue avatar Apr 29 '24 06:04 yesyue

100 million entities/day are written to Milvus.

After I inserted 10M entities in total, the Milvus Docker container stopped and crashed. I use the IVF_SQ8 index and installed Milvus with GPU. I insert in batches of 10,000 (inserting only once 10,000 entities have accumulated).

After the crash I can't connect again and can't use anything. Any solution?
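
The batching strategy described above (buffer rows and insert only once 10,000 have accumulated) can be sketched as follows. The buffering helper is hypothetical, and the commented-out `collection.insert` call is an assumption about how it would plug into pymilvus:

```python
from typing import Iterable, Iterator, List

BATCH_SIZE = 10_000  # insert only when a full batch has accumulated, as described above

def batches(rows: Iterable[dict], batch_size: int = BATCH_SIZE) -> Iterator[List[dict]]:
    """Yield fixed-size batches; a trailing partial batch is yielded at the
    end so no rows are lost when the stream stops."""
    buf: List[dict] = []
    for row in rows:
        buf.append(row)
        if len(buf) == batch_size:
            yield buf
            buf = []
    if buf:  # flush the remainder
        yield buf

# Usage sketch (the pymilvus call is an assumption, not exercised here):
# for batch in batches(row_stream):
#     collection.insert(batch)  # hypothetical Collection bound elsewhere
```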

tadinhkien99 avatar Apr 29 '24 06:04 tadinhkien99

  1. It seems that flushing cannot keep up with the writes.
  2. How many partitions do you have? If you have many partitions or collections, flush and memory consumption will be larger than the estimate.
  3. There is a bunch of configs to tune, e.g. the concurrent flush number (dataNode.dataSync.maxParallelSyncMgrTasks for 2.4) and the memory used for growing segments.
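
A minimal sketch of the corresponding override in `milvus.yaml`; the value shown is illustrative rather than a recommendation, and only `maxParallelSyncMgrTasks` is taken from the comment above:

```yaml
# Illustrative milvus.yaml override; check your Milvus 2.4 defaults before changing.
dataNode:
  dataSync:
    maxParallelSyncMgrTasks: 16   # cap concurrent flush/sync tasks to bound datanode memory
```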

xiaofan-luan avatar Apr 29 '24 14:04 xiaofan-luan

100 million entities/day are written to Milvus.

After I inserted 10M entities in total, the Milvus Docker container stopped and crashed. I use the IVF_SQ8 index and installed Milvus with GPU. I insert in batches of 10,000 (inserting only once 10,000 entities have accumulated).

After the crash I can't connect again and can't use anything. Any solution?

How much GPU memory do you have? Please open another issue with detailed logs so we can help.

xiaofan-luan avatar Apr 29 '24 14:04 xiaofan-luan

querynode (3).log

yesyue avatar May 04 '24 02:05 yesyue

querynode (3).log

1. Could you offer the datanode log? 2. It would be great if you could capture a datanode pprof, so you can see which part takes up the memory. Most likely the insert buffer is taking the memory, and you can tune the flush parameters.
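
For the pprof step, Milvus exposes Go's pprof endpoints on its metrics port (9091 by default). A sketch of capturing a datanode heap profile follows; the pod name is a placeholder, and the kubectl/curl commands are left commented out since they depend on your deployment:

```shell
# Placeholder pod name -- adjust to your deployment.
DATANODE_POD="my-release-milvus-datanode-0"
LOCAL_PORT=9091

# Forward the metrics port from the datanode pod (cluster deploy assumed):
# kubectl port-forward "pod/${DATANODE_POD}" "${LOCAL_PORT}:9091" &

# Grab a heap profile and inspect the top memory consumers:
# curl -s "http://localhost:${LOCAL_PORT}/debug/pprof/heap" -o datanode-heap.pb.gz
# go tool pprof -top datanode-heap.pb.gz

echo "heap profile endpoint: http://localhost:${LOCAL_PORT}/debug/pprof/heap"
```

If the heap profile shows most allocations in the write buffer path, that supports tuning the flush parameters mentioned above.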

xiaofan-luan avatar May 05 '24 14:05 xiaofan-luan

I've seen you in many issues and we'd like to offer help. Feel free to contact me at [email protected] if necessary.

xiaofan-luan avatar May 05 '24 14:05 xiaofan-luan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Jun 05 '24 01:06 stale[bot]